Under the AudioNet planning grant, researchers at the International Computer Science Institute set out to determine what types of data are most needed by those working on automatic analysis of environmental audio in order to move their field forward and explore new approaches. In addition to conducting surveys and informal conversations with audio researchers about needs and priorities, we analyzed requirements for video retrieval for use in the sciences. Finally, we experimented with procedures for potentially creating a large-scale human-annotated audio dataset based on the open-source YFCC100M video dataset.
As part of the needs analysis, PI Friedland co-authored a paper titled "Audition for Multimedia Computing", published in 2018 in Frontiers of Multimedia Research. This paper defines and sets the stage for the developing field of Computer Audition; it suggests that we need to look beyond simply describing and classifying sounds and instead work towards systems that can make actionable inferences about the world based on sound (e.g., identifying what probably happened to cause a sound). In the process, the authors analyze what kinds of audio datasets will be needed to achieve those goals and to evaluate success, emphasizing the need to move beyond simple categorization to structured representation.
In addition, we collaborated with other multimedia researchers on a generalized framework for multimedia big data studies (MMBDS). The goal of this framework is to allow natural scientists, social scientists, and data scientists to quickly retrieve and filter videos and images relevant to their research from the YFCC100M. Our work with scientists to lay out the requirements and structure for this framework served in part as a needs assessment for AudioNet as well. It provided us with the opportunity to examine what kinds of strong audio annotations could help audio researchers provide the most useful (and intuitive) content-based retrieval models for the framework.
For our pilot experiments, we focused on the trade-offs involved in providing precise timepoints for each labeled sound in a video, using both more and less constrained labeling schemes. (There is currently no large, open-source dataset of consumer-produced videos with such detailed meta-annotations.) Timepoint localization is more time-consuming than simply annotating whether a given sound appears in a video, but it is useful to researchers studying how to take advantage of the temporal nature of audio data, i.e., how patterns of co-occurrence and sound sequences can be used to make more fine-grained categorizations of videos. We described the results of our experiments in a report-back to interested researchers in the audio community.
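To make the distinction between timepoint-localized annotations and simple presence annotations concrete, the following is a minimal sketch in Python. It does not reflect the annotation format actually used in the pilot experiments; the `SoundEvent` type, its field names, the helper functions, and the example labels are all hypothetical and chosen only for illustration. It shows that presence labels can always be derived from timepoint annotations, while co-occurrence and sequence information is only available when timepoints are recorded.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SoundEvent:
    """One time-localized sound annotation (hypothetical schema)."""
    label: str
    onset: float   # seconds from the start of the video
    offset: float  # seconds from the start of the video


def presence_labels(events):
    """Collapse time-localized events into video-level presence labels."""
    return {e.label for e in events}


def overlapping_pairs(events):
    """Count pairs of differently labeled events whose intervals overlap."""
    counts = Counter()
    for a, b in combinations(events, 2):
        if a.label != b.label and a.onset < b.offset and b.onset < a.offset:
            counts[tuple(sorted((a.label, b.label)))] += 1
    return counts


def label_sequence(events):
    """Return labels ordered by onset time."""
    return [e.label for e in sorted(events, key=lambda e: e.onset)]


# Invented example annotations for a single video.
events = [
    SoundEvent("dog_bark", 1.0, 2.5),
    SoundEvent("speech", 2.0, 6.0),
    SoundEvent("door_slam", 7.0, 7.4),
]

print(presence_labels(events))
print(overlapping_pairs(events))
print(label_sequence(events))
```

In this sketch, a presence-only annotation would retain just the set returned by `presence_labels`; the overlap counts and the onset-ordered sequence cannot be recovered from it.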
Last Modified: 04/18/2019
Modified by: Julia Bernd