Award Abstract # 1629990
CI-P: Planning for AudioNet: A New Community Infrastructure for Audio Annotations for Acoustic Event Identification

NSF Org: CNS
Division Of Computer and Network Systems
Awardee: INTERNATIONAL COMPUTER SCIENCE INSTITUTE
Initial Amendment Date: June 21, 2016
Latest Amendment Date: June 21, 2016
Award Number: 1629990
Award Instrument: Standard Grant
Program Manager: Tatiana Korelsky
tkorelsk@nsf.gov  (703)292-8729
CNS Division Of Computer and Network Systems
CSE Direct For Computer & Info Scie & Enginr
Start Date: July 1, 2016
End Date: December 31, 2018 (Estimated)
Total Intended Award Amount: $100,000.00
Total Awarded Amount to Date: $100,000.00
Funds Obligated to Date: FY 2016 = $100,000.00
History of Investigator:
  • Gerald Friedland (Principal Investigator)
    fractor@icsi.berkeley.edu  (510)666-2900
  • Julia Bernd (Co-Principal Investigator)
Awardee Sponsored Research Office: International Computer Science Institute
2150 Shattuck Ave, Suite 1100
Berkeley
CA  US  94704-1345
(510)666-2900
Sponsor Congressional District: 13
Primary Place of Performance: International Computer Science Institute
1947 Center Street, Suite 600
Berkeley
CA  US  94704-1159
Primary Place of Performance Congressional District: 13
DUNS ID: 187909478
Parent DUNS ID:
NSF Program(s): CCRI-CISE Cmnty Rsrch Infrstrc
Primary Program Source: 040100 NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7359
Program Element Code(s): 7359
Award Agency Code: 4900
Fund Agency Code: 4900
CFDA Number(s): 47.070

ABSTRACT

This effort lays the groundwork for AudioNet, a public-domain corpus of audio labels for the nearly 800,000 videos in the open-access YFCC100M dataset. Audio information provides an important complement to visual information in the automatic analysis of video data, allowing systems to detect situations that may not be clearly identifiable from the visual stream alone. However, there are as yet no truly large-scale labeled audio datasets of the kind needed as input to build flexible, accurate analysis systems. Creating such a large-scale corpus will serve as an impetus for better multimedia algorithms to be developed by more researchers and computer science students, translating into an impact on the everyday life of the public at large. Social media videos are increasingly used for scientific research, as they provide an opportunity to observe and model many phenomena in the social sciences, economics, meteorology, and medicine. New capabilities for content analysis will therefore impact many scientific fields. In addition, audio analysis could be used in real-time security surveillance and in robotics applications like autonomous vehicles and household robots to aid and monitor the elderly.

AudioNet is part of a multi-institution collaboration, the Multimedia Commons initiative, which is developing a variety of resources around the YFCC100M dataset of Creative Commons-licensed photos and videos. AudioNet is annotating the audio tracks from the YFCC100M videos, focusing on audio concepts. Audio concepts can be thought of as acoustic "objects": concrete, localizable units of sound like "crowd cheering" or "fire alarm". The approach will be modeled on ImageNet, an image dataset labeled and organized using the WordNet hierarchy of synsets (groups of synonyms); ImageNet has enabled major advances in image processing. However, while ImageNet focuses largely on entities (noun synsets), audio data is inherently temporal. The label set for AudioNet will therefore focus on events and actions, though it will be similarly organized using semantic resources like WordNet.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Under the AudioNet planning grant, researchers at the International Computer Science Institute set out to determine what types of data researchers working on automatic analysis of environmental audio most need to move their field forward and explore new approaches. In addition to surveys and informal conversations with audio researchers about needs and priorities, we analyzed requirements for video retrieval for use in the sciences. Finally, we experimented with procedures for potentially creating a large-scale human-annotated audio dataset based on the open-source YFCC100M video dataset.

As part of the needs analysis, PI Friedland co-authored a paper called "Audition for Multimedia Computing", published in 2018 in Frontiers of Multimedia Research. This paper defines and sets the stage for the developing field of Computer Audition; it suggests we need to look beyond simply describing and classifying sounds, and rather work towards systems that will be able to make actionable inferences about the world based on sound (e.g., identifying what probably happened to cause a sound). In the process, the authors analyze what kind of audio datasets will be needed to achieve those goals and to evaluate success, emphasizing the need to move beyond simple categorization to structured representation.

In addition, we collaborated with other multimedia researchers on a generalized framework for multimedia big data studies (MMBDS). The goal of this framework is to allow natural scientists, social scientists, and data scientists to quickly retrieve and filter videos and images relevant to their research from the YFCC100M. Our work with scientists to lay out the requirements and structure for this framework served in part as a needs assessment for AudioNet as well. It provided us with the opportunity to examine what kinds of strong audio annotations could help audio researchers provide the most useful (and intuitive) content-based retrieval models for the framework.

For our pilot experiments, we examined the trade-offs involved in providing precise timepoints for each labeled sound in a video, using both more constrained and less constrained labeling schemes. (There is currently no large, open-source dataset of consumer-produced videos with such detailed meta-annotations.) Timepoint localization is more time-consuming than simply annotating whether a given sound appears in a video, but it is useful to researchers studying how to take advantage of the temporal nature of audio data, i.e., how patterns of co-occurrence and sound sequences can be used to make more fine-grained categorizations of videos. We described the results of our experiments in a report-back to interested researchers in the audio community.
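
As a rough illustration of the trade-off described above, the Python sketch below contrasts timepoint-localized annotations with simple presence labels. It is not the project's annotation format: the `TimedLabel` class, both helper functions, and the example clip are hypothetical names and data chosen only for illustration. The sketch shows that presence labels can always be derived from timed labels, while temporal co-occurrence can only be computed when timepoints are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedLabel:
    """A timepoint-localized annotation (hypothetical format): label plus start/end in seconds."""
    label: str
    start: float
    end: float


def to_presence_labels(events):
    """Collapse timed events into clip-level presence labels, discarding all timing."""
    return sorted({e.label for e in events})


def cooccurring_pairs(events):
    """Return pairs of distinct labels whose time intervals overlap.

    This requires timepoints; it cannot be computed from presence labels alone.
    """
    pairs = set()
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            overlaps = a.start < b.end and b.start < a.end
            if a.label != b.label and overlaps:
                pairs.add(tuple(sorted((a.label, b.label))))
    return sorted(pairs)


# Made-up annotations for a single video clip (illustrative only).
clip = [
    TimedLabel("crowd cheering", 0.0, 4.5),
    TimedLabel("fire alarm", 3.0, 9.0),
    TimedLabel("crowd cheering", 12.0, 15.0),
]

print(to_presence_labels(clip))
print(cooccurring_pairs(clip))
```

In this made-up clip, the presence labels record only that both sounds occur somewhere in the video, whereas the timed labels also reveal that the first cheer overlaps the alarm and the second does not.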


Last Modified: 04/18/2019
Modified by: Julia Bernd

