Multi-Disciplinary Research at NSF: Accomplishments of the KDI Initiative
Tools

  Quick Links

House Ear Institute Web site

Qualisys™ 3-D Motion Capture System

UCLA Phonetics Lab

Researchers Study How Speech is "Seen"

We often refer to "hearing people talk," but there is much more to it than that. What we see while people talk (how they move their lips, tongue, cheeks, and jaw) adds to the information we get while "listening."

Much is known about auditory speech signals, but little is known about visual speech signals, that is, what we see while listening. Even less is known about how we process these two types of information together.

A KDI grant team from the House Ear Institute and the University of California at Los Angeles looked at the relationship between audio and visual speech signals to find out what information the signals contain. They also looked at how people process the signals. The researchers hope to apply what they learned to creating realistic talking heads.

The researchers collected data on the audio and visual parts of speech signals. To do that, they built a lab where they could record audio and visual signals simultaneously. With three cameras, they captured the movement of the speakers' lips, cheeks, and jaw as they talked. They also captured the motion of the tongue inside the mouth.

Image depicting the flow of recording speech

After collecting the data, the project team used engineering and computational modeling techniques to learn more about the relationship between the audio and visual signals. Lynn Bernstein, the principal investigator, says, "It's fascinating because we perceive that what we hear and see is extremely different. But the thing to remember is that one system creates both of these. That system produces something to hear and something to see. Because the biomechanism is the same, we would assume that they would be correlated. We found that to be true. We can take the audio signals and produce visual speech signals, and we can take visual speech signals and produce audio speech signals."
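The idea that one underlying system drives both signals can be illustrated with a toy calculation. The sketch below is not the project team's model: the jaw trace, the two measured features, and all the numbers are invented for illustration. It assumes a single hidden jaw movement that drives both an acoustic feature and a lip-marker height, then checks how strongly the two are correlated and how well a straight-line fit predicts one from the other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


rng = random.Random(0)  # fixed seed so the toy data are repeatable

# Hypothetical shared source: jaw opening over 200 video frames.
jaw = [math.sin(0.1 * t) for t in range(200)]

# Invented measurements, each driven by the same jaw motion plus noise.
audio = [2.0 * j + rng.gauss(0, 0.1) for j in jaw]          # an acoustic feature
visual = [0.5 - 1.5 * j + rng.gauss(0, 0.1) for j in jaw]   # a lip-marker height

r = pearson(audio, visual)
slope, intercept = fit_line(audio, visual)
predicted = [slope * a + intercept for a in audio]
rmse = math.sqrt(sum((p - v) ** 2 for p, v in zip(predicted, visual)) / len(visual))
print(f"correlation = {r:.2f}, prediction error = {rmse:.2f}")
```

Swapping the arguments to `fit_line` gives the reverse mapping, from the visual feature back to the audio feature. Real speech would need many features and richer models than one straight line.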

The researchers also looked at how the brain processes these different signals. They ran experiments on how the brain responds when a person only sees or only hears speech. They also compared how the brain responds when a person sees and hears correctly combined signals with how it responds to incorrectly combined signals (for example, a signal in which the words spoken do not match the lip movements). They found that the brain processes not just the individual signals but also their relationship. The participants' brain responses distinguished audio signals that went with the visual signals from those that did not.

Summarizing the results of these studies, Dr. Bernstein explains, "The studies of brain response have shown us that audiovisual speech processing is more than the sum of its parts. When we look at visual speech processing of the brain, sets of areas are activated and many are different from the ones that are activated when audio processing occurs. When both audio and visual processing occur, additional parts of the brain become active. That processing seems to involve the correlation between the two signals."

The researchers also studied which speech characteristics best complement audio speech in a noisy environment. They asked eight different talkers to say a large number of sentences, and participants typed what they thought each talker had said. Some talkers' faces conveyed more information than others. By comparing data on the different talkers' visual signals with what participants perceived, the researchers hope to learn more about which types of visual signals communicate most effectively.
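One simple way to compare talkers in an experiment like this is to score each typed response by the fraction of spoken words it contains, then average the scores for each talker. The article does not say how the responses were actually scored, so the sketch below is an assumption throughout: the function names, the scoring rule, and the sample sentences are all hypothetical.

```python
def words_correct(spoken, typed):
    """Fraction of the spoken words that also appear in the typed response.

    Each typed word can match only one spoken word. Assumes `spoken`
    contains at least one word.
    """
    target = spoken.lower().split()
    remaining = typed.lower().split()
    hits = 0
    for word in target:
        if word in remaining:
            remaining.remove(word)  # use each typed word at most once
            hits += 1
    return hits / len(target)


def rank_talkers(trials):
    """Average the scores per talker and sort from most to least intelligible.

    `trials` is a list of (talker, spoken sentence, typed response) tuples.
    """
    scores = {}
    for talker, spoken, typed in trials:
        scores.setdefault(talker, []).append(words_correct(spoken, typed))
    means = {talker: sum(s) / len(s) for talker, s in scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Invented trials for two hypothetical talkers.
trials = [
    ("talker_a", "the boat left the dock", "the boat left the dock"),
    ("talker_a", "she bought fresh bread", "she bought french bread"),
    ("talker_b", "the boat left the dock", "the goat left"),
    ("talker_b", "she bought fresh bread", "he brought bread"),
]

for talker, mean_score in rank_talkers(trials):
    print(f"{talker}: {mean_score:.3f}")
```

A ranking like this could then be set beside measurements of each talker's facial movements to look for the visual features that go with higher scores.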

Image showing points of data collection

The researchers hope that this line of research will be particularly useful in creating artificial talking heads. They would like to create synthetic talkers that seem natural and provide as much information as possible.

These synthetic talkers could prove very useful in numerous applications. In noisy environments, it is much easier for us to understand speech when we can also see it. For example, if you were in a noisy airport terminal and needed information from a kiosk, a talking head would convey more information than an audio message. It would be easier not only to identify the words but also to understand the message.

Synthetic talkers could also be used for Web site applications. This tool would be especially helpful for those with hearing impairments who rely heavily on visual signals to perceive speech.

Synthetic talkers could also be used for second-language instruction. When you listen to a new language, the sounds are not the ones your perceptual system is trained to hear. Synthetic talkers can show language learners how speakers shape their lips and control their tongue and jaw, which gives learners more insight into how speech is produced and how it should sound.

Synthetic talkers could have great appeal in Hollywood. Synthetic actors could be far cheaper and lower maintenance than their human counterparts!

The project team is moving forward. They are working on synthesizing visual speech and will study how the brain processes and combines audio and visual speech patterns. As Dr. Bernstein put it, "Basically, although most people's intuition is that they get information from their ears, in fact humans are designed to get both kinds of information in face-to-face situations. They're frequently at a disadvantage when they're not face to face."

 


The National Science Foundation
4201 Wilson Boulevard, Arlington, Virginia 22230, USA
Tel: 703-292-5111, FIRS: 800-877-8339 | TDD: 703-292-5090