Researchers Study How Speech is "Seen"
We often refer to "hearing people talk." But there's much
more to it than that. What we see while people talk, such as how they move
their lips, tongue, cheeks, and jaw, adds to the information we get while
listening.
Much is known about auditory speech signals, but little is
known about visual speech signals, that is, what we see while listening. Even less
is known about how we process these two types of information together.
A KDI grant team from the House Ear Institute and the
University of California at Los Angeles looked at the relationship between
audio and visual speech signals to find out what information the signals
contain. They also looked at how people process the signals. The researchers
hope to apply what they learned to creating realistic talking heads.
The researchers collected data on the audio and visual parts
of speech signals. To do that, they built a lab where they recorded audio and
visual signals simultaneously. With three cameras, they were able to capture
the movement of the lips, cheeks, and jaw as the speakers talked. They also
captured the motion of the tongue inside the mouth.
After collecting the data, the project team used engineering
and computational modeling techniques to learn more about the relationship
between the audio and visual signals. Lynn Bernstein, the principal
investigator, says, "It's fascinating because we perceive that what we hear and
see is extremely different. But the thing to remember is that one system
creates both of these. That system produces something to hear and something to
see. Because the biomechanism is the same, we would assume that they would be
correlated. We found that to be true. We can take the audio signals and produce
visual speech signals, and we can take visual speech signals and produce audio
signals."
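To make the idea of cross-signal prediction concrete, the sketch below fits a simple linear mapping between toy audio and visual feature streams. This is only an illustration of the general correlation idea, not the team's actual modeling method; the feature names, dimensions, and data are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the team's method): if audio and visual
# speech signals are correlated because one vocal-tract system produces
# both, a linear mapping can predict one stream from the other.
import numpy as np

rng = np.random.default_rng(0)

n_frames = 500     # time-aligned audio/video frames (assumed)
audio_dim = 12     # e.g., spectral features per frame (assumed)
visual_dim = 6     # e.g., lip/jaw marker coordinates per frame (assumed)

# Toy data: visual features generated as a linear function of the audio
# features plus a little noise.
true_map = rng.normal(size=(audio_dim, visual_dim))
audio = rng.normal(size=(n_frames, audio_dim))
visual = audio @ true_map + 0.1 * rng.normal(size=(n_frames, visual_dim))

# Fit a least-squares mapping W so that audio @ W approximates visual.
W, *_ = np.linalg.lstsq(audio, visual, rcond=None)

# Predict the visual trajectories from audio alone and check how well
# the prediction tracks the actual visual features.
predicted = audio @ W
corr = np.corrcoef(predicted.ravel(), visual.ravel())[0, 1]
print(f"correlation between predicted and actual visual features: {corr:.3f}")
```

With strongly correlated streams, as the researchers report for real speech, such a fitted mapping recovers most of the visual signal; with uncorrelated streams, the predicted and actual features would show near-zero correlation.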
The researchers also looked at how the brain processes these
different signals. They ran experiments on how the brain responds when a person
only sees speech and when a person only hears it. They also compared the
brain's responses to correctly combined signals and to incorrectly combined
signals (for example, a signal in which the spoken words did not match the lip
movements). They found that the brain processes not just the individual signals
but also their relationship: participants' brain responses registered when the
audio signals matched the visual signals and when they did not.
Summarizing the results of these studies, Dr. Bernstein
explains, "The studies of brain response have shown us that audiovisual speech
processing is more than the sum of its parts. When we look at visual speech
processing of the brain, sets of areas are activated and many are different
from the ones that are activated when audio processing occurs. When both audio
and visual processing occur, additional parts of the brain become active. That
processing seems to involve the correlation between the two signals."
The researchers also studied which visual speech characteristics
best complement audio speech in a noisy environment. They asked eight
different talkers to say a large number of sentences. Participants were asked
to type in what they thought the talker said. Some talkers' faces conveyed more
information than others. By comparing data on the visual signals of the
different talkers with the perception of what was said, they hope to learn more
about what types of visual signals communicate most effectively.
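As a rough illustration of how such perception data might be scored, the sketch below computes a per-talker word accuracy from typed responses. The scoring rule, talker labels, and trial data here are hypothetical, not taken from the study.

```python
# Hypothetical scoring sketch: estimate each talker's intelligibility as
# the fraction of target words that appear in listeners' typed responses.
from collections import defaultdict

def word_accuracy(target: str, response: str) -> float:
    """Fraction of target words the listener reproduced (position-insensitive)."""
    target_words = target.lower().split()
    response_words = set(response.lower().split())
    if not target_words:
        return 0.0
    hits = sum(1 for w in target_words if w in response_words)
    return hits / len(target_words)

# (talker_id, target_sentence, typed_response) -- toy trials for illustration
trials = [
    ("talker_1", "the cat sat on the mat", "the cat sat on a mat"),
    ("talker_1", "open the window please", "open the window"),
    ("talker_2", "the cat sat on the mat", "cap sap on that"),
]

scores = defaultdict(list)
for talker, target, response in trials:
    scores[talker].append(word_accuracy(target, response))

for talker, vals in scores.items():
    print(f"{talker}: mean word accuracy = {sum(vals) / len(vals):.2f}")
```

Averaging such scores across many sentences and listeners would reveal which talkers' faces convey the most information, which is the comparison the researchers describe.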
The researchers hope that this line of research will be
particularly useful in creating artificial talking heads. They would like to
create synthetic talkers that seem natural and provide as much information as
real talkers do.
These synthetic talkers could prove very useful in numerous
applications. In noisy environments, it is much easier for us to understand
speech when we can also see it. For example, if you were in a noisy airport
terminal and needed information from a kiosk, a talking head would convey more
information than an audio message. It would be easier not only to identify the
words but also to understand the message.
Synthetic talkers could also be used for Web site
applications. This tool would be especially helpful for those with hearing
impairments who rely heavily on visual signals to perceive speech.
Synthetic talkers could also be used for second language
instruction. If you are listening to a new language, the sounds are not the
ones that your perceptual system is trained to hear. Synthetic talkers can show
language learners how speakers form their lips and how they control their
tongue and jaw when speaking, giving learners added insight into how speech is
produced and how it should sound.
Synthetic talkers could have great appeal in Hollywood.
Synthetic actors could be far cheaper and lower maintenance than their human
counterparts.
The project team is moving forward. They are working on
synthesizing visual speech and will study how the brain processes and combines
audio and visual speech patterns. As Dr. Bernstein put it, "Basically, although
most people's intuition is that they get information from their ears, in fact
humans are designed to get both kinds of information in face-to-face
situations. They're frequently at a disadvantage when they're not face to
face."