Segmental and Prosodic Optical Phonetics for Human and
Machine Speech Processing
When someone speaks in a noisy environment, it generally helps to be understood if the talker's face is within the listener's field of view. The movements of the talker's lips, jaw, and facial muscles provide optical cues to what he or she is saying. Similarly, among the deaf and hearing-impaired, the ability to read lips can enable at least partial understanding of speech.
Such optical phonetics have been the focus of National Science Foundation-supported research carried out at the House Ear Institute (HEI) in Los Angeles, in collaboration with electrical engineering and linguistics experts from the University of California at Los Angeles (UCLA). Work on this project has also played a key role in the early professional development of a Chinese Ph.D. student in UCLA's Department of Electrical Engineering, who has worked alongside the senior researchers.
According to Lynne E. Bernstein, head of HEI's Department of Communication Neurosciences, the project has two main parts: (1) developing and using advanced equipment to record optical speech signals accurately and simultaneously with their acoustical counterparts, and analyzing the relationship between the two; and (2) investigating how the speech-perceiving brain takes in speech information and uses it, whether for lip-reading alone, for listening alone, or for a combination of the two.
"I think we've been extremely successful," Bernstein says. "The
engineering side has yielded, to begin with, a very large database of
recordings of talkers. The database has in it acoustic, video, and 3-D optical
recordings that are synchronized. Part of the database also has midsagittal
magnetometer signals. These are recordings of the talker's tongue. We've been
able to use these signals in studies that used multilinear regression to show
that there is a systematic relationship between acoustic signals and optical
signals. So it is possible to predict the acoustic signal from the optical
signal and the optical signal from the acoustic signal."
Asked about practical applications of the project's
findings, the HEI researcher says: "Our interest is in synthesis of visual
talking heads. The idea would be that rather than recording all the video you
need to have someone talk, maybe on a computer interface, you would record a
limited corpus and use the relationships between the acoustical and optical
characteristics of the talker to then synthesize the talker saying things he
had never said, for some future time or some future purpose. So we're going
ahead in that area."
Bernstein, who previously worked at Gallaudet University, a university for the deaf in Washington, DC, says such synthetic talking heads could one day help deaf and hearing-impaired people who are lip-readers. "We know these people can make very efficient use of combining
visual information with residual hearing, or just using the visual information
by itself. Up until now, and I would include our group, we do not have an
accurate rendering of a synthetic talking face, so that a good deaf lip reader
could look at a synthetic talking face and lip read. That hasn't been done yet.
But our studies have that as a kind of goal for the future."
Previously, the House Ear Institute has pioneered research on treating hearing loss, including development of the cochlear implant.
Bernstein observes, "I'm a big proponent of
multi-disciplinary research. My own work has always been multi-disciplinary. I
think that in the absence of the KDI program or other programs like that, it
would be very difficult to bring together the number of people who are needed,
in various [areas of] expertise, to do this kind of project."
The research team for the optical phonetics project has
included Jintao Jiang, a Ph.D. student at UCLA who is doing his dissertation in
this area. Jiang, who received his B.S. and M.S. degrees in electronic
engineering from Tsinghua University in China, says that after completing his
doctoral degree in December 2003 he hopes to obtain an R&D position either
at a U.S. university or in industry, in the field of audiovisual speech
processing and automatic speech recognition.
Bernstein says that Jiang has made a significant
contribution to the research. "He's been a major player in working out the
software and the procedures for actualizing the analyses. He has worked with me
also on perceptual experiments, for example one in which we obtained
lip-reading results from hearing perceivers and modeled those results, and then
correlated the modeled results with physical measurements from the talkers,
which showed directly that what the optical stimulus is doing is having a
direct effect on what those people are perceiving. Again, that is something
that has really never been done so precisely for visual speech."
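The analysis Bernstein describes, relating modeled perceptual results to physical measurements of the talkers, can be sketched roughly as a correlation between two sets of pairwise distances. The example below is a hypothetical illustration with synthetic numbers, not the study's data or procedure: perceptual dissimilarities between speech segments (as might be modeled from lip-reading confusions) are correlated with physical distances computed from optical measurements.

```python
# Minimal sketch (hypothetical data, not the study's analysis): correlate
# perceptual dissimilarities between speech segments, as modeled from
# lip-reading confusions, with physical distances measured from the talkers.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

n_segments = 15      # hypothetical number of consonants tested
n_measures = 10      # hypothetical optical measurements per segment

# Physical measurements per segment (e.g., averaged lip-opening and jaw values).
physical = rng.normal(size=(n_segments, n_measures))

# Pairwise physical distances between segments.
physical_dist = pdist(physical, metric="euclidean")

# Hypothetical perceptual dissimilarities: partly driven by the physical
# distances plus noise, standing in for values modeled from confusion data.
perceptual_dist = 0.8 * physical_dist + 0.3 * rng.normal(size=physical_dist.shape)

# A reliable correlation would indicate that properties of the optical
# stimulus are reflected in what lip-readers perceive.
r, p = pearsonr(physical_dist, perceptual_dist)
print(f"correlation r = {r:.2f}, p = {p:.3g}")
```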