Banner: Multi-Disciplinary Research at NSF: Accomplishments of the KDI Initiative

  Quick Links

House Ear Institute Web site

Qualisys™ 3-D Motion Capture System

UCLA Phonetics Lab

Segmental and Prosodic Optical Phonetics for Human and Machine Speech Processing

When someone speaks in a noisy environment, it is generally easier to understand the talker if his or her face is within the listener's field of view. The movements of the talker's lips, jaw, and facial muscles provide optical cues to what he or she is saying. Similarly, among deaf and hearing-impaired people, the ability to read lips can enable at least partial understanding of speech.

Such optical phonetics has been the focus of National Science Foundation-supported research carried out at the House Ear Institute (HEI) in Los Angeles, in collaboration with electronics engineering and linguistics experts from the University of California at Los Angeles (UCLA). Work on this project has played a key role in developing the early professional career of a Chinese Ph.D. student in UCLA's Department of Electrical Engineering, who has worked alongside senior researchers.

Image: Qualisys Recording Positions

According to Lynne E. Bernstein, head of the HEI's Department of Communication Neurosciences, the project has two main parts: (1) developing and using advanced equipment to record optical speech signals accurately and simultaneously with the acoustic signal, and analyzing the relationship between the two; and (2) investigating how the speech-perceiving brain takes in speech information and uses it, whether for lip-reading alone, for listening alone, or for a combination of the two.

Image: 3-D Model Control Points

"I think we've been extremely successful," Bernstein says. "The engineering side has yielded, to begin with, a very large database of recordings of talkers. The database has in it acoustic, video, and 3-D optical recordings that are synchronized. Part of the database also has midsagittal magnetometer signals. These are recordings of the talker's tongue. We've been able to use these signals in studies that used multilinear regression to show that there is a systematic relationship between acoustic signals and optical signals. So it is possible to predict the acoustic signal from the optical signal and the optical signal from the acoustic signal."
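The multilinear regression Bernstein mentions can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the feature counts (three optical, two acoustic) are arbitrary, and the names `fit_linear_map`, `predict`, and `TRUE_MAP` are hypothetical. It is not the project's software, only an ordinary least-squares fit written with the Python standard library.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(x, y):
    """Least-squares weights (last row is the intercept) mapping x to y."""
    xa = [row + [1.0] for row in x]
    xt = transpose(xa)
    # Normal equations: (Xa^T Xa) W = Xa^T Y
    return solve(matmul(xt, xa), matmul(xt, y))


def predict(w, x):
    return matmul([row + [1.0] for row in x], w)


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


# Synthetic stand-in for synchronized recordings (not real data):
# 200 frames, 3 optical features per frame, 2 acoustic features per frame.
rng = random.Random(0)
TRUE_MAP = [[0.8, -0.3], [0.1, 0.5], [-0.4, 0.2]]  # assumed for the demo
optical = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
acoustic = [[v + rng.gauss(0, 0.05) for v in row]
            for row in matmul(optical, TRUE_MAP)]

weights = fit_linear_map(optical, acoustic)   # optical -> acoustic
predicted = predict(weights, optical)
r = pearson([p[0] for p in predicted], [a[0] for a in acoustic])
print("correlation between predicted and recorded acoustic feature 0:", round(r, 3))
```

Swapping the arguments, as in `fit_linear_map(acoustic, optical)`, fits the reverse direction, which corresponds to predicting the optical signal from the acoustic one. On real recordings the fit would be far less clean than on this synthetic example.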

Asked about practical applications of the project's findings, the HEI researcher says: "Our interest is in synthesis of visual talking heads. The idea would be that rather than recording all the video you need to have someone talk, maybe on a computer interface, you would record a limited corpus and use the relationships between the acoustical and optical characteristics of the talker to then synthesize the talker saying things he had never said, for some future time or some future purpose. So we're going ahead in that area."

Bernstein, who previously worked at Gallaudet University for the deaf in Washington, DC, says such synthetic talking heads could in the future be used to help deaf and hearing-impaired people who lip-read. "We know these people can make very efficient use of combining visual information with residual hearing, or just using the visual information by itself. Up until now, and I would include our group, we do not have an accurate reading of a synthetic talking face, so that a good deaf lip reader could look at a synthetic talking face and lip read. That hasn't been done yet. But our studies have that as a kind of goal for the future."

Image: EMA Recording Positions

The House Ear Institute has previously pioneered research to address hearing loss, including the invention of the cochlear implant and the brainstem implant.

Bernstein observes, "I'm a big proponent of multi-disciplinary research. My own work has always been multi-disciplinary. I think that in the absence of the KDI program or other programs like that, it would be very difficult to bring together the number of people who are needed, in various [areas of] expertise, to do this kind of project."

The research team for the optical phonetics project has included Jintao Jiang, a Ph.D. student at UCLA who is doing his dissertation in this area. Jiang, who received his B.S. and M.S. degrees in electronic engineering from Tsinghua University in China, says that after completing his doctoral degree in December 2003 he hopes to obtain an R&D position either at a U.S. university or in industry, in the field of audiovisual speech processing and automatic speech recognition.

Bernstein says that Jiang has made a significant contribution to the research. "He's been a major player in working out the software and the procedures for actualizing the analyses. He has worked with me also on perceptual experiments—for example, one in which we obtained lip-reading results from hearing perceivers and modeled those results, and then correlated the modeled results with physical measurements from the talkers, which showed directly that what the optical stimulus is doing is having a direct effect on what those people are perceiving. Again, that is something that has really never been done so precisely for visual speech."



The National Science Foundation
4201 Wilson Boulevard, Arlington, Virginia 22230, USA
Tel: 703-292-5111, FIRS: 800-877-8339 | TDD: 703-292-5090