>Beyond Voice Recognition, to a Computer That Reads Lips
>
>September 11, 2003
>By ANNE EISENBERG
>
>PERSONAL computers have changed a lot in the last few
>decades, but not in the way that people communicate with
>them. Typing on a keyboard, with the help of a mouse,
>remains the most common interface.
>
>But pounding away at a set of keys is hard on the hands and
>tethers users to the keyboard. Automatic speech recognition
>offers some relief - the systems work reasonably well for
>office dictation, for instance. But voice recognition is
>not effective in noisy places like cars, train stations or
>the corner cash machine, and it may stumble even under the
>best of conditions. Humans are still much better than any
>computer at the subtleties of speech recognition.
>
>But teaching computers to read lips might boost the
>accuracy of automatic speech recognition. Listeners
>naturally use mouth movements to help them understand the
>difference between "bat" and "pat," for instance. If
>distinctions like this could be added to a computer's
>databank with the aid of cheap cameras and powerful
>processors, speech recognition software might work a lot
>better, even in noisy places.
>
>Scientists at I.B.M.'s research center in Westchester
>County, at Intel's centers in China and California and in
>many other labs are developing just such digital
>lip-reading systems to augment the accuracy of speech
>recognition.
>
>Chalapathy Neti, a senior researcher at I.B.M.'s Thomas J.
>Watson Research Center in Yorktown Heights, N.Y., has spent
>the past four years focusing on how to boost the
>performance of speech recognition with cameras. Dr. Neti
>manages the center's research in audiovisual speech
>technologies. "We humans fuse audio and visual perception
>in deciding what is being spoken," he said. A computer, he
>said, can be trained to do this job, too.
>
>At I.B.M., the process starts by getting the computer and
>camera to locate the person who is speaking, searching for
>skin-tone pixels, for instance, and then using statistical
>models that detect any object in that area that resembles a
>face. Then, with the face in view, vision algorithms focus
>on the mouth region, estimating the location of many
>features, including the corners and center of the lips.
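>
>A minimal sketch of that first step, assuming the OpenCV
>library's stock Haar-cascade face detector in place of
>I.B.M.'s skin-tone search and statistical face models; the
>lower-third-of-the-face rule for locating the mouth is an
>assumption made purely for illustration:
>
>    import cv2
>
>    # Pre-trained frontal-face detector shipped with OpenCV.
>    face_cascade = cv2.CascadeClassifier(
>        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
>
>    def mouth_region(frame):
>        """Return (x, y, w, h) of a rough mouth region, or None."""
>        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
>        faces = face_cascade.detectMultiScale(gray, 1.1, 5)
>        if len(faces) == 0:
>            return None
>        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
>        # Heuristic: the mouth sits in the lower third of the face box.
>        return (x, y + 2 * h // 3, w, h // 3)
>
>    cap = cv2.VideoCapture(0)          # an inexpensive laptop camera
>    ok, frame = cap.read()
>    if ok:
>        print("mouth region:", mouth_region(frame))
>    cap.release()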
>
>If the camera looked solely at the mouth, though, only
>about 12 to 14 sounds could be distinguished visually, Dr.
>Neti said - for instance, the difference between the
>explosive initial "p" and its close relative "b." So the
>group enlarged the visual region to include many types of
>movements. "We tried using additional visible articulators
>like jaw movements and the lower cheek, and other movements
>of tongue and teeth," he said, "and that turned out to be
>beneficial." Then the visual and audio features were
>combined and analyzed by statistical models that predicted
>what the speaker was saying.
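>
>A sketch of that kind of feature-level fusion, not I.B.M.'s
>actual models: simple lip-geometry measurements taken from
>the enlarged mouth-and-jaw region are joined, frame by
>frame, to standard audio features (MFCCs), and each word is
>scored by its own Gaussian mixture. The feature choices and
>array shapes here are illustrative assumptions, with the
>librosa and scikit-learn libraries standing in for the
>statistical machinery:
>
>    import numpy as np
>    import librosa
>    from sklearn.mixture import GaussianMixture
>
>    def visual_features(lip_points):
>        """lip_points: (n_frames, n_landmarks, 2) mouth/jaw tracks."""
>        width = np.ptp(lip_points[:, :, 0], axis=1)    # mouth width
>        height = np.ptp(lip_points[:, :, 1], axis=1)   # jaw opening
>        return np.column_stack([width, height, height / (width + 1e-6)])
>
>    def audio_features(waveform, sr, n_frames):
>        mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13).T
>        # Resample audio frames to line up with the video frames.
>        idx = np.linspace(0, len(mfcc) - 1, n_frames).astype(int)
>        return mfcc[idx]
>
>    def fused_features(lip_points, waveform, sr):
>        v = visual_features(lip_points)
>        a = audio_features(waveform, sr, len(v))
>        return np.hstack([a, v])     # one audio+visual vector per frame
>
>    def train_word_model(fused_examples):
>        """Fit one Gaussian mixture on all fused frames of one word."""
>        gmm = GaussianMixture(n_components=3, covariance_type="diag")
>        return gmm.fit(np.vstack(fused_examples))
>
>    def recognize(fused, word_models):
>        """Pick the word whose model gives the highest likelihood."""
>        return max(word_models, key=lambda w: word_models[w].score(fused))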
>
>Using inexpensive laptop cameras, the group tested the new
>system repeatedly. When they introduced a lot of background
>audio noise, Dr. Neti said, the combination audio and
>visual analysis of speech worked well, demonstrating up to
>a 100 percent improvement in accuracy compared with using
>audio alone.
>
>These were promising results, but as Dr. Neti pointed out,
>a studio is not the world. Many camera-based systems that
>work well in the controlled conditions of a laboratory fail
>when they are tested in a car, for instance, where the
>lighting is uneven or people face away from the camera.
>
>To handle circumstances like this, he and his colleagues
>are developing several solutions. One is an audiovisual
>headset, now in prototype, with a tiny camera mounted on
>the boom. "This way, the mouth region can always be seen,"
>he said, independent of head movement or walking. I.B.M. is
>also exploring the use of infrared illuminators for the
>mouth region to provide constant lighting.
>
>Dr. Neti said that such headsets might prove useful in
>workplaces where people fill out forms or enter data by
>using speech recognition software.
>
>Another solution to changing video conditions is a feedback
>system devised by the I.B.M. research group. "Our system
>tracks confidence levels as it combines audio and visual
>features," making a decision on the relative weight of the
>two sources, Dr. Neti said. When a speaker faces away from
>the camera, he said, the confidence level drops to zero
>and the system ignores the visual information, relying on
>audio alone. When the visual information is
>strong, it is included.
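>
>The weighting itself can be sketched in a few lines; the
>log-likelihood scores and confidence values below are
>invented for illustration and do not describe I.B.M.'s
>actual algorithm:
>
>    def combine_scores(audio_scores, visual_scores, visual_confidence):
>        """Each dict maps word -> log-likelihood from one stream."""
>        lam = max(0.0, min(1.0, visual_confidence))   # clamp to [0, 1]
>        combined = {w: (1 - lam) * audio_scores[w] + lam * visual_scores[w]
>                    for w in audio_scores}
>        return max(combined, key=combined.get)
>
>    audio = {"bat": -4.2, "pat": -4.0}
>    visual = {"bat": -1.0, "pat": -9.0}
>    # Speaker facing away from the camera: confidence 0, audio decides.
>    print(combine_scores(audio, visual, 0.0))   # -> "pat"
>    # Clear view of the mouth: the visual evidence tips the decision.
>    print(combine_scores(audio, visual, 0.8))   # -> "bat"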
>
>"The more pixels you can get for the mouth region," he
>said, "the better information you'll have."
>
>The goal of the system is always to do better than when
>relying on an audio or video stream alone. "At worst, it is
>as good as audio," Dr. Neti said. "At best, it is much
>better."
>
>At Intel, too, researchers have developed software for
>combined audiovisual analysis of speech and released the
>software for public use as part of the company's Open
>Source Computer Vision Library, said Ara V. Nefian, a
>senior Intel researcher who led the project. "We extract
>visual features and then acoustic features, and combine
>them using a model that analyzes them jointly," he said. In
>tests, the system could identify four out of five words in
>noisy environments.
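>
>"Four out of five words" corresponds to roughly 80 percent
>word accuracy. As a sketch of how such a figure is usually
>computed (the standard metric, not a description of Intel's
>exact test protocol): word accuracy is one minus the word
>error rate, the word-level edit distance between the
>recognized and reference transcripts divided by the length
>of the reference:
>
>    def word_error_rate(reference, hypothesis):
>        ref, hyp = reference.split(), hypothesis.split()
>        # Word-level edit distance: substitutions, insertions, deletions.
>        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
>        for i in range(len(ref) + 1):
>            d[i][0] = i
>        for j in range(len(hyp) + 1):
>            d[0][j] = j
>        for i in range(1, len(ref) + 1):
>            for j in range(1, len(hyp) + 1):
>                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
>                d[i][j] = min(d[i - 1][j] + 1,          # deletion
>                              d[i][j - 1] + 1,          # insertion
>                              d[i - 1][j - 1] + cost)   # substitution
>        return d[-1][-1] / len(ref)
>
>    ref = "turn left at the corner cash machine"
>    hyp = "turn left at the corner cat machine"
>    print("word accuracy:", 1.0 - word_error_rate(ref, hyp))  # about 0.86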
>
>"The results were as good for Chinese as for English," Dr.
>Nefian added, suggesting that the system could be used
>with other languages as well.
>
>Aggelos Katsaggelos, a professor of electrical and computer
>engineering at Northwestern University in Evanston, Ill.,
>is also developing an audiovisual speech recognition
>system. He said that a future application might be improved
>security, using such a system, for instance, to determine
>whether recent videos that have surfaced indeed showed
>Saddam Hussein himself or an impostor. "In principle, if
>one can use both video and audio analysis one can have a
>better accuracy in identifying people," he said.
>
>Iain Matthews, a research scientist at Carnegie Mellon
>University's Robotics Institute who works mainly on face
>tracking and modeling, said that audiovisual speech
>recognition was a logical step. "Psychology showed this 50
>years ago," he said. "If you can see a person speaking, you
>can understand that person better."
>
>http://www.nytimes.com/2003/09/11/technology/circuits/11next.html?ex=1064384722&ei=1&en=cbbced2317c66237
>
>Copyright 2003 The New York Times Company