Catherine Alfieri
Fri, 12 Sep 2003 13:09:16 -0400
Beyond Voice Recognition, to a Computer That Reads Lips

September 11, 2003

PERSONAL computers have changed a lot in the last few
decades, but not in the way that people communicate with
them. Typing on a keyboard, with the help of a mouse,
remains the most common interface.

But pounding away at a set of keys is hard on the hands and
tethers users to the keyboard. Automatic speech recognition
offers some relief - the systems work reasonably well for
office dictation, for instance. But voice recognition is
not effective in noisy places like cars, train stations or
the corner cash machine, and it may stumble even under the
best of conditions. Humans are still much better than any
computer at the subtleties of speech recognition.

But teaching computers to read lips might boost the
accuracy of automatic speech recognition. Listeners
naturally use mouth movements to help them understand the
difference between "bat" and "pat," for instance. If
distinctions like this could be added to a computer's
databank with the aid of cheap cameras and powerful
processors, speech recognition software might work a lot
better, even in noisy places.

Scientists at I.B.M.'s research center in Westchester
County, at Intel's centers in China and California and in
many other labs are developing just such digital
lip-reading systems to augment the accuracy of speech
recognition.

Chalapathy Neti, a senior researcher at I.B.M.'s Thomas J.
Watson Research Center in Yorktown Heights, N.Y., has spent
the past four years focusing on how to boost the
performance of speech recognition with cameras. Dr. Neti
manages the center's research in audiovisual speech
technologies. "We humans fuse audio and visual perception
in deciding what is being spoken," he said. A computer, he
said, can be trained to do this job, too.

At I.B.M., the process starts by getting the computer and
camera to locate the person who is speaking, searching for
skin-tone pixels, for instance, and then using statistical
models that detect any object in that area that resembles a
face. Then, with the face in view, vision algorithms focus
on the mouth region, estimating the location of many
features, including the corners and center of the lips.
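
The first steps described above, searching for skin-tone pixels and
then narrowing to the mouth region, can be sketched in a few lines of
Python. This is only an illustration, not I.B.M.'s method: the RGB
thresholds are a commonly cited rule of thumb for skin color, the
"lower third of the skin area" guess for the mouth is an assumption,
and the function names are hypothetical. A real system would follow
this with a statistical face detector and lip-feature tracking.

```python
def is_skin(r, g, b):
    # Rule-of-thumb RGB skin test (an assumption, not I.B.M.'s model).
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_bounding_box(image):
    """image is a list of rows of (r, g, b) tuples.

    Returns (top, left, bottom, right), inclusive, around all
    skin-tone pixels, or None if no pixel passes the test.
    """
    rows = [y for y, row in enumerate(image)
            if any(is_skin(*pixel) for pixel in row)]
    cols = [x for row in image
            for x, pixel in enumerate(row) if is_skin(*pixel)]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


def mouth_region(box):
    # Crude guess: the mouth lies in the lower third of the skin area.
    top, left, bottom, right = box
    height = bottom - top + 1
    return (bottom - height // 3 + 1, left, bottom, right)
```

Given a frame from a camera, mouth_region(skin_bounding_box(frame))
would yield the rectangle that later stages examine more closely.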

If the camera looked solely at the mouth, though, only
about 12 to 14 sounds could be distinguished visually, Dr.
Neti said - for instance, the difference between the
explosive initial "p" and its close relative "b." So the
group enlarged the visual region to include many types of
movements. "We tried using additional visible articulators
like jaw movements and the lower cheek, and other movements
of tongue and teeth," he said, "and that turned out to be
beneficial." Then the visual and audio features were
combined and analyzed by statistical models that predicted
what the speaker was saying.
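
One simple way to picture "combining" the two kinds of features is to
join the audio and visual measurements into a single vector and hand
it to a classifier. The sketch below does that with a
nearest-centroid classifier, which stands in for the far richer
statistical models the researchers used; the function names, labels
and feature values are all hypothetical.

```python
import math


def fuse(audio_features, visual_features):
    # Feature-level fusion: one joint vector per moment of speech.
    return list(audio_features) + list(visual_features)


def train_centroids(examples):
    """examples is a list of (label, fused_vector) pairs.

    Returns a dict mapping each label to the mean of its vectors.
    """
    sums, counts = {}, {}
    for label, vec in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict(centroids, vec):
    # Pick the label whose average vector is closest to this one.
    return min(centroids,
               key=lambda label: math.dist(centroids[label], vec))
```

If two words sound alike in noise, so that their audio features are
nearly identical, the visual part of the joint vector can still pull
the prediction toward the right one.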

Using inexpensive laptop cameras, the group tested the new
system repeatedly. When they introduced a lot of background
audio noise, Dr. Neti said, the combined audio and
visual analysis of speech worked well, demonstrating up to
a 100 percent improvement in accuracy compared with using
audio alone.

These were promising results, but as Dr. Neti pointed out,
a studio is not the world. Many camera-based systems that
work well in the controlled conditions of a laboratory fail
when they are tested in a car, for instance, where the
lighting is uneven or people face away from the camera.

To handle circumstances like this, he and his colleagues
are developing several solutions. One is an audiovisual
headset, now in prototype, with a tiny camera mounted on
the boom. "This way, the mouth region can always be seen,"
he said, independent of head movement or walking. I.B.M. is
also exploring the use of infrared illuminators for the
mouth region to provide constant lighting.

Dr. Neti said that such headsets might prove useful in
workplaces where people fill out forms or enter data by
using speech recognition software.

Another solution to changing video conditions is a feedback
system devised by the I.B.M. research group. "Our system
tracks confidence levels as it combines audio and visual
features," making a decision on the relative weight of the
two sources, Dr. Neti said. When a speaker faces away from
the camera, he said, the confidence level becomes zero
and the system ignores the visual information and simply
uses audio information. When the visual information is
strong, it is included.
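
The weighting idea can be illustrated with a small sketch. It assumes
each recognizer produces a log-likelihood score for every candidate
word and that the visual scores are scaled by a confidence value
between 0 and 1; the formula, function names and numbers are
illustrative assumptions, not the I.B.M. group's actual scheme.

```python
def fuse_scores(audio_scores, visual_scores, visual_confidence):
    """Combine per-word log-likelihoods from the two recognizers.

    The visual scores are scaled by a confidence clamped to [0, 1].
    At confidence 0 the result is exactly the audio scores.
    """
    weight = min(max(visual_confidence, 0.0), 1.0)
    return {word: audio_scores[word] + weight * visual_scores[word]
            for word in audio_scores}


def best_word(scores):
    # The word with the highest combined score wins.
    return max(scores, key=scores.get)
```

With hypothetical audio scores of -1.0 for "bat" and -1.1 for "pat,"
and visual scores of -3.0 and -0.5, a confidence of 0 leaves the
audio choice, "bat," in place, while a confidence of 1 tips the
decision to "pat."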

"The more pixels you can get for the mouth region," he
said, "the better information you'll have."

The goal of the system is always to do better than when
relying on an audio or video stream alone. "At worst, it is
as good as audio," Dr. Neti said. "At best, it is much
better."

At Intel, too, researchers have developed software for
combined audiovisual analysis of speech and released the
software for public use as part of the company's Open
Source Computer Vision Library, said Ara V. Nefian, a
senior Intel researcher who led the project. "We extract
visual features and then acoustic features, and combine
them using a model that analyzes them jointly," he said. In
tests, the system could identify four out of five words in
noisy environments.

"The results were as good for Chinese as for English," Dr.
Nefian added, suggesting that the system could be
introduced elsewhere.

Aggelos Katsaggelos, a professor of electrical and computer
engineering at Northwestern University in Evanston, Ill.,
is also developing an audiovisual speech recognition
system. He said that a future application might be improved
security, using such a system, for instance, to determine
whether recent videos that have surfaced indeed showed
Saddam Hussein himself or an imposter. "In principle, if
one can use both video and audio analysis one can have a
better accuracy in identifying people," he said.

Iain Matthews, a research scientist at Carnegie Mellon
University's Robotics Institute who works mainly on face
tracking and modeling, said that audiovisual speech
recognition was a logical step. "Psychology showed this 50
years ago," he said. "If you can see a person speaking, you
can understand that person better."



