EASI Archives

Equal Access to Software & Information (distribution list)

EASI@LISTSERV.ICORS.ORG

Sender: "* EASI: Equal Access to Software & Information" <[log in to unmask]>
From: Catherine Alfieri <[log in to unmask]>
Date: Fri, 12 Sep 2003 13:09:16 -0400
Reply-To: "* EASI: Equal Access to Software & Information" <[log in to unmask]>
Beyond Voice Recognition, to a Computer That Reads Lips

September 11, 2003
By ANNE EISENBERG

PERSONAL computers have changed a lot in the last few
decades, but not in the way that people communicate with
them. Typing on a keyboard, with the help of a mouse,
remains the most common interface.

But pounding away at a set of keys is hard on the hands and
tethers users to the keyboard. Automatic speech recognition
offers some relief - the systems work reasonably well for
office dictation, for instance. But voice recognition is
not effective in noisy places like cars, train stations or
the corner cash machine, and it may stumble even under the
best of conditions. Humans are still much better than any
computer at the subtleties of speech recognition.

But teaching computers to read lips might boost the
accuracy of automatic speech recognition. Listeners
naturally use mouth movements to help them understand the
difference between "bat" and "pat," for instance. If
distinctions like this could be added to a computer's
databank with the aid of cheap cameras and powerful
processors, speech recognition software might work a lot
better, even in noisy places.

Scientists at I.B.M.'s research center in Westchester
County, at Intel's centers in China and California and in
many other labs are developing just such digital
lip-reading systems to augment the accuracy of speech
recognition.

Chalapathy Neti, a senior researcher at I.B.M.'s Thomas J.
Watson Research Center in Yorktown Heights, N.Y., has spent
the past four years focusing on how to boost the
performance of speech recognition with cameras. Dr. Neti
manages the center's research in audiovisual speech
technologies. "We humans fuse audio and visual perception
in deciding what is being spoken," he said. A computer, he
said, can be trained to do this job, too.

At I.B.M., the process starts by getting the computer and
camera to locate the person who is speaking, searching for
skin-tone pixels, for instance, and then using statistical
models that detect any object in that area that resembles a
face. Then, with the face in view, vision algorithms focus
on the mouth region, estimating the location of many
features, including the corners and center of the lips.
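
A rough sketch of that detect-then-crop step, written in
Python against the open-source OpenCV library's stock face
detector. It is a crude stand-in for the statistical face
and lip models described above, and the lower-third mouth
crop is an illustrative assumption, not I.B.M.'s method.

    import cv2

    # Stock frontal-face model bundled with opencv-python.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def mouth_region(frame):
        """Return a crude mouth-region crop, or None if no face is seen."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        if len(faces) == 0:
            return None
        x, y, w, h = faces[0]          # take the first detected face
        # Assume the mouth lies in the lower third of the face box.
        return frame[y + 2 * h // 3 : y + h, x : x + w]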

If the camera looked solely at the mouth, though, only
about 12 to 14 sounds could be distinguished visually, Dr.
Neti said - for instance, the difference between the
explosive initial "p" and its close relative "b." So the
group enlarged the visual region to include many types of
movements. "We tried using additional visible articulators
like jaw movements and the lower cheek, and other movements
of tongue and teeth," he said, "and that turned out to be
beneficial." Then the visual and audio features were
combined and analyzed by statistical models that predicted
what the speaker was saying.
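
In outline, that feature-level fusion can be as simple as
concatenating synchronized audio and visual feature vectors
and scoring them against per-word statistical models. The
sketch below, in Python, uses diagonal-covariance Gaussians
as an illustrative stand-in for the group's models; the
feature extractors are assumed to exist elsewhere.

    import numpy as np

    def fuse(audio_feats, visual_feats):
        """Concatenate synchronized audio and visual feature vectors."""
        return np.concatenate([audio_feats, visual_feats], axis=-1)

    def log_likelihood(x, mean, var):
        """Log-likelihood of one fused frame under a diagonal Gaussian."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def recognize(fused_frames, word_models):
        """Pick the word whose (mean, var) model best explains the
        frames, e.g. separating the "bat"/"pat" pair mentioned earlier."""
        return max(word_models, key=lambda w: sum(
            log_likelihood(f, *word_models[w]) for f in fused_frames))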

Using inexpensive laptop cameras, the group tested the new
system repeatedly. When they introduced a lot of background
audio noise, Dr. Neti said, the combined audio and
visual analysis of speech worked well, demonstrating up to
a 100 percent improvement in accuracy compared with using
audio alone.

These were promising results, but as Dr. Neti pointed out,
a studio is not the world. Many camera-based systems that
work well in the controlled conditions of a laboratory fail
when they are tested in a car, for instance, where the
lighting is uneven or people face away from the camera.

To handle circumstances like this, he and his colleagues
are developing several solutions. One is an audiovisual
headset, now in prototype, with a tiny camera mounted on
the boom. "This way, the mouth region can always be seen,"
he said, independent of head movement or walking. I.B.M. is
also exploring the use of infrared illuminators for the
mouth region to provide constant lighting.

Dr. Neti said that such headsets might prove useful in
workplaces where people fill out forms or enter data by
using speech recognition software.

Another solution to changing video conditions is a feedback
system devised by the I.B.M. research group. "Our system
tracks confidence levels as it combines audio and visual
features," making a decision on the relative weight of the
two sources, Dr. Neti said. When a speaker faces away from
the camera, he said, the confidence level becomes zero
and the system ignores the visual information and simply
uses audio information. When the visual information is
strong, it is included.
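
One simple way to realize that weighting, sketched in
Python: each stream contributes a log-likelihood score, and
the visual score's weight scales with a confidence value
clamped to the range 0 to 1. The particular blending
formula is an assumption for illustration, not I.B.M.'s
published scheme.

    def combined_score(audio_loglik, visual_loglik, visual_confidence):
        """Blend stream scores; confidence 0 falls back to audio alone."""
        w = max(0.0, min(1.0, visual_confidence))  # clamp to [0, 1]
        # At w == 0 the score is pure audio; at w == 1 the two
        # streams share weight equally, as in the fallback above.
        return (1.0 - 0.5 * w) * audio_loglik + 0.5 * w * visual_loglik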

"The more pixels you can get for the mouth region," he
said, "the better information you'll have."

The goal of the system is always to do better than when
relying on an audio or video stream alone. "At worst, it is
as good as audio," Dr. Neti said. "At best, it is much
better."

At Intel, too, researchers have developed software for
combined audiovisual analysis of speech and released the
software for public use as part of the company's Open
Source Computer Vision Library, said Ara V. Nefian, a
senior Intel researcher who led the project. "We extract
visual features and then acoustic features, and combine
them using a model that analyzes them jointly," he said. In
tests, the system could identify four out of five words in
noisy environments.

"The results were as good for Chinese as for English," Dr.
Nefian added, suggesting that the system could be
introduced elsewhere.

Aggelos Katsaggelos, a professor of electrical and computer
engineering at Northwestern University in Evanston, Ill.,
is also developing an audiovisual speech recognition
system. He said that a future application might be improved
security, using such a system, for instance, to determine
whether recent videos that have surfaced indeed showed
Saddam Hussein himself or an impostor. "In principle, if
one can use both video and audio analysis one can have a
better accuracy in identifying people," he said.

Iain Matthews, a research scientist at Carnegie Mellon
University's Robotics Institute who works mainly on face
tracking and modeling, said that audiovisual speech
recognition was a logical step. "Psychology showed this 50
years ago," he said. "If you can see a person speaking, you
can understand that person better."

http://www.nytimes.com/2003/09/11/technology/circuits/11next.html?ex=1064384722&ei=1&en=cbbced2317c66237


Copyright 2003 The New York Times Company

-----------------------
September online courses on accessible information technology:
Barrier-free Information Technology http://easi.cc/workshops/adaptit.htm
Advanced Barrier-free Web Design http://easi.cc/workshops/advwbsyl.htm
LD and Information Technology http://easi.cc/workshops/ld.htm
EASI Home Page http://www.rit.edu/~easi
Courses and Clinics http://easi.cc/workshop.htm
To sign off this list
send e-mail to [log in to unmask] saying
signoff easi
