EASI Archives

Equal Access to Software & Information: (distribution list)

EASI@LISTSERV.ICORS.ORG

Subject:
From: Catherine Alfieri <[log in to unmask]>
Reply-To: EASI: Equal Access to Software & Information
Date: Thu, 9 Aug 2001 23:08:11 -0400
Content-Type: text/plain
Parts/Attachments: text/plain (207 lines)
I think this has a lot of possibilities for the disability market.
 
                  -------------------------------------------------/

Behind the Technology That Can Reproduce a Voice
FLORHAM PARK, N.J. -- JUERGEN SCHROETER, an expert in speech
synthesis at AT&T Labs, has a vision of the future.

 Imagine a booth — much like an instant-photo booth at a Wal-Mart —
except this booth takes snapshots of people's voices
instead of their faces. A man could walk into the booth, read a
hundred sentences and walk out with a program on a CD-ROM that
could transform written text into lines of speech uttered in his
voice — including words he never said. People could pop those CD's
into any device that delivers voice commands, like an automobile
that offers driving directions, and their own voices would guide
them along the way.

 "You could convince your favorite voice to go into the booth and
make a recording for you," he said.

 Dr. Schroeter's dream has just come closer to reality as a result
of work done by his research team here at AT&T Labs' Florham Park
campus. The lab announced last week that it had developed a product
that could reproduce the voice of anyone who submits to 10 to 40
hours of studio-quality recording. The product, which is part of a
new suite of voice technology called Natural Voices, is designed
for voice-automation companies that want to give their systems a
distinct, familiar sound.

 The customized voices, AT&T scientists said, could even be based
on archival recordings, bringing voices back from the dead. Imagine
getting in a car, for example, and hearing James Dean mumble a
reminder to wear your seat belt.

 It remains to be seen, of course, whether the technology actually
delivers what it promises. What's more, as AT&T's competitors point
out, the technology is designed to operate from computer servers
that can process large amounts of data. It is not yet ready to run
on desktop computers, much less low-power mobile devices like cell
phones. In the next few weeks, Lernout & Hauspie, a speech
technology company, expects to be selling a more compact speech
program for the automotive industry.

 Still, the dawn of AT&T's technology is a sign that speech
synthesis is finally getting close to the industry's ultimate goal:
computer-generated speech that sounds so natural, so human, that it
is indistinguishable from that of a real person. When that level of
speech is achieved, scientists will have cleared the first of two
enormous hurdles that are impeding the way toward humanlike
interaction between computers and people. (The next hurdle, which
stands far higher, is speech recognition technology that enables
computers to understand the meanings behind human utterances.)

 Beyond last week's news, however, lies a more realistic and
perhaps more reassuring story about the quest to make a machine
that sounds like another person. The human voice is so full of
complexity, from the use of inflections and emotion to the
swallowed syllables between quickly spoken words, that replicating
the nuances of speech is anything but easy. Ask any scientist who
has spent hours under the earphones listening, over and over, to
the garbled sounds of synthetic voices.

 "It's hard to believe," said H. David Maxey, a former I.B.M.
researcher who labored over speech synthesis
throughout the 1960's, "but it took really top-notch people decades
to do this."

 It could be said that the drive for human-sounding speech
technology started far earlier than decades ago. The first speech
machine was designed in the mid-1700's by an inventor in Vienna
named Wolfgang von Kempelen. He was able to produce a few words and
short sentences by manipulating leather bellows, which sent air
through a wooden box and out through a bell-shaped piece of rubber
that acted as a mouth.

 At the 1939 World's Fair in New York City, AT&T Bell Laboratories,
a forerunner of AT&T Labs, unveiled a speech machine called the
Voder. Six women were trained to operate the contraption, which was
played like a pipe organ. When the machine said, "Good afternoon,
radio audience," it sounded like an alien speaking under water.

 In the coming years, researchers at several companies and
universities tried to improve upon the technology, said Mr. Maxey,
who is documenting the history of speech synthesis for the
Smithsonian Institution. Some spent months slicing magnetic tapes
from recordings of human voices and rearranging the tiny pieces in
an attempt to create new words. Mr. Maxey and his colleagues at
I.B.M. decided to forgo the cutting knife and attempted to generate
sounds without prerecordings.

 "The problem is that human speech varies so much that when you try
to cut it up and rearrange it in some order, the discontinuities
are just too disturbing to the ear," Mr. Maxey said. He said that
he would go to bed at night with the repetition of nonsense sounds
ringing in his head.

 "I listened to the sound `dah' over and over," he said. "I spent
thousands of hours listening to this stuff."

 Instead, the I.B.M. group created line graphs that represented the
frequencies of sounds, fed them into a desk-size scanner and then
listened to the sounds generated by a nearby cabinet of
synthesizing equipment.

 The evolution of computers in the 1960's provided another boost:
Because a computer could run through millions of mathematical
equations in minutes, sounds extracted from databases could be
matched on the fly. John Holmes, a British scientist, used such a
technique when he exhibited a synthesizer that could replicate this
sentence: "I enjoy the simple life." Lawrence R. Rabiner, vice
president of AT&T Labs, who visited Dr. Holmes at the time, said
the rendition sounded identical to words coming out of Dr. Holmes's
own mouth. But there was one catch: That one sentence took nearly a
year's worth of work.

 It was not until the 1980's that commercial products using speech
synthesis hit the market, many of which were developed from
research conducted by Dennis Klatt, a speech expert at the
Massachusetts Institute of Technology.

 But scientists found that the voices still rang with mechanized,
unnatural vibrations. They did not live up to the vision that had
been etched upon their minds in 1968 after watching HAL, the
talking computer, in the film "2001: A Space Odyssey." (Stanley
Kubrick visited AT&T Labs before making the movie and used a
version of some of its early technology to depict the voice of HAL
as he was being unplugged. From the calm, natural tones of Douglas
Rain, the actor who provided the voice of the computer, HAL's voice
degraded slowly during a rendition of "A Bicycle Built for Two,"
created by AT&T's machines.)

 Instead of relying entirely on computers to generate sounds, a
handful of scientists, including Dr. Schroeter, continued to work
on snippets of prerecorded speech. Their challenge was to figure
out how to chop up the recordings so that they could be
reassembled to sound more natural. ATR, a Japanese company, was one
of the companies that tackled the problem. It created a massive
database that contained thousands of variations of sounds. In 1996,
AT&T licensed ATR's technology, which provided the basis for
today's product. "They saved us from doing years of research," Dr.
Schroeter said of the scientists at ATR.

 Alistair Conkie, a speech researcher at AT&T, made the next leap.
A thin, friendly engineer with a wild mane of gray hair, Dr. Conkie
advocated slicing the sounds into pieces that were half the size of
phonemes, which are families of speech sounds that make up
language. (The vowel "a," for example, is a phoneme.) By employing
half phonemes, more natural-sounding words could be constructed.
Dr. Conkie's work — paired with linguistic analysis by Ann Syrdal
and programming development by Mark Beutnagel, another AT&T
scientist — resulted in the database of sounds behind AT&T's new
product.
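 [Editor's note: the article stops short of describing how those
half-phoneme pieces are chosen at synthesis time. One standard way to
frame the problem, sketched below in Python purely as an illustration,
is a dynamic-programming (Viterbi) search over a database of recorded
half-phone variants, picking the sequence that minimizes a "join cost"
penalizing the audible discontinuities Mr. Maxey described. Everything
here is hypothetical: the six-entry database, the single-number
features, and the cost function are made up, not AT&T's data or code.]

from dataclasses import dataclass

@dataclass
class Unit:
    half_phone: str   # e.g. "k-left" = first half of the phoneme /k/
    features: float   # stand-in for real pitch/spectral measurements

# Hypothetical database: a few recorded variants per half-phone.
DATABASE = {
    "k-left":   [Unit("k-left", 0.2), Unit("k-left", 0.9)],
    "k-right":  [Unit("k-right", 0.3), Unit("k-right", 0.7)],
    "ae-left":  [Unit("ae-left", 0.4), Unit("ae-left", 0.8)],
    "ae-right": [Unit("ae-right", 0.5)],
    "t-left":   [Unit("t-left", 0.6)],
    "t-right":  [Unit("t-right", 0.6)],
}

def join_cost(a, b):
    """Penalize acoustic discontinuity between consecutive units."""
    return abs(a.features - b.features)

def select_units(targets):
    """Choose one variant per target half-phone so the total join
    cost across the whole sequence is minimal (Viterbi search)."""
    cands = [DATABASE[t] for t in targets]
    cost = [0.0] * len(cands[0])  # best cost ending at each candidate
    back = []                     # backpointers, one list per step
    for i in range(1, len(cands)):
        new_cost, ptrs = [], []
        for unit in cands[i]:
            scores = [cost[j] + join_cost(prev, unit)
                      for j, prev in enumerate(cands[i - 1])]
            best = min(range(len(scores)), key=scores.__getitem__)
            new_cost.append(scores[best])
            ptrs.append(best)
        cost = new_cost
        back.append(ptrs)
    # Trace the cheapest path back to the start.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    path.reverse()
    return [cands[i][j] for i, j in enumerate(path)]

# Spell the word "cat" as left/right half-phones and pick units.
chosen = select_units(["k-left", "k-right", "ae-left",
                       "ae-right", "t-left", "t-right"])
print([(u.half_phone, u.features) for u in chosen])

 A real system would also weigh a "target cost" (how well each variant
matches the desired pitch and stress), but the join cost alone shows
why a database holding thousands of variants of each sound beats a
single recording of each.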

 Here at the lab last week, Dr. Conkie played a snippet of that
database — a mush of sounds from "a" to "zh" (pronounced "juh").
The room filled with what sounded like the sonorous moans of a
long-winded whale. But it made every scientist there smile. After
years of painstaking research, they said, they have finally created
a well-labeled database of sounds and found the right algorithms
to piece them together on the fly. Is the voice booth next?

 "We're not there yet," Dr. Schroeter said. But his boss, Dr.
Rabiner, is confident that some version of Dr. Schroeter's dream
will arrive someday as the time required to replicate a human voice
narrows further.

 "When we started, it took a year," Dr. Rabiner said. "Now we have
it down to a month."

 

http://www.nytimes.com/2001/08/09/technology/circuits/09VOIC.html?ex=998361896&ei=1&en=fb61c1db51be23be



Copyright 2001 The New York Times Company
