VICUG-L Archives

Visually Impaired Computer Users' Group List

VICUG-L@LISTSERV.ICORS.ORG

Subject:
From: "Kennedy, Bud" <[log in to unmask]>
Reply-To: Kennedy, Bud
Date: Thu, 11 May 2000 08:43:30 -0400
Content-Type: text/plain
Parts/Attachments: text/plain (150 lines)
I apologize if I posted this before.  I just reread the article and it
seems interesting.

PCs Get Ready To Speak -- and Listen
By Fred Langa, Byte.com
        Feb 22, 2000 (1:43 PM)
        URL: http://www.byte.com/column/BYT20000222S0002
        The pieces have fallen into place: Soon, instead of using keyboards
and mice, you'll be able to interact with your PC via human-like software
robots that speak and listen. It's been the Holy Grail of UI design for
years -- computers that can listen to your spoken commands, interpret what
you want, and communicate back to you via natural-sounding synthetic human
speech.
        Why? It's not that keyboards and mice are awful (they're not), but
they are limited and limiting. They're useless if your hands are full, for
example; you have to put something down before you can use the keyboard or
mouse. Keyboards and mice also are of little help to people whose
physical problems limit use of their fingers, hands, wrists, or arms.
        And it's not that video screens are awful for displaying what your
PC is doing. But to use a screen well, you do have to be seated in front of
it, in a controlled-lighting environment, and your attention must be focused
on the screen itself. Screens can be useless for conveying information at a
distance or out of a direct line of sight or for people with no or limited
vision.
        But today the pieces are falling into place, and soon -- perhaps as
soon as this April -- we'll begin to see new hybrid technologies that
reduce our dependence on screens, keyboards, and mice. Instead, you'll be
able to use a commercially available human-like software robot.
        Let's look at the pieces.
        Audio output alone isn't that hard. With a sound chip and a modest
collection of phonemes and pronunciation rules, a PC can produce a
serviceable simulation of a human voice. In fact, a decade ago, Creative
Labs used to give away a text-to-speech (TTS) software app with every
Sound Blaster card it sold.
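        To make the idea concrete, here's a minimal sketch of that approach
in Python: a tiny rule table maps words to phoneme names, and prerecorded
phoneme clips are concatenated into one utterance. The pronunciation table
and the phonemes/ directory of WAV clips are hypothetical stand-ins for
illustration, not any particular product's data.

    import wave

    # Hypothetical pronunciation rules: each word maps to a list of
    # phoneme names. Real systems use letter-to-sound rules plus an
    # exceptions dictionary; this table is an invented stand-in.
    PRONUNCIATIONS = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def synthesize(text, out_path="speech.wav"):
        """Concatenate prerecorded clips (phonemes/HH.wav, etc.) into
        a single utterance. Assumes all clips share one audio format."""
        frames, params = [], None
        for word in text.lower().split():
            for phoneme in PRONUNCIATIONS.get(word, []):
                with wave.open(f"phonemes/{phoneme}.wav", "rb") as clip:
                    if params is None:
                        params = clip.getparams()
                    frames.append(clip.readframes(clip.getnframes()))
        if params is None:
            raise ValueError("no known words in input")
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    synthesize("hello world")

Simple concatenation like this is also exactly what gives such voices their
choppy, non-human pacing -- the weakness described next.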
        But while it's easy to produce basically intelligible synthetic
human speech, it's very hard to make it sound completely natural. Even very
sophisticated synthetic voices produced by top-dollar commercial-grade
systems often contain unnatural lilts, nuances, and pacing that, at best,
make them sound foreign. (To American ears, the net effect is often vaguely
reminiscent of a native-born Swede speaking English. I have no idea how the
voice sounds to native-born Swedes!) Take, for example, the synthetic voice
in Pink Floyd's "Keep Talking" (from the Division Bell CD): The voice is
clear and easy to understand, but still sounds a little like a Swede
trapped at the bottom of a well.
        Telephony applications were among the first to make widespread use
of synthetic speech, and the companies with deep telephony roots are among
the leaders in the general applications of artificial speech. For example,
Lucent/Bell Labs has had a text-to-speech demo site running for quite some
time at http://www.bell-labs.com/project/tts/voices.html. You can type or
paste just about anything into the page, and the Web page will speak it
back in your choice of several wholly synthetic voices: a man, a big man,
a woman, a child, a gnat, a raspy voice, a way-fast "coffee drinker," and
a strange-sounding "ridiculous" voice. Each voice also can be optimized
(via varying accent and pronunciation rules) for English, German, French,
Spanish, Mandarin, Italian and ... pig Latin! It's a software tour de
force -- although all the voices, in all their permutations, retain a
distinctly non-human quality. The people at AT&T Labs have their own demo
at http://www.research.att.com/projects/tts/. It produces output that
(while still obviously artificial) sounds somewhat better to my ears. The
demo page lacks the many language and speaking-style choices of the Lucent
page, however.
        Those sites are just the beginning: the Linguistic Data Consortium
at the University of Pennsylvania maintains a list of text-to-speech sites.
It currently lists 20 heavy-duty text-to-speech sites located around the
world, operating in 13 languages
(http://morph.ldc.upenn.edu/cgi-bin/ltts/list#S), and it also lets you
test-hear the outputs of the various TTS apps.
        There also are lower-end TTS products available, such as
ReadPlease, which even has a completely free version available at
http://www.readplease.com/. Many commercial applications, such as Dragon
Systems' NaturallySpeaking and IBM's ViaVoice, also come bundled with
decent text-to-speech applications.
        So TTS in itself is quite common, although the quality of the voice
is also commonly, well, non-human. So what about something more realistic?
        Ananova
        Meet Ananova, a new and specialized text-to-speech software robot
that also uses real-time image rendering to create an animated female
"virtual newscaster" who is "...28 years old, 5ft 8ins tall, with a
pleasant, quietly intelligent manner that makes people feel relaxed when
they engage with her." Starting perhaps as soon as this April, subscribers
to a British company's Web news and information service will use a
customized Ananova as their front end to access the information on the
system: an animated software agent that will help personalize and -- in a
manner of speaking -- humanize the service's content.
        But TTS processing -- even dressed up in the body of a virtual
Gen-X female -- is still just TTS processing. What about interactivity? If
you're past a certain age, you'll remember "Eliza," a simple yet effective
program that you could chat with via typed exchanges using plain English.
It was actually quite good at simulating the comments of a mediocre
nondirective psychotherapist; Eliza would ask questions based on simple
keywords it parsed from your typed input. It was good enough, in fact, that
it passed the Turing Test for at least some nonperceptive souls who never
realized it wasn't a human being typing the responses they saw. Although
Eliza's direct descendants still abound on the Web, some software has taken
the concept of keyword-based responses to new heights. For example, the
Verbots ("Verbal Software Robots") at http://www.vperson.com/index2.html
look superficially like Ananova, but rather than reading canned news feeds,
they can actually conduct an interactive conversation of sorts. What's
more, each Verbot (there are several) has a distinct personality, so the
responses you get from one Verbot will differ from those you might get
from another.
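        Eliza's keyword trick is simple enough to sketch in a few lines of
Python. The rules below are illustrative, in the spirit of the original
rather than Weizenbaum's actual script (the real Eliza also swapped
pronouns, turning "my" into "your" in its replies):

    import re

    # Illustrative keyword rules: match a keyword pattern, echo part
    # of the user's own input back as a question.
    RULES = [
        (re.compile(r"\bi am (.+)", re.I), "Why do you say you are {0}?"),
        (re.compile(r"\bi feel (.+)", re.I), "How long have you felt {0}?"),
        (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
    ]
    DEFAULT = "Please go on."

    def respond(line):
        for pattern, template in RULES:
            match = pattern.search(line)
            if match:
                return template.format(*match.groups())
        return DEFAULT

    print(respond("I am worried about my computer"))
    # -> Why do you say you are worried about my computer?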
        So let's see -- personality, good looks, live animation, and
high-accuracy text-to-speech translation. What's missing?
        Input.
        I've already mentioned NaturallySpeaking and ViaVoice, the top two
natural-language processing applications. Full natural-language processing
is still computationally daunting: A simple phrase such as "Don't you want
to talk to your PC?" spoken naturally, as "continuous speech," comes out
something like "Doanchawannatawktuyapeecee?" It takes a lot of processing
power and fairly extensive user-specific training to let a PC figure out
the word breaks and the words with acceptable accuracy.
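        The word-break half of the problem, at least, can be demonstrated
in isolation. Here's a Python sketch that recovers word boundaries from an
unbroken letter string using a fixed dictionary -- a toy version of one
step a recognizer performs after it has guessed the letter sequence (the
small dictionary is invented for the example):

    # Toy word-break recovery over a fixed dictionary.
    DICTIONARY = {"don't", "you", "want", "to", "talk", "your", "pc"}

    def segment(text, words=DICTIONARY):
        # breaks[i] holds a valid segmentation of text[:i], or None.
        breaks = [[]] + [None] * len(text)
        for end in range(1, len(text) + 1):
            for start in range(end):
                if breaks[start] is not None and text[start:end] in words:
                    breaks[end] = breaks[start] + [text[start:end]]
                    break
        return breaks[len(text)]

    print(segment("don'tyouwanttotalktoyourpc"))
    # -> ["don't", 'you', 'want', 'to', 'talk', 'to', 'your', 'pc']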
        But fixed-vocabulary speech recognition -- where the number of
possible utterances is modest and finite -- is much more doable. It's
possible that
you've already used a speaker-independent, fixed-vocabulary speech
recognition system. Many phone companies have a form of automated directory
assistance, where you speak the name of the town and listing you want:
Software tries to parse and answer your request, calling for a human
attendant only as a last resort.
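        The matching stage is easy to caricature in code: given a noisy
transcription of what the caller said, pick the closest entry from the
small fixed vocabulary. Real systems compare acoustic models rather than
text strings, so this Python sketch (with an invented town list) only
illustrates why a modest, finite vocabulary makes the problem tractable:

    import difflib

    # With a small fixed vocabulary, recognition reduces to picking
    # the closest allowed utterance. The town list is invented.
    TOWNS = ["albany", "boston", "chicago", "denver", "houston"]

    def match_town(heard):
        """Return the closest vocabulary entry, or None if nothing
        is close enough to trust."""
        hits = difflib.get_close_matches(heard.lower(), TOWNS,
                                         n=1, cutoff=0.6)
        return hits[0] if hits else None

    print(match_town("bostin"))  # -> boston
    print(match_town("zzzzzz"))  # -> None: hand off to a human attendant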
        And on PCs, software such as Microsoft Agent lets applications
developers and Web designers add interactive sprites that can recognize
limited subsets of speaker-independent utterances, and respond appropriately
via a built-in TTS engine. (See
http://winweb.winmag.com/people/mheller/agent.htm and
http://www.sls.lcs.mit.edu/sls/whatwedo/applications/jupiter.html.) The
Agents also can interact programmatically with any running application, or,
in theory, any data set. If you think about that for a moment, I'm sure
you'll see the potential.
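        As an illustration, driving an Agent character takes only a few
lines. This sketch assumes the MS Agent control and its "Merlin" character
are installed, and uses Python with the pywin32 package to reach the COM
interface (the documentation of the day showed the same calls in
VBScript):

    import win32com.client

    # Attach to the MS Agent server and load a character.
    agent = win32com.client.Dispatch("Agent.Control.2")
    agent.Connected = True
    agent.Characters.Load("Merlin", "merlin.acs")
    merlin = agent.Characters("Merlin")

    merlin.Show()
    merlin.Speak("Don't you want to talk to your PC?")  # built-in TTS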
        And there's one more piece: VoiceXML -- already part of a W3C
standard -- is catching on in a variety of handheld and palmtop devices
(such as Web phones) that can speak parts of any appropriately coded
website.
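        "Appropriately coded" means marking up the dialog in VoiceXML
itself. A minimal sketch of a form that prompts for a city and listens for
one of three words might look like this (the prompt and grammar are
invented, and grammar formats vary across VoiceXML platforms):

    <?xml version="1.0"?>
    <vxml version="1.0">
      <form id="weather">
        <field name="city">
          <prompt>Say the name of a city.</prompt>
          <grammar type="application/x-gsl">
            [ boston denver chicago ]
          </grammar>
          <filled>
            <prompt>Getting the forecast for <value expr="city"/>.</prompt>
          </filled>
        </field>
      </form>
    </vxml>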
        Put it together and you have all the pieces for the beginnings of
something impressive: A new way of interacting with your PC via a human-like
character that's animated in real time. The first instances will no doubt be
relatively simple guides and news readers.  Second-generation characters
will accept limited voice input, along the lines of MS Agent software.
        And from there the Holy Grail is just a short step away: Controlling
a PC by simply talking and listening as easily and as naturally as you do
with a fellow human.
        You can contact Fred at [log in to unmask] or via his website at
http://www.langa.com.
For more of Fred's columns, visit the Monitor Index Page.
        Copyright 1998 CMP Media Inc.


VICUG-L is the Visually Impaired Computer User Group List.
To join or leave the list, send a message to
[log in to unmask]  In the body of the message, simply type
"subscribe vicug-l" or "unsubscribe vicug-l" without the quotations.
 VICUG-L is archived on the World Wide Web at
http://maelstrom.stjohns.edu/archives/vicug-l.html

