The Future of Speech Technologies
prostoalex writes "PC Magazine is running an interview with two of the research leaders in IBM's speech recognition group, Dr. David Nahamoo, manager of Human Language Technologies, and Dr. Roberto Sicconi, manager of Multimodal Conversational Solutions. They mainly discuss the status quo of speech technologies, which prototypes exist in IBM Labs today, and where the industry is headed." From the article: "There has to be a good reason to use speech, maybe you're hands are full [like in the case of driving a car]. ... Speech has to be important enough to justify the adoption. I'd like to go back to one of your original questions. You were saying, 'What's wrong with speech recognition today?' One of the things I see missing is feedback. In most cases, conversations are one-way. When you talk to a device, it's like talking to a 1 or 2 year old child. He can't tell you what's wrong, and you just wait for the time when he can tell you what he wants or what he needs."
I have a solution to the "one-way" communication problem.
More popups.
Audio popups!
Heads-up display popups!
Holy blackberries! Get me my patent attorney!
"I'm sorry, Dave. I'm afraid I can't do that"
mast and the stand can't aches.
(the future of speech technology must understand context)
I've been waiting for years for speech recognition technology to get to an acceptable standard, and over that time I've used a couple. The one I got lately (dragonsoft I think) was OK, but they need to come quite a bit further before I'll be adopting all the way.
I'm looking forward to when I can say "computer, open openoffice for me mate" and it'll go "sure"... That'll be sweet.
*''I can't believe it's not a hyperlink.''
"Hey, it looks like you're trying to write a letter! Let me..."
*BANG*
*monitor explodes*
Me (covered in monitor pieces): So, Dirty Harry was right after all! A bullet from a .357 does blow Clippy's head "clean off"!
What's wrong with speech recognition today?
I took a brief poll, and nobody seems to have a problem:
Bruce: I sure like being inside this fancy computer.
Vicki: Isn't it nice to have a computer that will talk to you?
Agnes: Isn't it nice to have a computer that will talk to you?
Kathy: Isn't it nice to have a computer that will talk to you?
Except the trinoids, who complained:
We can not communicate with these carbon units.
I wasn't sure which Carbon they were talking about.
"In this 10-year time frame, I believe that we'll not only be using the keyboard and the mouse to interact, but during that time we will have perfected speech recognition and speech output well enough that those will become a standard part of the interface."
"There has to be a good reason to use speech, maybe you're hands are full"
That's 'your', not 'you're'
When can online 'journalists' stop making mistakes like this?
I'm a linguist, and it seems to me that Speech Recognition would be incredibly, incredibly useful in the research that's going on right now into Language Acquisition.
You see, the problem right now is that there's really not much data that's in the public domain for linguists/psychologists/what-have-you to study, because it's incredibly, incredibly laborious to do longitudinal studies of children's utterances, or of input to the child. People spend hours and hours and hours transcribing 20 minutes of tape. They're understandably reticent to just share their data out of the goodness of their hearts. Even when they do, it's never a large sampling of children-and-their-interlocutors from-birth-to-age-X, it's usually just one child and maybe his or her parents from age 8 months to 3 years.
So we have arguments about whether or not kids hear certain forms of input (Have you used passive voice with your child recently? Where's your child going to learn subjacency?) that go back and forth between psychologists and linguists, and people perform corpus studies on 3 children and feel that that's representative -- never mind the fact that these three kids were all harvested from the MIT daycare centre, and were the children of grad students or faculty members, and thus may not be representative of the population at large.
Speech recognition would make it much, much easier to amass large corpora of data for larger samples of the population. It'd make it much more likely for people to share their data. And, what's more, it'd likely be possible to have a phonetic and syntactic-word-stub (for lack of a better word) transcription made from the same recording. We'd have a better idea of how the input determines how language is acquired by children, and what sorts of stages children go through.
Scansoft, who earlier all but cornered the market for Optical Character Recognition (OCR) technology, did the same with speech recognition by acquiring the largest players in this space, SpeechWorks and Nuance. Scansoft changed their name to Nuance as a part of that last acquisition.
IBM, meanwhile, has been struggling to find a market for their "Superhuman" (sneer) speech reco technology. A few years ago, they sold distribution of their retail desktop product, ViaVoice, to (wait for it) Scansoft. Their commercial product was RS/6000-AIX-only until a couple of years ago, when they ported it to more platforms, including Windows and Linux, and integrated it more tightly with their Rational and WebSphere marketing platforms.
The current enterprise product sounds really sexy, at least for Rational-WebSphere shops. You can develop your WebSphere VXML application in Eclipse and leverage all those groovy WebSphere services you've built. No (or not much) special skill required!
The problem is that their target market is Telecom Managers, who face a choice between IBM, with a few hundred ports installed, and Nuance (-ScanSoft-SpeechWorks), with tens- or hundreds-of-thousands of installed speech reco ports. Telecom Managers live in a world where their clients expect six-sigma/five-nines reliability. This is a hard sell to make.
The question is, how long can IBM keep pouring money into speech R&D and product development in the face of dismal sales? Some in the industry expect the answer is, "Not too much longer." And that, of course, makes nervous enterprise buyers even more nervous and less likely to buy.
personally, i can't wait till they take speech recognition and couple it with natural language processing as a standard part of the desktop interface. it should be quite feasible now that we're seeing affordable 64-bit computing with fast memory and bus speeds. imagine excel with a speech-recognition interface, so instead of typing and filling formulae you would just tell it to "sum the row labeled timing, but only include values greater than 10". ok, back to work...
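A rough sketch of the easy half of that idea, in Python, assuming the recognizer has already turned the audio into text. The `COMMAND` pattern, the `run_command` helper, and the sample `sheet` are all made up for illustration. It handles exactly one phrasing, and coping with every other way of saying the same thing is the hard part that this skips.

```python
import re

# Hypothetical pattern for one fixed phrasing; real natural language
# processing would have to cope with far more variation than this.
COMMAND = re.compile(
    r"sum the row labeled (?P<label>\w+), "
    r"but only include values greater than (?P<threshold>\d+(?:\.\d+)?)"
)


def run_command(text, rows):
    """Apply a recognized 'sum the row ...' command to a dict of rows."""
    match = COMMAND.fullmatch(text.strip().lower())
    if match is None:
        raise ValueError("command not understood: " + text)
    values = rows[match.group("label")]
    threshold = float(match.group("threshold"))
    # Keep only the values strictly greater than the spoken threshold.
    return sum(v for v in values if v > threshold)


# Made-up spreadsheet data, keyed by row label.
sheet = {"timing": [4, 12, 10, 25.5], "cost": [1, 2, 3]}
total = run_command(
    "sum the row labeled timing, but only include values greater than 10",
    sheet,
)
print(total)
```

Anything the pattern does not match raises `ValueError`, which is roughly where today's one-way interfaces leave you.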
An old-timer with old-timey ideas.
Robyn Peterson has the same problem as the speech recognition systems he's reporting on -- he can't hear the difference between "you're" and "your".
Apple has had speech technology for years!
PUNK!
In Soviet Russia, speech recognizes you!
Q: Why did the chicken cross the road?
A: To get to the other side!
Neither can I, so I have to cheat and use these things called "context" and "rules of English grammar".
A few years ago my wife was thinking about studying to become a court reporter. The training is very demanding, and I heard the dropout rate is about 95%, but the pay is good if not great.
In any case, I warned her about the potential for voice recognition technology to render court reporters obsolete. It probably won't happen, but the mere prospect tipped her in the direction of foregoing the opportunity. Was that a mistake?
The same concern applies also to medical transcription.
I watch Brit Hume on Fox News
Try TellMe. Call 1-800-555-TELL. It's a voice portal. Buy movie tickets. Get driving directions. News, weather, stock quotes, and sports. All without looking at the phone. So what's the problem?
This has nothing to do with voice recognition.
IBM wants money from every business for its patent portfolio, but nobody knows what they've invented in the software business; it just seems like they have a lot of failed products.
So today's PR is Voice Recognition, yesterday's was explaining how Linux memory management works (as though it came from them), and tomorrow's will be something different.
And when they lobby for software patents, they will try to look like the good guy, an inventive company, and not the biggest technology leech of the lot.
...the point of our multimodal work is that you can have a two way dialog with the device, as well as have visual feedback to the interaction. See http://ibm.com/pvc/multimodal for some examples.
My name is Dr. Sbaitso. I am here to help you. Say whatever is in your mind freely, our conversation will be kept in strict confidence. Memory contents will be wiped off after you leave. So, tell me about your problems.
You know this technology will be a big hit in the porn industry when the big man of the area says
"There has to be a good reason to use speech, maybe your hands are full"
Now, what if the mouth is full too? Ventriloquism?
One great thing about keyboards and typing is that it's relatively private. Like phone menus. I hate when they ask me to speak my choice or answer a question or recite my account number. Just let me freakin' type.
Babblin' all over the place is dumb.
Instead of speech recognition let's work on better speech synthesis. Here we are in 2006 and the average synthesized voice sounds hardly better than my freakin' Phasor card I had for my Apple // in 1988.
Doctors in Finland are starting to use speech recognition to update patient records. I think it is in testing at the moment; check the following link for details.
http://www.tietoenator.com/default.asp?path=1;93;16080;163;9862
XP Pro (W/AT&T voices, Office language Bar W/Word, firefox W/Foxy Voice) and OSX are a bit more polished. But I had a similar voice recognition/TTS setup in 1993. And what I concluded was that it is far simpler to interact physically (double click) with a GUI than to tell the computer to double click. What is needed is a different type of interface for speech to take off mainstream. However it is sad that Windows will not read dialog boxes. And that's a pretty obvious useful feature that the Mac has enjoyed since system 7 or 8! Windows is the norm with the largest desktop penetration. And the norm blows in this case. Just like the m-i-c-r-o-s-oft s-a-m voice. There were better TTS voices available to System 6.0.8! Ouch! This is one of the few areas that the Mac people truly and justly get to laugh at the PC.
It's Daleks all the way down.
... and then they built the supercollider.
I'm convinced speech technologies have a fantastic future when they are used for improving human communications, like providing an electronic Babel fish. However, it looks like most are concentrating on using speech as a way to interact with machines.
Which is so terribly inefficient and cumbersome. You really don't want to spend the time to socially interact with your coffee machine at 7am.
Unless it's able to go to the shop, put in exactly the right amount of coffee and is able to turn itself to on once it hears you stumbling out of bed. It's next to useless if the only added value is to switch itself to on after you grunted "on" to it.
They have been promising good speech recognition software for years! I'm still waiting...
Good speech recognition would be great for searching audio. We could index webcastings, not only text. It would also be great for reporting meetings and conferences.
Rethinking email
the company keeps changing, but what was once scansoft (dragon dictate) had a bunch of really big patents. it's my understanding that they did what any true capitalist should do once they gain complete monopoly over something; they sat on it and milked the big fat tit they'd engineered themselves. and that's what they're doing today. just think of the god damned margins on something like that...
and that's why speech recognition 2006 is the exact same as speech recognition 1997.
FUCK YOU CAPITALISM. FUCK YOU.
god, are you seriously spamming ads for an online prostitute service on /.? I sure hope /. captured your IP and stored it in the database that holds the post you made so they can ban your spammin' ass from posting... you're lucky they probably won't find it, since they aren't likely to go read every post for shit like that...
Bring on the system that learns language in a similar way that a human does... of course it would come out of the box with a reasonable starting point. Then the ultimate backend would be a HAL-like system (2010 not 2001), hopefully not a Skynet-like, Borg, VGER, or the Trapper Keeper from South Park. VGER wouldn't be too bad once it knew about carbon-based infestations.
Anyone know of a project to simulate human life starting at a fertilized egg? That would be sweet. Once we understood all of the chemical processes that govern cell growth etc., couldn't that be simulated? In a crude way, just create a detailed physics simulation and put the right virtual ingredients in the right places. Grow it, teach it, then lock that sucker up in a space ship and point it toward the closest known rock/ice planet in hibernation mode with a decent stock of terraforming DNA and a robot body to do the manual labor on arrival and teach the babies. Bam! Instant SCI-FI novel. Probably already written though.
When they said "maybe you're hands are full" (btw, noticed the you're/your typo?) I admit that the first example that went through my mind wasn't the case of driving a car.
Many people out here must know how inconvenient it can be to type with one hand, mostly when it's the left hand. And as for the car example, what would you need speech recognition for anyways: doing word processing while driving, or driving while you have both of your arms broken?
You just got troll'd!
Something that has not been mentioned, because, evidently, no one has actually worked with it, is that it is seriously annoying to work in the proximity of someone USING speech recognition. I worked with a fellow that had speech recognition on his machine who used it for programming. YOU try working on YOUR own code when someone is droning in the background: "for left paren int i equals zero semi-colon i less than mumble mumble delete word delete word ..." ALL DAY LONG! Even with head phones on it sometimes seemed like he was asking a question and I'd remove the head phones and say "What was that?" "Nothing delete word". ARGGHHH. Leave me the heck away from people with speech recognition.
Tom.
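The dictation Tom describes amounts to a lookup from spoken phrases to symbols, plus an undo command. Here is a minimal Python sketch of that idea, assuming the recognizer hands over plain text. The `SYMBOLS` table and `transcribe` function are hypothetical and not taken from any real dictation product.

```python
# Hypothetical spoken-phrase-to-symbol table, for illustration only.
SYMBOLS = {
    "left paren": "(",
    "right paren": ")",
    "equals": "=",
    "semi-colon": ";",
    "less than": "<",
    "zero": "0",
}


def transcribe(spoken):
    """Turn a string of dictated words into space-separated code tokens."""
    words = spoken.split()
    tokens = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])  # try a two-word phrase first
        if pair == "delete word":
            if tokens:
                tokens.pop()  # undo the most recent token
            i += 2
        elif pair in SYMBOLS:
            tokens.append(SYMBOLS[pair])
            i += 2
        elif words[i] in SYMBOLS:
            tokens.append(SYMBOLS[words[i]])
            i += 1
        else:
            tokens.append(words[i])  # ordinary word, e.g. an identifier
            i += 1
    return " ".join(tokens)


print(transcribe(
    "for left paren int i equals zero semi-colon i less than "
    "mumble mumble delete word delete word"
))
```

Every token, including the corrections, has to be said out loud, which is why the neighbours hear all of it.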
speech recognition
http://www.speech.cs.cmu.edu/sphinx/
image+speech recognition
http://sourceforge.net/projects/opencvlibrary/
Desktop voice commands
http://perlbox.sourceforge.net/
Others
http://www.tldp.org/HOWTO/Speech-Recognition-HOWTO/software.html
http://www.cavs.msstate.edu/hse/ies/projects/speech/software/
Do you know about other usable open source speech solutions?
...is the *biggest* problem with speech recognition. I used it extensively for a good period of time, but it's not reliable. Someone walks into the room, some music plays, etc. Speech recognition would greatly benefit from either the computer getting an audio and visual input to determine the source, or better yet, adopting the military throat microphones that only pick up vibrations directly from the skin (even whispers).
Rich Gentlemen Hide - The Existential Comic
"There has to be a good reason to use speech, maybe you're hands are full..."
"Computer, play video!"
"Hmm, too much talk..."
"Computer, fast forward!"
"Wow, nice!"
"Computer, resume normal play!"
"Mmmm"
"Computer, play that scene again..."
(Girlfriend comes home)
"Computer, stop playback! Stop! Shut down!"
Very relaxing.
How many beans make five, anyhow ?
I en tee space main open-parenthesis i en tee space a ar gee cee comma cee aitch a ar asterisk space a ar gee vee open-bracket close-bracket close-parenthesis open-curly-bracket...
Me lost me cookie at the disco.
I see great potential for interfaces which make use of whispered speech recognition (referred to in some papers as "non-audible murmurs"). Using a contact microphone that picks up vibrations transmitted through your jawbone rather than ones travelling through the air, you can have effective speech recognition without speaking out loud. This eliminates the problem of annoying your coworkers with loud dictation in a shared office, allows passwords to actually remain secret, and has even been documented to work well in environments full of background noise.
I see this as the perfect complement to tablet-based computers... add a Bluetooth-based contact microphone to a keyboardless, touchscreen-based PC, and you finally have a computer suitable for word processing on the subway. The touchscreen means that you can use a stylus for navigation and widget manipulation, allowing the speech recognition software to be dedicated to text entry; this avoids the awkwardness of switching between command/control and continuous recognition modes.
A related benefit of whisper recognition is that it is algorithmically simple to differentiate between whispered speech and conventional vocalisations. As such, there would be no need for an additional interface to tell the computer that you are talking to it, versus pausing to speak to the person sitting next to you.
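To give a feel for why the distinction can be simple: whispering involves no vocal-fold vibration, so a whispered frame has no pitch period, while voiced speech does. The Python sketch below checks for that periodicity with a normalized autocorrelation. It is only an illustration under loud assumptions: the 8 kHz sample rate, the 0.5 threshold, and the function names are invented here, and the test signals are synthetic stand-ins (a 120 Hz sine for voicing, white noise for a whisper), not recordings from a contact microphone.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz; an assumed rate for this sketch


def periodicity(frame, min_hz=60, max_hz=400):
    """Peak normalized autocorrelation over lags in a typical pitch range."""
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    energy = sum(s * s for s in x)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(SAMPLE_RATE // max_hz, SAMPLE_RATE // min_hz + 1):
        corr = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        best = max(best, corr / energy)
    return best


def is_voiced(frame, threshold=0.5):
    """Guess that a frame is voiced if it is strongly periodic."""
    return periodicity(frame) > threshold


# Synthetic 100 ms frames: a pure tone versus seeded white noise.
rng = random.Random(0)
n = 800
voiced = [math.sin(2 * math.pi * 120 * t / SAMPLE_RATE) for t in range(n)]
whisper = [rng.gauss(0.0, 1.0) for _ in range(n)]

print(is_voiced(voiced), is_voiced(whisper))
```

Real speech is messier than a sine wave, so a usable detector would need tuning on actual recordings.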
Searching around, I've only found a few papers on the subject of whisper recognition interfaces, and neither free nor commercial software which implements it. Why isn't this a hot topic, or am I just searching under the wrong name?
But my grandest creation, as history will tell,
Was Firefrorefiddle, the Fiend of the Fell.