Is Speech Recognition Finally 'Good Enough'?
jcatcw writes "Speech recognition software is fast, but it still may not be accurate enough. Clerical jobs usually ask for 40 wpm, but speech recognition software can keep up with someone speaking at 160 wpm. In Lamont Wood's demo it did very well at too/two/to and which/witch, but will it still render 'I really admire your analysis' as 'I really admire urinalysis'? At 95% accuracy, people aren't jumping on the bandwagon. Wood's typing speed is about 60 wpm with 93% accuracy, so he found that using speech recognition was about twice as fast as typing. Those who type at hunt-and-peck speeds will experience results that are even more dramatic. There's really only one product on the US market: Dragon NaturallySpeaking from Nuance Communications. The free versions from Microsoft aren't up to the task and IBM sold ViaVoice to Nuance, where it's treated as an entry-level product."
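A rough way to sanity-check the "about twice as fast" figure is to charge each wrong word a fixed correction cost. This is only a sketch: the function name `net_wpm` and the 5-second cost per error are assumptions made for illustration, not numbers from the article.

```python
def net_wpm(raw_wpm: float, accuracy: float, fix_seconds: float) -> float:
    """Words per minute after paying a fixed correction cost per wrong word.

    fix_seconds is an assumed, illustrative cost; the article gives no such number.
    """
    minutes_per_word = 1.0 / raw_wpm                           # time to enter one word
    fix_minutes_per_word = (1.0 - accuracy) * fix_seconds / 60.0  # expected fix time per word
    return 1.0 / (minutes_per_word + fix_minutes_per_word)


if __name__ == "__main__":
    typing = net_wpm(60, 0.93, fix_seconds=5)    # figures quoted in the summary
    speech = net_wpm(160, 0.95, fix_seconds=5)
    print(f"typing: {typing:.1f} wpm, speech: {speech:.1f} wpm, ratio: {speech / typing:.2f}")
```

With a larger assumed correction cost the advantage of dictation shrinks, since at these rates it produces more errors per minute than typing does.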
Is spinachry ignition rivaly gooery stuff? What the hell are you talking about?
As a foreigner it is really hard to get the pronunciation right enough.
Also command execution by others in the room is a problem.
How about listening to music or TV and having the computer interpret it?
If you mod this up, your slashdot background will turn into a beautiful sunset!
Dear aunt, let's set so double the killer delete select all.
I wonder if I use bold in my signature, people will notice my posts.
In fact, I'm using it to write this Dear aunt, let's set so double the killer delete select all
For typing up an inter-office memo in Word, most likely. But I'm a programmer, and I can barely read some perfectly fine code out loud; I can't imagine trying to enter it all with voice recognition, no matter how good it gets.
Curiosity was framed, Ignorance killed the cat.
Speech recognition, handwriting recognition, species recognition... all of these suck, and will CONTINUE to suck, until strong AI is developed.
And by that time, there will be a lot more important problems to worry about than making a computer understand Bubba Sixpack who can't type-- such as keeping the robots from taking over the planet in a bloody war.
With spending like this, exactly what are "conservatives" conserving?
I use it myself. It's wonder full. delete that. delete that. delete that. double the killer delete select all
Dear aunt, let's set so double the killer delete select all.
Well, there's spam egg sausage and spam, that's not got much spam in it.
TFA mentions that many people stop using speech recognition software because of poor accuracy. I don't think that's the major reason. I think they start using it because it's a neat idea that seems to have a lot of promise, but quickly realize there are only a few situations where it's actually helpful. The end of the article mentions rough drafts; I'd also say it might be a decent choice
For the majority of office tasks, it just isn't a good fit.
So if the "good enough" is being useful in any way whatsoever, it sounds like we're almost there.
I used to work for a company that has the words "new directions" in their name. When I told people where I worked I would make a rather long pause between the "new" and "directions" so as not to sound like I was saying something else. I wonder how this software would render it...
I'm using Dragon NaturallySpeaking. Right now, as I write this calm it, comet, post, and it sure as hacking beats typing.
Actually, I am using Dragon NaturallySpeaking right now, and it works very well. It actually works better if you speak quickly (as you normally would) and it's pretty good at inserting grammar along the way. I have bilateral tendinitis, and the software has been a godsend for me. I was even able to finish writing my book, a task that was becoming just too painful typing manually.
Oh, and you are probably wondering how long it takes to train the software? About a half an hour, and I find the accuracy at around 95%.
SEO Copywriter. Just Say ON
95 percent is pretty good, only one word in twenty. I wouldn't have a problem with a 5% error ate.
For command control of a system where we need both hands free, it's pretty good: much better than stopping and typing, clicking or pressing buttons during a repetitive manual process.
We're using an older version of Microsoft's product and it seems the microphone quality is important.
To be fair, that's a problem with the IVR coder, not the voice recognition engine.
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
Press or say one to speak with a representative in English...
One
When you hear the option you are calling about you may say it at any time. If you are calling about a billing problem, say billing. If you are calling about a technical issue, say technical. If you are calling about new service, say new customer. If you are...
Billing
I'm sorry, that is not an option. When you hear the option you are calling about you may say it at any time. If you are calling about a billing problem, say billing. If you are calling about a technical issue, say technical. If you are calling about new...
Billing!
I'm sorry, that is not an option. When you hear the option...
Billing billing billing!
I'm sorry, that is not an option. When you...
Fuck you! Give me a human! Human human human!
I'm sorry, that is not an option. When you hear the option...
Yeah, but if you put a beat to it, you've got something.
{ } . ! /
& ; ^ # -
< > @ \
{ } _ SYSTEM HALTED
"Left titty, right titty, dot bang slash.
Ampersand semicolon, caret pound dash.
Less than greater than, at back slash,
left titty, right titty, under score crash!"
* # ! ! (
~ & | )
' " . . DEL
# ^G ! ! working... done.
"Star pound bang bang, open-paren.
Tilde and pipe, close-paren.
One quote, two quote, dot dot delete,
pound bell, bang bang, process complete!"
Google's USENET archive dates it back to 1990, but it predates the 1990 post ("Stuck Shift Key Poetry") to rec.humor.funny by several years.
You haven't lived until you've seen a dozen drunken geeks trying to sing "Waka Waka", or the entirety of "Hatless Atlas", while seeing only one character at a time. Well, maybe you have, but this is Slashdot.
Instead of asking if speech recognition is "good enough", maybe we should be asking whether or not it's actually useful for anything in the first place. I mean, is it good enough... to do what?
Can you imagine being in a cubicle farm full of people talking to their computers? Or trying to talk to your computer on the bus? You have to imagine that as computers become more ubiquitous, input methods will have to adjust alongside, and I simply can't see (or hear) speech recognition doing that very well.
What is is all that is. Isn't that obvious?
I am presently a financial customer of an enterprise speech recognition product that Nuance offers. For several years now, the speech recognition software industry has been under consolidation, with Nuance buying a few different competitors and technologies. Most recently, this dance has continued with Nuance being acquired by ScanSoft, a company known for specializing in type recognition.
Nuance support is marginal at best, and through all the consolidations, understanding even within their own company of how the product works is quite lacking. We have found our own developers often times educating the Nuance support folks in various aspects of how the product is working, and then inquiring as to whether this is intended behavior or not. Crickets can often be heard finishing these types of conversations. We normally would have moved to another product under these conditions, but simply put - Nuance acquired what little was left, and now has no competition in the market. Competition is what spurs innovation, and so with the continued consolidation, it is hard to see significant advances in the technology without free help from academia.
If you think the Microsoft monopoly is bad, imagine if they absorbed Apple and somehow took over Linux leaving you with a few "choices", but all under the Microsoft moniker. The technology is very neat and the enterprise level products do some basic things quite well, but there is still some glaring room for innovation that I don't expect anytime soon under present industry conditions.
Seriously, the only things speech recognition is good for are bulk text entry and simple navigation. I imagine trying to use voice commands to operate modern software would be similar to letting my four-year-old help make pancakes: yes, it gets done, but it's so much easier and faster to just do it yourself. Imagine trying to edit a document using just voice commands. Is your WP going to be smart enough that you can tell it "find all occurrences of 'scum-sucking bottom feeders' and replace them with 'esteemed colleagues'"? Or are you going to have to say "Find. Scum hyphen sucking bottom feeders. Tab. Esteemed colleagues. Replace all."? Face it, GUIs have rendered speech recognition for command and navigation moot. Most operations you perform don't have a verbal description, or at least not one that is quicker to say than to do.
I also can't imagine it'd be that useful for actually writing things. I don't think I'm the only one who revises as they write. I think I actually write better when I write things out by hand, because it's slower so I tend to think my phrasing and sentence structure through more before I commit anything to paper. If I could suddenly type two or three times faster, I think it'd probably make my text even more incomprehensible than it usually is...
Just junk food for thought...
Yeth.
The Kruger Dunning explains most post on
Any speech recognition software worth the $ should be able to detect and translate NATO letter names: "hotel tango tango papa colon slash slash sierra lima alpha sierra hotel delta oscar tango dot org".
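The translation step itself is just a table lookup once the engine has produced the words. A minimal sketch, assuming the recognizer already outputs lowercase words separated by spaces; the names `NATO`, `SYMBOLS` and `decode_spelled` are made up for this example, and the symbol table only covers the three punctuation words used in the post.

```python
# NATO/ICAO spelling alphabet, including the official "alfa" and "juliett" spellings.
NATO = {
    "alpha": "a", "alfa": "a", "bravo": "b", "charlie": "c", "delta": "d",
    "echo": "e", "foxtrot": "f", "golf": "g", "hotel": "h", "india": "i",
    "juliet": "j", "juliett": "j", "kilo": "k", "lima": "l", "mike": "m",
    "november": "n", "oscar": "o", "papa": "p", "quebec": "q", "romeo": "r",
    "sierra": "s", "tango": "t", "uniform": "u", "victor": "v",
    "whiskey": "w", "xray": "x", "x-ray": "x", "yankee": "y", "zulu": "z",
}

# Spoken punctuation; an assumed, deliberately tiny vocabulary.
SYMBOLS = {"colon": ":", "slash": "/", "dot": "."}


def decode_spelled(spoken: str) -> str:
    """Turn a sequence of spoken letter names and symbol names into text."""
    out = []
    for word in spoken.lower().split():
        if word in NATO:
            out.append(NATO[word])
        elif word in SYMBOLS:
            out.append(SYMBOLS[word])
        else:
            out.append(word)  # unknown words (e.g. "org") pass through unchanged
    return "".join(out)


if __name__ == "__main__":
    print(decode_spelled(
        "hotel tango tango papa colon slash slash sierra lima alpha "
        "sierra hotel delta oscar tango dot org"
    ))
```

The hard part is still the recognition, not the lookup, but a small closed vocabulary like this is exactly the case where engines tend to do well.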
I'm using Dragon NaturallySpeaking 9 right now. I've been using it for several months, and I have written a dozen articles on it. I think it works fantastic, but you definitely have to learn how to write all over again. Out of the box it trains extremely quickly, if you do not want to train it at all you can just start talking and it will eventually catch up with you. (Note it caught catch up and not ketchup) I started using it as a preventative means of avoiding repetitive stress injuries. I cannot use it to code, however I can definitely use it for my writing. Using Dragon NaturallySpeaking, I can easily push out five to 15,000 words a day. (notice it used the word five and then a number) Ultimately it provides you very accurate writing. It's almost impossible to have a spelling error, however word substitution errors are still very common. If you attempt to compare your typing accuracy versus your dictation accuracy, you will often see spelling errors in the typing and word substitution errors in the dictation. That means that when you go back and edit your own work you have to spend a good deal more time editing because you're not used to editing the type of dictation errors that you make because you have years of experience editing the normal types of spelling errors that you made. You also have to learn how to compose sentences by speaking as opposed to composing to your fingertips. This definitely exercises a different area of your brain and I'm sure you will find that you are not as good of a writer when you speak as you all are when you type. However with practice you can get up to speed dictating and you will then definitely benefit from the ability to type at 150 words a minute without breaking a sweat, stressing out your wrists, or even suffering from eyestrain. 
Dragon NaturallySpeaking definitely helps people to avoid eyestrain because you don't have to stay focused on the computer monitor while you're typing you can look around the room, or outside or anywhere. Touch diapers (s/b touch typers!) can do this also however good ergonomics dictates that you sit in positions that align your body correctly to avoid repetitive stress injuries and this includes pointing your face for words (forwards!) towards the computer screen. With Dragon NaturallySpeaking I can face in any direction I like in the program will keep up. Downside it does substitute words and on occasion it skips words entirely. I run at least a gigabyte of RAM in my computer and I was would suggest double that amount. Dragon NaturallySpeaking is a bit of a resource hog, however it's worth it and it's not as bad as Firefox. I should have purchased it years ago and definitely do not regret the purchase nor my new attempts to learn how to write all over again. I had to learn to write with pencil and paper, and then with pen and paper and then with a manual typewriter and then with an electric typewriter and then with my trs 80 and then a laptop and my treo and yada yada yada I can sure learn to do it with my voice.
Sounds like someone wants to use Vi with their speech recognition engine!
The Sphinx project is the current 'gold standard' in open source speech recognition. It can be found at the Sphinx Project at CMU.
I have used a variety of open source libraries in addition to 'rolling my own' and for general purposes Sphinx is certainly the most mature option.
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
"I am using PowerScribe, which is a radiology speech dictation system. It is fairly accurate in the doming [domain] of medical transcription, and particularly in the doming of radiology, but it not very useful for free pexed [text] speech.
For example, there [here] is a sample of the typical chest report: Hazy groundglass opacities noted with both lungs, particularly the right middle lobe as well as the left lower lobe, with no evidence of effusion, pneumothorax, or consolidation. [this is pretty much verbatim what I said].
[But here's a free text example:] However, if a Type II right a regular letter to a friend, [if I try to type a regular...] for example setting the following, [for example, saying the following...] Yesterday was a very nice state [day]. The clots [clouds are] gone, and only a little brain [rain] remains. Today it is supposed to be even warmer outside, I think elbow [I'll go] injected [and check] with the right knob. [the weather right now]"
The biggest problem with this system, particularly for medical transcription purposes, is that it only gets about 95-97% right. That means, it's wrong at least 3% of the time. Worse yet, whenever it's not sure, it just inserts random garbage! Whatever the closest match is, which is often wrong, and sometimes fundamentally changes the meaning of what I intended.
Human transcriptionists, on the other hand, will insert a blank if they're not sure, to alert the dictating physician. This fscking system has no clue when it's wrong, which makes it very dangerous in my opinion!
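What a human transcriptionist does here can be sketched in a few lines, assuming the engine exposes a per-word confidence score. That is an assumption: nothing in this thread says PowerScribe exposes one, and the function name, the 0.8 threshold and the sample scores below are all invented for illustration.

```python
from typing import Iterable, Tuple


def transcribe_with_blanks(
    hypotheses: Iterable[Tuple[str, float]],
    threshold: float = 0.8,   # assumed cut-off, would need tuning per system
    blank: str = "____",
) -> str:
    """Keep words the recognizer is confident about; flag the rest with a blank.

    hypotheses: (word, confidence) pairs with confidence in [0, 1].
    """
    return " ".join(
        word if confidence >= threshold else blank
        for word, confidence in hypotheses
    )


if __name__ == "__main__":
    # Invented scores for the weather sentence from the free-text example.
    sample = [("only", 0.97), ("a", 0.95), ("little", 0.93),
              ("brain", 0.41), ("remains", 0.90)]
    print(transcribe_with_blanks(sample))
```

A blank is obviously wrong to the reader; a confident-looking "brain" where "rain" was meant is not, which is the whole danger in a medical report.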
We were testing an edition of Dragon Naturally Speaking back in 2000, when an Asian-American woman on our team took the microphone. She had a heavy accent, and the software interpreted her words as... nothing.
She stood there, trying to get it to write something, and finally ended up repeating, "It not woking! Why it is not woking?"
We were afraid to laugh, fearing a trip to HR... we all stood there, biting the insides of our cheeks, until she gave up and left the room; then, we collapsed on the floor, literally ROTFL.
Speech recognition has been at a standstill for years now; it's been "almost there now" for well over five years. As mentioned in other posts, there has been a lot of consolidation and that has really hurt growth. Lernout & Hauspie and Dragon were constantly going back and forth a few years ago trying to get a leg up on each other. When L&H got into all of their accounting problems and shut down, that left Dragon and IBM. IBM's product went to ScanSoft and then to Nuance, where it languishes until somebody pulls the plug (for example, if you call for support on ViaVoice and mention you have XP SP2, they will tell you it is not a supported platform).
Most of the improvement in Dragon and ViaVoice over the last couple of years has been in the reduction of training required to get to the high-nineties level of accuracy (assuming a noise-cancelling mic in a quiet room and that you do not have a cold or sore throat). The advancements in training have not been matched by much improvement in recognition accuracy. A "trained" Dragon 7 recognizes speech pretty much as well as Dragon 9 (I haven't played with Dragon 10 yet).
Most of the real speech recognition advancement these days is focused on discrete word sets for voice mail trees and other interactive systems. When you are on the phone giving your credit card number, two/to/too is all the same thing. While speech recognition in its current incarnation is good for people who can't type (disabilities, carpal-tunnel, etc.) it is not a replacement for typing, and isn't any closer today than it was five years ago.
Speech recognition generally comes in two flavors: Command and Dictation. Most voice recognition engines can handle either, but the implementations are very different. Command mode is handled by providing a list of "command" words that are valid at any given point and operates much like a state machine. Dictation is a completely different beast and does a variety of things under the hood to increase accuracy.
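To make the state-machine description concrete, here is a toy sketch of command mode. The grammar, state names and function name are invented for illustration and do not correspond to any real engine's API; a real engine would restrict the acoustic search to the active word list rather than filter after the fact.

```python
from typing import Dict, List, Tuple

# For each state: the command words valid in that state, and the state each leads to.
GRAMMAR: Dict[str, Dict[str, str]] = {
    "idle":    {"browser": "browser", "mail": "mail"},
    "browser": {"search": "browser", "back": "browser", "close": "idle"},
    "mail":    {"compose": "mail", "send": "mail", "close": "idle"},
}


def run_commands(words: List[str], state: str = "idle") -> Tuple[str, List[Tuple[str, str]]]:
    """Feed recognized words through the grammar.

    Words that are not in the active list for the current state are ignored,
    which is why command mode can get away with a much smaller search space
    than dictation. Returns the final state and the accepted (state, word) pairs.
    """
    accepted = []
    for word in words:
        next_state = GRAMMAR[state].get(word)
        if next_state is None:
            continue  # not a valid command here: reject it
        accepted.append((state, word))
        state = next_state
    return state, accepted


if __name__ == "__main__":
    print(run_commands(["search", "browser", "search", "send", "close"]))
```

Because only a handful of words are valid at any point, a mediocre recognizer can still be reliable in this mode.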
"Good enough" is very vague as applied to voice recognition. For command stuff, "good enough" has been here for about 7+ years. Even MS's free engine does a great job at that.
I used Via Voice years ago and it worked pretty well. But here's the thing: Have you ever tried to dictate something? It's definitely a skill. I'm sure some people have a natural ability for it, but I certainly didn't. I tried dictating stuff and it's tough. You hit a pause mid-sentence trying to figure out how you want to phrase something and suddenly there's a period and you're beginning a new sentence. Try dictating several sentences of original material and keeping it going without pauses and "um"s and so forth and you'll see, it's not quite as easy as it seems. I suspect one of the reasons voice recognition hasn't been a hit is that people don't expect that. They try it for a few days, think, "Hell, it's easier just to type," and give up. That's why I don't use it for writing. I can type faster and more accurately than I can dictate. I'm sure if it's something I wanted to work on, I could develop the skill, but my point is, I think that's probably why a lot of people give up on it.
I honestly think that voice recognition in command mode could be really useful at speeding things up, if software were designed to take advantage of it. But it's not easy to add it as an afterthought and it adds significant work, even if it's done with forethought. It's a chicken and the egg thing. If a lot of software supported it, I think people would see a gain in productivity using whatever software they use daily. I don't mean just using voice recognition, but in combination with a mouse and keyboard. For example: "Execute Browser. google dot com. flying burrito brothers. google search". Saying that would be a pretty fast way of opening your web browser, typing "google.com" and then typing "flying burrito brothers" and then clicking the "Google Search" button. Replace "Google Search" with hitting the enter key and even faster.
But as I said, it's a chicken and the egg thing. Software doesn't support it because there's no demand and there's no demand because people haven't really experienced software that supports it.
Another issue (and I'm sure this has been mentioned by others), is background noise. I like to listen to music or watch TV while I work. Those don't mix well with voice recognition, at least not at the volumes I listen to them. Until voice recognition can get around that and recognize my voice amidst background noise and do it accurately AND software out there generally supports it, it's not going to go mainstream.
I'm still waiting for speech recognition to come to our elevators so I don't have to touch the dirty buttons.
Also so I can pretend I'm on the Enterprise.
-HobophobE
Nothing laughs forever.
I mean really, until I can say to my computer things like:
Find all mp3's that were created by Trent Reznor and pipe them to /dev/audio on the neighbors computer. What use will it be?
I can't program in it can I?
if(i_can_write_code_I_mean_speak_code_to_the_computer() == true) then
i_might_use_it_a_bit();
else
system("find /music -type f -name \"*trent*reznor*\" | xargs -t cat - | ssh hackeduser@neighborcomputer \"cat - > /dev/audio\"");
endif
But that is just me.
Do we really need this? All this is good for is the people who can't type 100 wpm with reasonable accuracy. I don't think I would be able to speak (at a normal speed) any faster than I could type. Plus, I only think so fast. So... everyone should learn to type at 100 wpm and the problem is solved. Also, who wants to hear a bunch of chattering at the library with people "typing" on the computers, versus very loud, obnoxious 100 wpm typing sounds that make the people typing at 40 wpm drop their jaws?
Dragon requires MONTHS of training (literally), and even then it will make mistakes exactly like the one you noted. The plus side is that Dragon works pretty decently under WINE, but apart from their Linux "support", it's a complete mess.
Screen readers aren't much better; they have the accuracy, but are hard to understand.
For a little geeky fun, I had Kurzweil read a few English papers to Dragon. Even after some training, Dragon still couldn't get above 80% accuracy on a computer-generated, 100% reliable voice. Now that's just sad.
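For anyone wanting to repeat this kind of test, accuracy figures like these are usually derived from word error rate: the word-level edit distance between what was read and what was transcribed, divided by the length of the reference. A minimal sketch follows; the function name is arbitrary and it does no normalization of case or punctuation, which a real comparison would need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # reference word dropped by the recognizer
                cur[j - 1] + 1,      # extra word inserted by the recognizer
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("i really admire your analysis",
                          "i really admire urinalysis")
    print(f"WER = {wer:.2f}, word accuracy = {1 - wer:.2f}")
```

Note that one misheard phrase can cost two words out of five, so short test passages swing the percentage a lot.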
Obligatory Soundbite Catchphrase