Rest In Peas — the Death of Speech Recognition
An anonymous reader writes "Speech recognition accuracy flatlined years ago. It works great for small vocabularies on your cell phone, but basically, computers still can't understand language. Prospects for AI are dimmed, and we seem to need AI for computers to make progress in this area. Time to rewrite the story of the future. From the article: 'The language universe is large, Google's trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain, which force speech recognizers into risky guesses. Even where data are lush, picking what's most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the "best" bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.'"
Buffalo buffalo Buffalo buffalo buffalo, buffalo Buffalo buffalo.
I the method which comes to make probably, them who see the work of the speech recognition software which is honest is suitable and language translation asserts! where
I certainly hope that TFA title is intentional...
I refuse to partake in a short sleep cycle process while lying in small round vegetables otherwise!
Goodnight, Sir!
> meaning often pools in a key word or two
It's true.
My own hearing is not great. I often miss just a word or two in a sentence. But they are often key words, and missing them leaves the sentence meaningless. If I counted the words I understand correctly I'd probably have a 95% success rate. But if I counted the sentences I understand correctly, I'd be around 80%. So I get by, but I tend to annoy people when I ask for repeats over one missed word.
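To put rough numbers on the gap between word-level and sentence-level accuracy, here is a minimal sketch. It assumes each word is misheard independently with the same probability, which real hearing (and real recognizers) won't obey, and the function name is made up for illustration.

```python
def sentence_accuracy(word_accuracy: float, words_per_sentence: int) -> float:
    """Chance that every word in a sentence comes through correctly,
    assuming independent per-word errors (an illustrative simplification)."""
    return word_accuracy ** words_per_sentence


# How a 95% per-word rate decays as sentences get longer.
for n in (4, 10, 20):
    print(n, round(sentence_accuracy(0.95, n), 3))
```

Under that assumption, 95% per word already drops to about 81% for four-word sentences, and it keeps falling with length.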
I hardly type anything into my HTC Incredible. Google's voice recognition, which is enabled on every textbox, works just about perfectly.
Seriously, get an Android phone, try out the speech recognition text entry, and then tell me speech recognition is dead.
That summary was written with speech recognition software?
Years ago I used ViaVoice on Warp 4, and it had a pretty decent recognition rate ..
it was even better at understanding my needs than I can get Windows 7 to understand mine by mouse commands ..
I miss those times .. when grey was a chic color for OSes
It's close enough to usually understand. But I'm not sure if it's a computer translation or a bunch of pigeons typing to translate.
Natural language processing *is* AI. And high accuracy speech recognition requires natural language processing if we expect to have accuracy rates approaching that of a human. Humans hear words partially or incorrectly all the time. We fill in the gaps from context, and we correct if the course of the conversation reveals that the original interpretation is wrong. Expecting computers to do better, when half the time the problem is the speaker, not the listener, means they need to be able to make the same corrections a human brain makes from limited information, both on the fly and after the fact.
$_ = "wftedskaebjgdpjgidbsmnjgcdwatb"; tr/a-z/oh, turtleneck Phrase Jar!/; print
It only flatlined because nobody tried to write speech recognition software in perl*.
*Disclaimer: Poster is not responsible for attempts resulting in unintended AI development and/or end of the world scenarios brought on by such an irresponsible endeavor.
Motorcycles, Robots, Space Gossip and More!
Even humans mishear speech.
"'Scuse me while I kiss this guy"
That misheard lyric is so common that there's a book about misheard lyrics with that as the title.
--
BMO
When talking to someone else, we can politely stop them and ask: "Sorry, what did you say?" As someone whose first language is not English, I tend to use these words a lot, mostly because of differences in pronunciation. Computers, on the other hand, are supposed to get everything right the first time! Why can't they, like us, ask that simple question instead of making stupid guesses??
> One estimate puts the number of possible sentences at 10^570
What a completely useless metric. It makes sense to examine the context and meaning of speech in order to accurately transcribe words, but the number of possible sentences doesn't seem to accurately describe the problem here...
"Before criticizing someone, first walk a mile in his shoes. Then, you'll be a mile away... and you'll have his shoes."
I've been using VR in Win7 for a few weeks now. I can honestly say that after a few trainings, I'm near 100% accuracy. Which is 15% better than my typing!
I doubt it is completely dead. I have yet to hear it from the researchers working on AI. I work in affective computing, so I am thinking that it is possible that the missing component could be emotion or another way to increase the understanding and ability of computers to learn. In addition, even if it is not possible to increase speech recognition capabilities in this model of computing, in another model of computing this and more would be possible. I am not believing it until I hear it from researchers who have tried most possible options for improvement.
Speech recognition mechanisms/algorithms are not entirely
the problem. What needs to back them up is called a "world
model," and, as the name implies, this can be large and open
ended. Humans being able to correct spoken/heard errors
on the fly is because of having an underlying world model.
Would that I had mod points today.
The above is a valid English sentence and a poignant example of how difficult it is to parse language without knowledge of semantics.
"What, all of us?"
"Flyin' in just a sweet place,
Never been known to fail..."
I've wondered why we can't meet computers half-way. Just design a constructed language that avoids the unsolvable problems. If operating computers by speech is truly better then learning the language would be akin to learning to type.
OTOH, if it's an attempt to simplify computing for those who don't wish to learn, well, that's an impossible task. The problem lies in the fact that such people don't give explicit commands, and even humans take quite a bit of intuition to figure out what they're implying.
Having said that, Dragon works fairly well, provided you modulate your speech.
If you want a laugh with Dragon, turn away from the screen and talk normally, then look at what it has transcribed..
http://slashdot.org/~GuyFawkes/journal
Speech recognition requires training because it relies on machine learning algorithms. Nobody has time to train their computer. I mean, even we humans need 2-3 years of such "training" in order to start recognizing words.
Intelligence is basically composed of pattern recognition, with two general categories. One) Specific pattern recognition is logic, math, etc. It requires incredibly exact matches. Yes or no. 1.0, not 1.00001. Computers are very very good at that.
Two) General pattern recognition is creativity, art appreciation, and our capacity to invent. It requires people to ignore a ton of irrelevant data and instead focus on only one aspect of identity, recognizing it despite the large amounts of irrelevant data. That tree kind of looks like a face, that falling object is like all other falling objects. Computers have always been very very BAD at this. Humans do it much much better than animals, but even a monkey is better at general pattern recognition than a computer is.
I am sure that we can make computers slightly better at speech recognition - enough to recognize all of a limited set of command words like print, attach, email, open, run. Individual programs would have to include codes for their names and specific commands. But I think it will take a true Artificial Intelligence to recognize speech as well as a human. In fact, I would make that my Turing Test. I would also add that I don't think an intelligence built using current theory could become a true Artificial Intelligence. We would need to design a computer that is a non-deterministic device - one that does not rely solely on pure mathematical logic, but is itself based on an entirely new design. No, I can't describe it - because if I could I would build one and be rich.
excitingthingstodo.blogspot.com
Futurists should really learn what the word "plateau" means. The death of any given technical progression, particularly one that deals with information processing, tends to be announced early and often, right up to the point where progress becomes meaningful again and then all of a sudden everyone saw it coming, and oh by the way where's my flying car?
Don't tell the people actually doing it. They don't know that the author of this piece says it won't work. So they keep making it work. We don't want to upset them. Ssssh. ... well translates. info here Kinda puts the knosh on this article. Speech recognition as a part of translation is a new application of the tech that is growing by leaps and bounds. 10 years ago we had to do text to text translation, now it's speech to voice. Then you have companies like Voxify,TuVox and others replacing routine call center calls with realistic voice recognition. Far from being a dead animal. It has moved from the realm of fantasy to the realm of direct application.
Speech recognition and translation is becoming a highly effective and proficient tool for the US military. You see it fit's in your iPod... and
I'm sorry, I'm too tired to be witty at the moment so this message will have to do.
I see a lot of claims, but not much evidence. If we're going to use perceptions and anecdotes as evidence, my impression is that speech recognition has always been considered vaguely stalled. In 2000, people didn't think much progress had been made since 1991 besides some commercialization of stuff academia already knew how to do. In 2010, this guy doesn't think much progress has been made since 2001 besides some commercialization of stuff academia already knew how to do. Yet I think some progress has been made over the past 20 years. There just haven't been any breakthroughs, which is maybe what he's expecting, given his vague suggestion that "AI", a pretty vague concept, is our hope.
I'm also skeptical that accuracy has flatlined, though it's possible that's true in some areas. My impression is that multi-speaker recognition, use of large corpora to improve accuracy, and use of language modeling to improve accuracy, have all improved over the past 10 years. Of course, not all improvements go everywhere: the speech recognition running in real-time on a mobile ARM processor is not using every possible state-of-the-art technique. The advance there is that you can run speech recognition in real-time on a mobile ARM processor at all, and get performance that was once only possible on pretty hefty workstations.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
> It works great for small vocabularies on your cell phone
No. It doesn't.
It works great for small vocabularies on your cell phone if you happen to live in the same neighbourhood as the developer where "everyone talks this way". For the rest of the world, attempting to talk with a nasal American twang in order to get the phone to understand you, is shit.
Blame Star Trek for making it look flawless. Speech recognition is just like fusion technology, 20 years away from properly working - just like it has been for the last 20 years.
-RANT- I can't stand voice recognition systems that don't at least give you an option to press a number. Especially when they are out of tune and pick up background noises as voice. Please, please, please - always give the option to press a number instead of having to voice everything!!
The radiology voice dictation transcription system at my former employer was horrible. Having to read the dictated reports was equally appalling considering there was a radiologist signing off on their accuracy, and they were certainly not completely accurate. The irony is that the things the system frequently had trouble with were simple words like "not" and recognizing quantities appropriately, whereas more complicated things such as "gastroschisis" would be dictated correctly.
I never understood it, but since I was not the radiologist, I didn't care either. I mostly was entertained by listening to them repeat the same stupid, simple word over and over trying to get the dictation system to behave, when it would have taken a fraction of the time to manually edit the document with a keyboard.
The eighties were like half as groovy as the seventies, but twice as cool as the nineties.
Yay Linux! Boo Microsoft!
I win! Give me all your speech recognition monies.
Wait, what do you mean you don't believe I'm an AI? ... er, I mean ... Wait, what do you mean you do not believe I am an Artificial Intelligence?
Buffalo buffalo
Likewise, Badger badgers Badger badgers badger, badger Badger badgers. (UW taxideans harassed by UW taxideans harass other UW taxideans.) Oh, and mushroom mushroom.
Didn't IBM a few years ago announce a big five-year-program to crack speech recognition? Whatever came of that?
How hard is it for a computer to understand the sentence: "Tea, Earl Grey, Hot"? That takes care of 90% of the use case scenarios right there. "Computer, initiate auto-destruct sequence" is the next 8%.
This blog post is retarded. The author is correlating a drop in internet news articles about Dragon NaturallySpeaking with a flatlining of speech recognition accuracy rate.
The Slashdot editor Soulskill is retarded for both not realizing this and for not reading the anonymously-submitted blog post (hmm no way it could have been the author) before approving it for the Slashdot front page. The guy is just out for more traffic to his rather pointless tech news commentary blog.
Decline of Slashdot, internet signal-to-noise ratio, get off my lawn, etc.
Alpha Kenny 1
And there's this nice meme from a couple years ago:
http://www.google.com/search?q=ken+lee
http://www.youtube.com/watch?v=_RgL2MKfWTo
this is the reason that millions of americans are faster with the thumb than Buddy Rich with the drumsticks... you can't see the finger move as they type 30 zeroes in a row to escape the mumblebots.
if this is supposed to be a new economy, how come they still want my old fashioned money?
They gave this information to the "public" by handing it over to the LDC? It costs $150 to obtain a non-commercial license from the LDC. This is ridiculous but I guess money is the best way to control information.
My understanding, from the people that use Dragon, is that it competes well against paying someone else to type. First, it is a couple of orders of magnitude cheaper. Second, if you pay someone to type, you still have to read and edit, and Dragon is accurate enough. Of course you have to train yourself to use the technology, but that is the same with any technology. It is naive to think that we don't make subtle and not so subtle changes in ourselves so that we can benefit from the technology.
I think speech recognition is going to expand in the future. Beyond the dictation process, there are also simple commands. I don't use the voice controls on the iPhone, but it seems to be something that people like. I have used the voice controls on my Mac. Furthermore, I can certainly imagine a time when my fingers are not so limber and I might depend on something like Dragon.
I don't see the technology so commoditized that MS includes it in the 2015 version of MS Office, but I do believe there is always room for improvement.
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Any discussion of the history of speech recognition is incomplete without a reference to Microsoft's famous Windows Vista "double the killer delete select all" botch-up: http://www.youtube.com/watch?v=klU2zt1KdUY
>> Standing on head makes smile of frown, but rest of face also upside down.
"Time flies like an arrow; fruit flies like a banana."
Is a time fly an archer or a DDR player?
PS1 music game Parappa the Rapper turns "There's a bathroom on the right" into a rap song.
I'd settle for a grammar checker. From the fine summary:
"Even where data are lush"
A good one would have saved this summary from sounding stupid.
"Ssssh"? "it fit's in your iPod"? "puts the knosh on this article"? "Far from being a dead animal. It has moved"?
Apparently your speech recognition software still needs a bit more R&D. In case you can correct it for the future, it should probably be "Shhhh", "it fits in your iPod", "puts the kibosh on this article", and "Far from being a dead animal, it has moved".
dom
There won't be any meaningful development in speech recognition (or machine translation) until context is taken seriously. Context is an inseparable part of speech.
Right now the problem being solved is audio->text. This is the wrong problem, and why the results are so lame. The real problem is audio+context->text+new context. This takes some pretty intelligent computing and not the same old probabilistic approaches.
Somehow Slashdot chose an apt fortune: "The sixth sheik's sixth sheep's sick." Let me know how your speech recognition software does on that sentence!
dom
Maybe we just need to speak binary.
A few years back I worked for an awesome company that did IVR (interactive voice response) systems.
We had voice driven interactive systems that would provide the caller with a variety of different mental health tests (we worked a lot with identifying depression, early onset dementia, Alzheimer's, and other cognitive issues).
The voice recognition wasn't perfect, but we had a review system that dealt with a "gold standard". I wrote a tool that would allow a human being to identify individual words and to label them. Then we would run a number of different voice recognition systems against the same audio chunk and compare their output to the human version. It effectively allowed us to unit test our changes to the voice recognition software.
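For anyone curious what comparing recognizer output against a human-labeled "gold standard" can look like, here is a minimal sketch using word error rate (word-level edit distance divided by the reference length). This is a common metric, not necessarily the one that tool used, and the recognizer names and transcripts below are invented for illustration.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)


# Human-labeled transcript versus two hypothetical recognizers.
gold = "one a two b three c".split()
outputs = {
    "recognizer_a": "one a two b three c".split(),
    "recognizer_b": "one a to be three see".split(),
}
for name, words in outputs.items():
    print(name, word_error_rate(gold, words))
```

Re-running a comparison like this after every configuration change is what makes it behave like a unit test: a tweak that helps one audio chunk but raises the error rate on others shows up immediately.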
Dialing in a voice recognition system is an amazing process. The number of properties, dictionaries, scripting, and sentence forming engines is mind-blowing.
Two of the hardest tests for our system were things like: Count from 1 to 20 alternating between numbers and letters as fast as you can, for example 1-A-2-B-3-C. And list every animal you can think of.
The 1-A-2-B was killer because when people speak quickly, their words merge. You literally start creating the sound of the A while the end of the 1 is still coming. It makes it extremely difficult to identify word breaks and actual words. And if you dial in a system specifically to parse that, you'll wind up with issues parsing slower sentences.
The all animals question had a similar issue: people would slur their words together, and the dictionary was huge. It was even more challenging when one of the studies was nationwide. We had to deal with phonetic spellings from north east coast and southern states accents. What was even worse was that there were no sentences. We couldn't count on predictive dictionary work to identify the most likely word out of those that would match the phonetics.
That said, getting voice recognition to work on pre-scripted commands and sentences was pretty easy.
And I can only imagine the process has been improving in the years since. That said, we were looking into SMS based options, not out of a dislike of IVR, but because our usage studies with children were showing most of them were skipping the voice system and using the key pad anyway. So why bother with IVR if the study's target demographic was the youth?
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
Once we get out of the eighties, the nineties are gonna make the sixties look like the fifties.
When you have a minute, go to YouTube and bring up an old Star Trek episode (not the CBS ones with very loud commercials).
Then turn on Google captions. More fun than a barrel of Rigelian monkeys!
About every third sentence gets a close or exact rendering, but oh, the other two! I should sue them for laugh-muscle strains.
Long ago - decades, before Bill Gates was invented, a lot of research went into what would be required for actual voice recognition.
A counterexample was given, about an engineering marvel (of the time) that would recognise when someone said the word "watermelon". For a long time, people in the industry assumed that the path to voice recognition consisted of building more and better watermelon boxes.
Several authors, including Alan Turing himself, argued that actual voice recognition could never be accomplished with a large array of watermelon boxes. Current VR software divides input into a series of hyperplanes, and attempts to build a best match from the classification tree.
This is the 2010 version of the watermelon box.
Real voice recognition won't be practical until the input is parsed, matched against context, and structured much akin to diagramming a sentence in those old English (or other) classes. In short, matching against a vocabulary is trying to solve an exponential problem with a (large) polynomial engine.
It won't be until the computer actually understands what is said that VR is likely to be practical in a global sense.
As a person who has been building computer systems for 35 years, it bothers me to see a huge body of research done into subjects like these ignored, because someone thinks that none of it applies to PC's.
Don't take life too seriously; it isn't permanent.
Speech software has been evolving at a steady pace, but that isn't the issue; the issue is that 90% of users out there don't use it. If you live in a loud place with kids or other noise, it will not work well. Windows 7 has built-in speech software, and how many people use it? I played with the latest Dragon speech software, and I gotta admit it's very good even without training it. I did emails with it without any issue. But as I said, speech software is more a toy than anything useful. As people said, it will probably have a good use on a cell phone rather than on a PC, since it would be an easier way to chat than using the cell phone's keypad.
Most people won't benefit from speech recognition software in any manner that is critical, or that might automate the mundane to the point that their lives might yield great benefit to mankind overall. If there's anyone out there, aside from the physically handicapped, who thinks they need speech recognition software to perform any task that isn't repetitive and is truly important for the greater good, I assert that it would be better for all if they had proteges who could learn from them and not machines that facilitate isolation.
There is also the problem of meaningful work from those who might serve as assistants, and automation for the sake of automation didn't do the Luddites any good, albeit notwithstanding the motivation to rebel against already cruel and inhumane conditions of employment.
Obligatory UF
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Errr...I believe the precise number would be 10^571.
I know it's just an imaginary example of how bad text-to-speech is... but it is realistic and disappointing.
Even an idiot like me knows what a Markov chain is. Perhaps the standard voice apps are so entrenched they're not recoding their apps to take advantage of huge leaps in memory capacity compared to when they first started selling.
Cwm, fjord-bank glyphs vext quiz
The article doesn't really make the case. There are two interesting charts, and one is BS (measuring Google News hits for Dragon). He is trying to draw a deep result from the fact that the NIST data he cites ends in 2002. What happened in the last eight years? Lots of arm-waving in the article, but no hard data.
...this would turn out to be the case. I should have published a book on how little this would work. But I did have my doubts, way back in the 90's. It came down to a simple question for me; could a speech recognizer ever "get" irony. I came to the conclusion that it would be very difficult at that time. Guess I was right.
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
I certainly hope they have perfect voice recognition systems before they have the perfect nanoassembler.
Otherwise we will start seeing a lot of eleven inch pianists everywhere.
I don't know how anyone else feels about this, but I wish the use of voice recognition in company phone systems would die. Seriously, please just let me press 1 for Yes and 2 for No. And stop programming them to be conversational with phrases like, "Let me see if I have this right..."
It must have been something you assimilated. . . .
Speech recognition does *not* "work great" for cell phones. Every new phone and/or new firmware upgrade, I try again to teach my phone to understand me, and each time I get embarrassed the first time I try to use it in public. The experience is similar to William H. Macy's in Wild Hogs.
"Call mother-in-law"
"Did you say, 'Hot Mothers in Slaw'?"
"Call mother in law!"
"Did you say 'my brother's my pa'?"
"Call. Mother. In. Law."
"Did you say, 'Call Hooters'?"
"What?"
"Did you say, 'What'?"
I do not know why this is so, but speech recognition does not work reliably enough to be other than a toy in any application I've ever seen. It exists for the amusement of those watching the poor sucker trying to use it. Sometimes I imagine a bunch of programmers in Taiwan laughing their asses off.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Considering Google is now offering automatic transcription of all YouTube videos, I'd say they certainly haven't given up on speech recognition yet.
English, I would think, is a pretty daunting language for speech recognition, what with a substantial array of homophones, but I wonder if other languages fare better. Maybe Spanish or, say, Japanese would be better since (I'm guessing) there is a closer relation between the written script and the actual sound that it makes.
Once I was a four stone apology. Now I am two separate gorillas.
http://dl.dropbox.com/u/454062/IMG_0006.JPG
I have been flamed more than a few times around here for suggesting Computer Science has not got a clue what they are doing when it comes to AI. Philosophy has been at this problem and more for the better part of the last 400+ years (more like 1,000 years) in a serious way. The stock b.s. I get from the science fiction fan boys is that somehow natural language is a problem that can just be brute forced as if you were trying to figure out the password you forgot to your email account. Good luck with that.
By the way, language "recognition" by a computer is likely the easy part of the problem for AI researchers to crack. It is still not going to yield any real AI, just better cars and toasters.
Living in Chile
Shoot, read almost any GV translation for a good laugh. Though strangely every once in a while it gets almost every word right.
> Having said that, Dragon works fairly well, provided you modulate your speech.
> If you want a laugh with Dragon, turn away from the screen and talk normally, then look at what it has transcribed..
For real hilarity, try chasing the dragon: http://www.southparkstudios.com/clips/155898
The only possible interpretation of any research whatever in the 'social sciences' is: some do, some don't
What a pile of stinky! Speech recognition was 20x better than the best humans 10 years ago. The US Navy got the technology, and it's basically useless crap from Dragon and whatever still pissing around. I saw their stuff about 1993 and it was bad then, and it's still really bad now. The Berger-Liaw speech recognition system (the one the Navy got) was wicked though. The big difference between their system (with about 6 neural nodes) and the others (with maybe 10,000) is that theirs kept temporal information, while the others use a stock computer clock (oscillator). The video gives you an idea of the capabilities of this system. The US Navy got it all. I always wondered why it wasn't a common feature on computers already. So you can wring your hands and put up silly slashdot articles, but it's all a lie!
Fork 'andles
Your dear information processing seems to be the culprit with flying cars...
One that hath name thou can not otter
Better yet, turn on captions while watching a Day Job Orchestra Trek dub!!
"Slow down, Cowboy! It has been 3 years, 7 months and 26 days since you last successfully posted a comment."
I predict a triumphal return of neural nets, predicated on memristors and recent advances in large scale quantum superpositions.
The time for a new paradigm is upon us.
... have you misunderstood spoken words? I have, many times. Speech recognition is directly related to the quality of the microphones used to capture the speech as well as the voice quality and articulation of the person speaking.
It's not as easy as it sounds!
And it's been a problem that's been solved since the late nineties.
The problem giving everyone fits is continuous speech recognition which is another problem entirely. It was a sad day for most of the disability community when all the speech reg vendors abandoned their discrete speech products in favor of continuous recognition.
When I started on my Ph.D., I started out majoring in AI. One of several reasons I changed to computer architecture (CPU design, etc.) is because I just couldn't stand the broken ways that people were doing stuff. Actually computer vision stuff isn't so bad -- at least there's room for advancement. But the speech recognition state of the art is just awful. I couldn't stand the way they did much of anything in pursuit of human language understanding.
With automatic speech recognition (ASR), the first problem is the MFCCs. (Mel-frequency cepstral coefficients.) What they essentially do is take a Fourier transform of a Fourier transform of the data. This filters out not only amplitude but also frequency, leaving you only with the relative pattern of frequency. Think of this as analogous to taking a second derivative, where all you get is acceleration, leaving out position and velocity. You lose a LOT of information. Then once the MFCCs are computed, they're divided up into the top 13 (or so) dominant MFCCs, plus the first and second step-wise derivatives, giving you a 39D vector. Then the top N most common ones are tallied, and code-booked, mapping the rest to the nearest codes, leaving you with a relatively small number of codes (maybe a few hundred).
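To make the last two steps concrete, here is a rough sketch of stacking 13 coefficients with their first and second step-wise differences into a 39D vector, then mapping each vector to its nearest codebook entry. It assumes the MFCCs have already been computed elsewhere; the toy frames and the hand-picked two-entry codebook are invented for illustration (a real codebook would be learned from data), and production front ends typically use smoothed regression deltas rather than plain differences.

```python
import math


def step_differences(frames):
    """Difference between each frame and the previous one (zeros for the first)."""
    diffs = [[0.0] * len(frames[0])]
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([c - p for p, c in zip(prev, curr)])
    return diffs


def stack_features(frames):
    """Concatenate each frame with its first and second differences (13 -> 39 dims)."""
    d1 = step_differences(frames)
    d2 = step_differences(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


def quantize(vectors, codebook):
    """Replace each vector with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
        for v in vectors
    ]


# Four made-up 13-coefficient frames standing in for real MFCC output.
frames = [[float(t + i) for i in range(13)] for t in range(4)]
stacked = stack_features(frames)
codebook = [stacked[0], stacked[-1]]  # hypothetical two-entry codebook
codes = quantize(stacked, codebook)
print(len(stacked[0]), codes)
```

Everything downstream sees only the code indices, which is where the "half deaf" complaint comes from: whatever the codebook cannot distinguish is gone.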
So to start with, the signal processing is half deaf, throwing away most of the information. I get why they do it, because it's speaker independent, but you completely lose some VERY valuable information, like prosodic stress, which would be very useful to help with word segmentation. Instead, they try to guess it from statistical models.
Next, they apply a hidden Markov model (HMM). Instead of inferring phones from the signal, the way they model it is as a sequence of hidden states (the phones) that cause the observations (the codes). This statistical model seems kinda backwards, although it works quite well, when trained properly. To train it, you need a lot of labeled data, where people have taken lots of speech recordings and manually labeled the phonetic segments. What is usually learned is mostly a bigram model, where what you know are the a priori probabilities of each phone label (the hidden states), and the conditional probability of each phone given each possible prior phone. Given a sequence of codes, you find the most likely sequence of phones by computing the Viterbi path through the HMM.
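The decoding step at the end of that description can be sketched as a textbook Viterbi pass. This is a toy, not any vendor's implementation: the two phone labels, the two codes, and every probability below are made up, and a real decoder would work in log probabilities to avoid underflow on long utterances.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for a list of observed codes."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the best predecessor state r for landing in state s.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical two-phone model over two codebook entries (0 and 1).
states = ["s", "iy"]
start_p = {"s": 0.6, "iy": 0.4}
trans_p = {"s": {"s": 0.7, "iy": 0.3}, "iy": {"s": 0.4, "iy": 0.6}}
emit_p = {"s": {0: 0.9, 1: 0.1}, "iy": {0: 0.2, 1: 0.8}}

phones = viterbi([0, 0, 1, 1], states, start_p, trans_p, emit_p)
print(phones)
```

The transition table is the "phone given prior phone" part and the emission table ties phones to codes, so any detail the codes have already thrown away cannot be recovered at this stage.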
Honestly, I can't complain too much about the HMM. What I do complain about is the fact that the "cutting edge" is to replace the HMM with a Markov random field (just remove the arrows from the HMM), and conditional random fields (which are Markov random fields with extra inputs).
My response to using MRFs and CRFs is "big whoop", because all you're doing is replacing the statistical model, which doesn't dramatically improve recognition performance, because they haven't fixed the underlying problem with the signal processing.
Then on top of the phone HMM, they layer ANOTHER HMM to infer words and word boundaries, based on a highly inaccurate phone sequence.
The main problem with all of this is not that the researchers are idiots. They're not. The problem is that the people with the funding are totally unwilling to fund anything really interesting or revolutionary. The existing methods "work", so the funding sources figure that we can just make incremental changes to existing technologies. Which is wrong. Unfortunately, any radically new technology would be highly experimental, with a high risk of failure, and would take a long time to develop. No one wants to fund anything that iffy. As a result, all the scientists working in this area spend their time on nothing but boring tweaks of a broken but "proven" reasonably effective technology.
So I don't blame people for the conundrum, but I see no opportunity to do anything interesting, so I just couldn't stand studying it.
If you call your bank or credit card company, or really any large company's support lines, you're likely to encounter an IVR system that basically does what you describe. Hacking down the range of responses to make it easier for the computer is great for some stuff (well, ok, it's passing in most cases, and miserable in some), but is utterly useless for a lot of applications. The range of accents and vocal quirks of the human voice is pretty amazing, so the fact that they work as well as they do is impressive.
Voice recognition systems for command and control are just one very, very small part of the overall use for the technology. There's also automated transcript generation for TV broadcasts (either for CCAP or other purposes), intelligence and defense applications (this is a very active area in NLP), even stuff like dictation systems.
Unless you're proposing that everyone begin speaking in computer comprehensible English, all the time, which is silly and utterly unrealistic.
It's just a speed bump on the way to thought recognition, which will be far more useful.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
The number of possible sentences is actually infinite, not 10^570. English is recursive, so any final count of sentences can be increased by prefacing the phrase, "Here is a list of all sentences that has now been made longer by the addition of this sentence."
What are TV stations using for this? Sometimes I turn it on just for the lulz.
This really happened:
TV weather dude: If you're indoors tonight...
Closed captions: IF URINE DOORS TONIGHT...
I would have to agree with the contents of the original posted article. The hospital at which I presently work is switching slowly from transcriptionists to Dragon as a 'cost-saving' move. Maybe it has saved money, but it's cost a lot in frustration.
It's made it a lot harder to dictate reports. In plain text, it's almost able to cope, although I've seen embarrassing mistakes slip through ('and in' once came out as 'anus'). Even when I try to 'teach' and 'train' it, it consistently records 'core needle biopsy' as 'corneal biopsy' and the letters A, B, E, and C are absolute crapshoots. Template reports, which are in vogue in my field, are a headache to dictate. Often Dragon misunderstands my command to move to another template field as a request to add in the words 'next field.' If I have the patience, and speak to the Dragon as I would to a mentally retarded but well-meaning child, I can hope for only two or three errors per report. When I'm in a tearing hurry, as I am most of the time, I type out my own reports.
The worst part is that all the transcription errors come out as correctly spelled words, so they're even harder to detect than they were before. 80% accuracy seems about right to me. Luddite as it sounds, I'd rather have a human transcribing my speech over a machine.
All of the speech recognition systems to date try to fake it - essentially all they do is match speech waveforms to their library and some do some very simple syntax checking. This is useful for situations where the vocabulary is small and the number of human speakers is also small. These systems don't work like we do and to achieve significantly better results a very difficult problem will have to be "solved" first.
Our method of communicating is both more and less than it appears. At a basic level, what we're communicating is not contained in the words we use - we use words as symbols and the listener "looks up" that symbol and applies meaning to the word. So when I say "horse" you access your knowledge of horse and that provides the meaning to the word. If you'd never seen or heard of a horse then this communication would fail.
It's shared knowledge - literally "common sense" that makes verbal communication possible. Our brains devote a lot of "processing" to this task - they have to not only recognize the word symbols, they have to cross-reference them to memory and do it in real time. We continue to make strides in increasing the amount of CPU power we can devote to problems like this one and we'll probably reach "human equivalent" processing power in our lifetimes. Even so, the machines won't be able to converse with us because they won't have the "common sense" needed to understand what the symbols mean. Without that, they can't make any kind of valid judgements about sentence structure or what meaning a particular word is using at the moment. You can buffalo Buffalo all you want but the machine has trouble with to, too, and two.
There have been some cute demonstrations made using huge rule sets that almost work - but they quickly fall apart when you try to converse with them. Even the very best of speech recognition systems suffer from not knowing anything about what's being talked about. When we can equip future machines with the knowledge of a 12 year old human they'll be able to talk with us - and we'll have solved a lot of other related problems at the same time. Until then, computer speech recognition is an AI trick - heavy on the A.
And I intentionally used a phonetic hash I threw together in the key lookup. The script produced some cool output, but didn't do quite what I wanted to do.
Then I learned about Soundex. And then, even better, Metaphone. Better still, Double Metaphone. DM's benefit is that it returns multiple keys for a processed symbol, under the assumption that the symbol might be pronounced multiple ways. It was *almost* what I wanted, except it was still more or less limited to mostly-English words. I'd like to work with IPA, but whenever I ask about a library that attempts to take text and convert it to IPA symbols, I'm reminded that different dialects will say the same words different ways (engaging the vocal cords or not, for example), and the same word may have a different meaning depending on how it's pronounced, which is also related to its context. A first-order Markov model is likely to grant some self-correcting accuracy, and while a second-order or third-order model should do a decent job, they'd represent *huge* data sets. (When I was working with a 1st-order model, and considering moving to 2nd-order, I almost convinced myself to buy an SSD to dedicate to InnoDB.)
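For reference, a minimal sketch of Soundex as a phonetic hash. This follows the commonly described American Soundex rules (keep the first letter, map consonants to digits, collapse repeats, let h and w pass through without separating equal codes, pad to four characters); it is not the hash from the script mentioned above, and the function name is just illustrative.

```python
# Consonant groups for American Soundex; vowels, h, w and y have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """Return the four-character Soundex key for word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

The collision between "Robert" and "Rupert" is the intended behavior, and also a hint at why a single key per word is too coarse: the table is built around English spelling habits and there is no room for alternate pronunciations.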
It seems obvious to me that you should be able to apply Metaphone's approach (a returned key for each possibility), and then use a markov model to refine which key has the most likely meaning in context. (Feeding it a language's dictionary with word/part-of-speech/IPA tuples would be most excellent)
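A rough sketch of that idea, under heavy assumptions: each word comes with a list of candidate keys, a first-order Markov table gives the probability of one key following another, and the sequence with the highest product wins. The ARPAbet-style keys stand in for IPA, the probabilities are invented, and `best_reading` is a hypothetical name.

```python
from itertools import product

def best_reading(candidates, bigram_p, floor=1e-6):
    """Pick the most probable key sequence under a first-order Markov model.

    candidates: one list of possible keys per word.
    bigram_p:   {(previous_key, key): probability}; unseen pairs get `floor`.
    """
    def score(seq):
        p = 1.0
        for a, b in zip(seq, seq[1:]):
            p *= bigram_p.get((a, b), floor)
        return p
    # Exhaustive search over every combination; fine for short phrases only.
    return max(product(*candidates), key=score)

# "I have read": "read" has two pronunciations, the others have one.
candidates = [["AY"], ["HH AE V"], ["R IY D", "R EH D"]]
bigram_p = {
    ("HH AE V", "R EH D"): 0.30,  # invented numbers for illustration
    ("HH AE V", "R IY D"): 0.05,
}

print(best_reading(candidates, bigram_p))  # ('AY', 'HH AE V', 'R EH D')
```

The exhaustive search grows exponentially with sentence length, so anything beyond a toy would want a dynamic-programming search such as Viterbi over the same table.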
As for speech recognition, aren't there any libraries or code bases out there that convert sound to IPA? It seems the most obvious solution. Heck, you could probably get away with some on-body sensors for more accurate detection of particular IPA symbols.
Incidentally, if you want the data and code I was playing around with, I put it here. Read the thirty or so lines of disclaiming comments before you complain about it being a 65MB Perl script. (I didn't want to bother packaging multiple files, among other concerns.) LZMA compressed, so install the lzma package or grab 7zip, depending on your OS. Compressed, it's 6.4MB.
tasks(723) drafts(105) languages(484) examples(29106)
In Klingon. Before the Nuance company bought them, when Mark Mandel was still heavily involved at Dragon, they had a Klingon speech recognition project. It was rather fun because it was an entirely artificial language.
A better (international) example: Police police police, police police.
Understood as: Police (whom) police police, police police.
You can make arbitrarily long (true) sentences: [Police (whom) police]^n police, police [police (whom) police]^(n-1) police.
Of course, you can parse this sentence many ways, in a grammatically correct form, however the sentence is no longer tautological.
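The pattern above is easy to generate mechanically. A small sketch, with the function name made up and the optional "(whom)" left out:

```python
def police_sentence(n):
    """Build [police police]^n police, police [police police]^(n-1) police."""
    if n < 1:
        raise ValueError("n must be at least 1")
    first = ["police"] * (2 * n + 1)         # n pairs plus the verb
    second = ["police"] * (2 * (n - 1) + 2)  # verb, n-1 pairs, object
    text = " ".join(first) + ", " + " ".join(second)
    return text[0].upper() + text[1:] + "."

print(police_sentence(1))  # Police police police, police police.
```

Each increment of n adds four more words, so the sentence for a given n has 4n + 1 of them.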
I voice googled "Glenn Close" (the actor) and got results for "Clean Clothes" (the laundry). I lol'ed.
A: To keep cows in.
Tay Zonday, is that you?
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
Just today I tried to use an automated recognition system to confirm a change of address-- and it kept just hanging up on me when it couldn't figure it out-- and I doubt it's capable of figuring it out, my street is a weird European name that everybody mis-hears and probably isn't in their database. And their web-based alternative couldn't hack it either for unrelated reasons, directing me to their phone system instead, or I wouldn't have gone there in the first place. It's a big Wall Street company, BNY Mellon, wanting to send me a statement for a fraction of shares worth about 25 cents that would cost me more to sell than they are worth. They wouldn't accept the USPS change-of-address notification without confirmation. Well, fine, but their system is just too broken to do that. Fortunately I just don't care enough about their stupid statements anyway, so I'm just going to blow it off.
Speech recognition will continue to hit upon this wall because it disregards internal representation. What you hear may externally be a wave form that has some certain characteristics, but the phonetic structure is also dependent on the phonemic, morphophonemic, syntactic, and semantic interactions. To actually understand a word, your exposure to phonetic information must trigger the aforementioned interactions.
The best example of speech recognition learning comes from babies. Babies are born with the ability to distinguish a pretty much infinite number of phonemes. Continued exposure to their native language then narrows this to the phones that are applicable to their use, i.e. their language. In English, things like aspirated ps get ignored for the purposes of meaning, such that I can hear "stoph" and how it is not distinct from "stop". Built upon this we discover morphemes and morphophonemic rules, so that I can tell that "stop" becomes "stopt" in past tense. Similarly, upon this we build syntactic and semantic relationships. This is context-based understanding. I need a context of "past" to start applying morphemes for past tense, but I also need the correct phonemic context to perform the correct allophonic substitutions. Similarly, if someone with a thick Scottish or Novocastrian accent comes up to me on the street, I need to combine my semantic context with my own abstract internal representations of my language to try and understand them.
This provides a form of natural error correction that allows me to understand something I have never heard before and that might contain deviations or ambiguity (either inherent in the language or introduced by the speaker). My internal representation of English should prevent me from ascribing wrong phone clusters or wrong morphemes (runn-ingk) to the processed sound.
It's stimulus plus rule matching plus context plus error correction that should ultimately help me decide if something can be understood.
All of this ignores the complexity of graphemic translation, which builds another set of rules.
The article said (somewhat in jest) that throwing out linguists helped improve the accuracy of the system. Sure, methods not representative of human language capability might in the short term give greater results, and there is no definitive model of how language is represented in the mind. You can probably provide a great system that ignores much linguistic information and functions in a limited context (i.e. one language, rigid contexts such as yes/no, numbers, etc.). However, ultimately if the goal is to produce a speech system that functions like a human -- that is, performs the error correction when appropriate, uses various types of linguistic information, and in certain circumstances requires clarification -- then linguistic models are important.
Amen. NetBSD claims that speech dictation is dying, yada yada. Meanwhile, in the real world, digital dictation is being used every day by vast numbers of people. Use a decent headset, spend the time going through the training, and Dragon is scarily accurate. It's used by law firms, who can't ask Partners to bill for typing up time, and it works well.
Long way to go.
If there is one thing we have learned from 60 years of AI research, it's to never bet on AI fulfilling its promises.
We've been studying the inferior colliculus, and some of the processing there appears unexpectedly complex, suggesting that speech recognition software may not be using the full set of cues that the auditory system has available to it.
While not as common today, a few decades ago machine dictation was used in much of business. Even with high-fidelity recorders and human transcribers with perfect hearing, there were constant problems with misinterpretation. It is hard to imagine machines readily overcoming hurdles that millions of years of evolution have failed to surmount.
Here is one. Oh, here is another. You must not be looking very hard.
You don't understand. *I* don't have a flying car. Therefore all is lost.