Full-Text Audio Search
Captain Chad writes "The latest print edition (12/16/2002) of InfoWorld has an interesting article about an audio search program by Fast-Talk Communications. (The article is not yet available on the InfoWorld web site, but the Fast-Talk site has some good info, including a downloadable trial version.) The product works by breaking the audio stream into phonemes, which are the 'basic units of sound in a language.' The search is then performed for a specific sequence of phonemes. This method is faster and far superior to traditional audio searches which convert to text and then perform a normal text search. The author of the InfoWorld article, Jon Udell, tried a variety of searches that were surprisingly successful. If this technology is as good as he claims, there is a reasonable chance it will revolutionize the way we store data. Maybe there will even be an 'Audio' tab on Google." Here's the InfoWorld article.
Actually, Google already has a voice search, albeit in beta form.
Soundex, which uses the way words sound rather than the way they are spelled, has been widely used by the government and genealogy researchers for the past 60 years. This isn't exactly "new" technology.
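For anyone who hasn't run into Soundex, here is a minimal Python sketch of the standard American Soundex rules as I understand them. The function and table names are my own, and it only illustrates the "code by sound, not spelling" idea; it is not what Fast-Talk does.

```python
# Letter groups that share a Soundex digit (1..6).
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex code for a name (empty string if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # but its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels (and Y) have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]         # pad with zeros, truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike but are spelled differently end up with the same code, which is why it has been handy for census and genealogy lookups.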
Why are more and more /. articles starting to sound like corporate press releases?
Actually it is. InfoWorld: The Power of Voice.
There are a few papers available for download from their website, but you have to register. Basically, traditional voice recognition parses the audio stream into some meta-form, usually representing phonemes (the low-level "atomic" sounds that your speech consists of). These phonemes are then matched against a dictionary of known words (and the phonemes they consist of) and text is produced.
Because phoneme recognition is not particularly accurate (for example, it's hard to tell the difference between "hard d" as in "Dan" and "hard b" as in "Ban" over a noisy phone line), traditional speech-to-text systems use several approaches to improve accuracy. One is to improve the accuracy of the basic phoneme recognition by "training" it for a specific voice. Another is to use all sorts of hairy language-specific grammar / syntax algorithms.
Computationally, it's the matching of the phonemes against the dictionary that's the most difficult, and the larger the dictionary, the less accurate and more CPU-chomping it becomes. In addition, searching the resulting text for specific matches grows less accurate as the search string increases in length, due to the growing likelihood of transcription errors.
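A back-of-the-envelope sketch of that last point, in Python. It assumes every word is transcribed correctly with the same probability and that errors are independent, which real recognizers don't guarantee. The 0.9 figure and the function name are made up for illustration.

```python
def exact_match_probability(per_word_accuracy, query_words):
    """Chance that every word of the query was transcribed correctly,
    assuming independent errors with a fixed per-word accuracy."""
    return per_word_accuracy ** query_words


# With a hypothetical 90% per-word accuracy, longer exact-text queries
# become less and less likely to match the transcript.
for n in (1, 3, 5, 10):
    print(n, round(exact_match_probability(0.9, n), 3))
```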
The cool thing that Fast Talk has done is to store and index the phoneme meta-data, rather than complete the recognition to text. When you enter search words, they break the search string into phonemes and look for matches that way. This has several positive benefits:
1. Computational resources are dramatically lessened, since the "phoneme recognition algorithms" are fast and there's no dictionary matching.
2. The matching doesn't depend on having the right words in the dictionary at input time. It works just as well for unusual proper names and technical jargon as it does for common words, since they're all formed from the same basic phonemes.
3. The longer the search string, the greater probability of an accurate match.
4. No need for accurate search string spelling. It doesn't matter if you know how to spell a word, as long as you can write it down phonetically.
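To make the idea concrete, here is a toy Python sketch of searching a stored phoneme stream instead of a transcript. Everything in it is an assumption for illustration: the ARPAbet-style symbols are hand-written, the stream is invented, and the sliding-window edit-distance match is just one simple way to do approximate matching. I have no idea what Fast Talk's actual algorithm looks like.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # drop a phoneme from a
                           cur[j - 1] + 1,           # add a phoneme to a
                           prev[j - 1] + (x != y)))  # substitute (or match)
        prev = cur
    return prev[-1]


def search(stream, query, max_errors=1):
    """Return (start_index, errors) for every query-length window of the
    stream that is within max_errors edits of the query."""
    n = len(query)
    hits = []
    for start in range(len(stream) - n + 1):
        errors = edit_distance(stream[start:start + n], query)
        if errors <= max_errors:
            hits.append((start, errors))
    return hits


# Hypothetical recognizer output for "the captain found the cabin".
stream = "DH AH K AE P T AH N F AW N D DH AH K AE B AH N".split()

# A short query is ambiguous; a longer one pins down a single spot.
print(search(stream, "K AE".split(), max_errors=0))
print(search(stream, "K AE P T AH N".split(), max_errors=0))
```

The two-phoneme query hits both "captain" and "cabin", while the six-phoneme query for "captain" only matches once, which is the intuition behind point 3. Note there is no dictionary anywhere in the search path.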
In theory, the system should work for any language, but the reality is that different languages do have different sets of phonemes, and I think Fast Talk has only really worked on English. So languages like Spanish that are fairly similar phonetically to English would probably work pretty well, but tonal languages like Mandarin Chinese or those with non-vocal sounds like the clicks and pops of the African Bushmen would require a rework of the phoneme recognition code.
The main downside of their system is that it doesn't actually produce text... which means that you'd need another speech-to-text system if you wanted transcripts, or wanted the data to be searchable with whatever standard text-based search engine you are using on your intranet. But they appear to be aiming at applications where that's not necessary. One of my favorite ideas is integrating it with a video editing suite and being able to jump to different cues in your video clip library simply by stating the dialogue that's found there.
Of course, one of the most obvious applications is for intelligence and security. So far it doesn't appear that the company is pushing too hard in that direction -- it was founded by an academic group that originally developed the technology for a library project at Georgia Tech. However, I'm betting that's where the real money is, and it's only a matter of time before their ideas are found in your favorite national department of big-brotherhood.
-R
And this doesn't even begin to deal with "Engrish" speakers =]
In the great CONS chain of life, you can either be the CAR or be in the CDR.
Actually, there is already an equivalent to the proposed Audio tab on Google. It is an HP Research project, SpeechBot.
Another similar product is already on the market from Scansoft (formerly Dragon Systems), but it uses a completely different approach than the Fast-Talk product. We are actually using the Scansoft product where I work to index all of the media (audio and video) files on the corporate LAN.