Full-Text Audio Search
Captain Chad writes "The latest print edition (12/16/2002) of InfoWorld has an interesting article about an audio search program by Fast-Talk Communications. (The article is not yet available on the InfoWorld web site, but the Fast-Talk site has some good info, including a downloadable trial version.) The product works by breaking the audio stream into phonemes, which are the 'basic units of sound in a language.' The search is then performed for a specific sequence of phonemes. This method is faster and far superior to traditional audio searches which convert to text and then perform a normal text search. The author of the InfoWorld article, Jon Udell, tried a variety of searches that were surprisingly successful. If this technology is as good as he claims, there is a reasonable chance it will revolutionize the way we store data. Maybe there will even be an 'Audio' tab on Google." Here's the InfoWorld article.
/sarcasm
How long before the feds start digitizing all of our telephone conversations and using this technology to google our private conversations?
Yay!
before we have a "video" tab on google? :)
Do the songs need to be converted to this new phonetic format or can you just search the audio? Wouldn't this use a tremendous amount of computing power?
I don't keep a lid on my coffee so when I walk around I look busy -me
Now I can finally search for the Free Radio Linux kernel reading by phonemes!
*Splort*
...can it decode rap and/or reggae? I swear I can't understand 3/4 of those lyrics. Songs could start with
-----BEGIN PGP MESSAGE-----
and I wouldn't be able to tell the difference.
The Google voice search is used to search Google by telephone rather than online. This doesn't search through voice/audio records for matches.
I don't keep a lid on my coffee so when I walk around I look busy -me
Combine this with the Streamed Audio Kernel source, and it's only a matter of time before people leave patches on Linus' voicemail! The great thing is that to patch an audio kernel, you only need a tape recorder.... :P
First I'd like to say that this would be wonderful for NPR to use. *drool*
On a serious note: I really didn't think NLP software was at the point where this is plausible. I've never actually used NLP tools, but what I've heard in the mainstream is that while they work, they aren't perfect. This is fine for someone staring at a screen while talking or someone who is going to review the transcription, but it seems like it would break any automated system where there is no system of checks in place, since checking involves a human.
"Not knowing when the dawn will come, I open every door." - Emily Dickinson
So does anyone out there know how well this technology deals with accents and dialects? If so, perhaps we could finally see that 'Star Trek' like universal translator - or at least translate on a large scale media works from the past century into other languages.
... does this spell the end to stenography as we know it?
Of course, noble thoughts aside, I keep thinking how useful it would have been to have such technology in college when I had to transpose long lectures from my chicken scratch.
Hmmm
healyourchurchwebsite.com - WWJB?
... Or imagine Google recording all possible audio streams (TV, radio, ... streets?) and allowing us to search those? All it takes is enough processors, a bit of wiring...
Now if you record street conversations or all types of public conversations... Do a search on 'bomb'... How appealing is that to big brother.
All right... I'm learning sign language. Now.
I just hope one of those nuisance lawsuits from Tzsvestaeya Zolskovova, the eccentric widow of Sergei Zolskovova, (Russian linguist who coined the word phoneme) over the use of the term "phoneme" doesn't hobble progress in this fascinating area.
I once wrote a phonetic search engine for a site that took keywords and broke them down into their soundex phonemes, then stored those. Then, when a search was executed, it would convert the text words into phonetic pieces and search the database for matches. It was quite accurate, actually. For example, one could search for "olif ghardin" and one of the returned results would be "olive garden".
I guess this is a similar idea. Pretty cool tech.
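For anyone curious how that kind of matching works, here is a minimal sketch of standard American Soundex in Python. It is not the engine described above (its code isn't shown here); the function names and the idea of keying a phrase by its per-word codes are assumptions made for illustration.

```python
# Standard American Soundex digit groups; everything else here is illustrative.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return the 4-character Soundex code for a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return ("".join(out) + "000")[:4]


def phrase_key(text: str) -> tuple:
    """Hypothetical index key: the Soundex code of each word in the phrase."""
    return tuple(soundex(w) for w in text.split())


print(phrase_key("olif ghardin"))  # ('O410', 'G635')
print(phrase_key("olive garden"))  # ('O410', 'G635')
```

Both spellings collapse to the same key, which is why the misspelled query still finds the right record. Note that Soundex works on spelling, not on recorded sound, so it is a cousin of the phoneme approach rather than the same thing.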
For editing films. For documentaries in particular, this would be a godsend. Imagine if, in addition to video/audio tracks, you had a simple 'text' track with which you could easily assemble your cuts.
If nothing else, putting the computer to work on the 'condense 100 hours of footage into pieces of paper' stage would be a nice step.
If it prevented just one assistant editor from going insane, it'd be worth it. Do it for the children.
-Brett
A search engine that lets you hum a song and it figures out which one it is.
Someone mentioned it can be used by the government for TIA stuff - agreed, but same with any technology. It has its positive and negative uses. I don't think we are all going to revert to cavemen to get away from it.
Random is the New Order.
There's one kind of "audio search" I'd really like to see: searches for a song by tune.
I've seen a couple of web sites which offer tune searches, but they all work on the index system used in fake-books: start from the first note, and then from there, say whether the next note is higher, lower, or the same. But this system has problems: a reasonably short search will match a whole lot of songs; it's often hard to tell whether certain extra notes are considered part of the tune; and some songs have an obscure beginning and an easily recognizable theme farther in, and you don't know which one is indexed.
These sites have also tended to only index very well-known tunes - usually, folk songs, show tunes, and a few jazz standards.
One site allows you to send them a recording of you whistling the tune, which seems like an improvement, but it actually just translates it into the up-down-repeat notation.
My ideal music search would be something that would take large quantities of music (let's ignore for the moment where it gets the large quantities of music without pissing off the RIAA) and scan each song for prominent tunes. You could then search these with perhaps the up-down-repeat notation, but also by inputting music notation, for people who know it. The search would have to be key-insensitive, and allow fairly fuzzy matches.
If it could give me the name of that pop song/jazz tune/classical piece I just heard on the radio, it'd be pretty good.
But if it works really well, it'd be a blessing for music composers - they could just search for that tune that just popped into their head, instead of worrying over whether they're subconsciously ripping off another song.
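The up-down-repeat notation and the fuzzy, key-insensitive matching wished for above can be sketched in a few lines of Python. This is only a toy under stated assumptions: tunes are given as MIDI note numbers, the two-song table is made up for the example, and "fuzzy" is taken to mean edit distance against any stretch of the tune.

```python
def contour(pitches):
    """Up/down/repeat string for a pitch sequence (e.g. MIDI note numbers)."""
    return "".join(
        "U" if b > a else "D" if b < a else "R"
        for a, b in zip(pitches, pitches[1:])
    )


def best_substring_distance(query, text):
    """Smallest edit distance between query and any substring of text."""
    prev = [0] * (len(text) + 1)  # row of zeros: a match may start anywhere
    for i, q in enumerate(query, start=1):
        cur = [i]
        for j, t in enumerate(text, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (q != t)))
        prev = cur
    return min(prev)


# Illustrative index: opening notes only, transcribed for this example.
SONGS = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
}


def search(query_pitches, songs, max_errors=1):
    """Return titles whose contour matches the query within max_errors edits."""
    q = contour(query_pitches)
    hits = []
    for title, pitches in songs.items():
        d = best_substring_distance(q, contour(pitches))
        if d <= max_errors:
            hits.append((d, title))
    return [title for d, title in sorted(hits)]


# "Twinkle" hummed a whole tone higher: the contour is unchanged.
hummed = [62, 62, 69, 69, 71, 71, 69]
print(search(hummed, SONGS, max_errors=0))
print(search(hummed, SONGS, max_errors=1))
```

Because only the direction of each step is kept, transposing the tune makes no difference. The second search also returns "Ode to Joy" at one error, which illustrates the complaint above: short contour queries match a lot of songs once any fuzziness is allowed.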
Win dain a lotica, en vai tu ri silota
We have three > 300,000 sqft underground facilities loaded with rows of 2U rack systems with eight 120gig hard drives in each. Every phone call you have had since 1994 is now stored in this massive datasystem. Transferring all the old calls from tapes to the new system was probably the most tedious job I have ever done.
Soundex, which uses the way words sound rather than the way they are spelled, has been widely used by the government and genealogy researchers for the past 60 years. This isn't exactly "new" technology.
Why are more and more /. articles starting to sound like corporate press releases?
You search by pronunciation as well as store by it. The Census Office has stored names by this method for more than a century.
Does it recognize speech, or does it wreck a nice peach?
This sig no verb.
and there are several research prototypes that can do it (check out a paper from Cornell, or just google for query-by-humming)
Direct search using various phonetic representations has been around for many years. All things being equal, it's known to be somewhat better than searching the output of a speech recognizer using approximate string matching. But you have to weigh against that the fact that both speech recognition and approximate string matching are being pushed much harder than this kind of search, so you may end up getting better performance using speech recognition and string searching anyway.
I can see a good use for it, and that's taking notes. Imagine carrying around a microphone with you 24/7 and recording everything you hear. We've got the space to store it all, after all. Then you can go back and check "When did he say that meeting was?" or whatever. And those with significant others know how often you'll end up arguing over who said what.
Visit me on #weirdness on the Galaxynet.
Actually it is. InfoWorld: The Power of Voice.
There are a few papers available for download from their website, but you have to register. Basically, traditional voice recognition parses the audio stream into some meta-form, usually representing phonemes (the low-level "atomic" sounds that your speech consists of). These phonemes are then matched against a dictionary of known words (and the phonemes they consist of) and text is produced.
Because phoneme recognition is not particularly accurate (for example, it's hard to tell the difference between "hard d" as in "Dan" and "hard b" as in "Ban" over a noisy phone line), traditional speech to text systems use several approaches to improve accuracy. One is to improve the accuracy of the basic phoneme recognition by "training" it for a specific voice. Another is to use all sorts of hairy language-specific grammar / syntax algorithms.
Computationally, it's the matching of the phonemes against the dictionary that's the most difficult, and the larger the dictionary, the less accurate and more CPU-chomping it becomes. In addition, searching the resulting text for specific matches grows less accurate as the search string increases in length, due to the likelihood of transcription errors.
The cool thing that Fast Talk has done is to store and index the phoneme meta-data, rather than complete the recognition to text. When you enter search words, they break the search string into phonemes and look for matches that way. This has several positive benefits:
1. Computational resources are dramatically lessened, since the "phoneme recognition algorithms" are fast and there's no dictionary matching.
2. The matching doesn't depend on having the right words in the dictionary at input time. It works just as well for unusual proper names and technical jargon as it does for common words, since they're all formed from the same basic phonemes.
3. The longer the search string, the greater probability of an accurate match.
4. No need for accurate search string spelling. It doesn't matter if you know how to spell a word, as long as you can write it down phonetically.
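To make the idea in the list above concrete, here is a rough Python sketch of searching a stored phoneme track for a query phrase. It is not Fast Talk's algorithm (their papers are behind registration); the pronunciation table, the ARPAbet-style symbols, the track format and the mismatch tolerance are all assumptions made up for the example.

```python
# Hypothetical pronunciation table; a real system would use a full
# pronouncing dictionary or letter-to-sound rules to handle any word.
PRONUNCIATIONS = {
    "olive": ["AA", "L", "IH", "V"],
    "garden": ["G", "AA", "R", "D", "AH", "N"],
}


def to_phonemes(query):
    """Turn a typed query into one flat phoneme sequence."""
    out = []
    for word in query.lower().split():
        out.extend(PRONUNCIATIONS[word])
    return out


def find(track, query, max_mismatches=1):
    """Return start times (ms) where the query's phonemes appear in the track.

    track is a list of (start_ms, phoneme) pairs, standing in for the stored
    output of a phoneme recognizer. Only substitutions are tolerated here;
    inserted or dropped phonemes are not handled.
    """
    target = to_phonemes(query)
    hits = []
    for i in range(len(track) - len(target) + 1):
        window = track[i:i + len(target)]
        mismatches = sum(p != t for (_, p), t in zip(window, target))
        if mismatches <= max_mismatches:
            hits.append(window[0][0])
    return hits


# "the olive garden" as a noisy recognizer might store it: V misheard as F.
heard = ["DH", "AH", "AA", "L", "IH", "F", "G", "AA", "R", "D", "AH", "N"]
track = [(i * 100, p) for i, p in enumerate(heard)]

print(find(track, "olive garden"))                    # [200]
print(find(track, "olive garden", max_mismatches=0))  # []
```

The ten-phoneme query survives one recognition error and still lands on a single spot, which is the intuition behind point 3: a longer phoneme string leaves less room for accidental matches.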
In theory, the system should work for any language, but reality is that different languages do have different sets of phonemes, and I think Fast Talk has only really worked on English. So languages like Spanish that are fairly similar phonetically to English would probably work pretty well, but tonal languages like Mandarin Chinese or those with non-vocal sounds like the clicks and pops of the African Bushmen would require a rework of the phoneme recognition code.
The main downside of their system is that it doesn't actually produce text... which means that you'd need another speech-to-text system if you wanted transcripts, or want the data to be searchable with whatever standard text-based search engine you are using on your intranet. But they appear to be aiming at applications where that's not necessary. One of my favorite ideas is integrating it with a video editing suite and being able to jump to different cues in your video clip library simply by stating the dialogue that's found there.
Of course, one of the most obvious applications is for intelligence and security. So far it doesn't appear that the company is pushing too hard in that direction -- it was founded by an academic group that originally developed the technology for a library project at Georgia Tech. However, I'm betting that's where the real money is, and it's only a matter of time before their ideas are found in your favorite national department of big-brotherhood.
-R
And this doesn't even begin to deal with "Engrish" speakers =]
In the great CONS chain of life, you can either be the CAR or be in the CDR.
How is this different from soundex? For decades, databases of names have been stored in soundex. If your driver's license number begins with letter-number-number-number, it is probably soundex. If you have done any ancestry searching, as I have, you have encountered soundex; this way, if you search for John Smith in 1732, you will find records for Juan Smyth, Jon Smythe, John Smitt, etc.
The benefits of having actual sound? If it's just going to use a soundex-type formula in the core functioning, the sound would just be a gimmick, and a storage-taking one at that. Sure, compression has gotten amazing, but will the sound of Smith really take anything near the same 4 bytes as "S530" ??
Aside from searching for music, I can see this being really useful in web conferencing software. Consider this:
You hold a meeting where each person's channel was recorded and stored as part of the meeting info. Upon saving the meeting minutes, the software builds a phonetic index of the entire conversation.
Searches later on would be no more taxing on the server than a fulltext search in MySQL is now.
Useful? Definitely. And that's just one possibility.
putfwd.com - 1GB Free file storage with a twist
As 90% of the data for diagnosis comes from the history-taking (interviewing) the patient, the potential for automating/supporting diagnosis is exciting.
Imagine a system that listens to a consultation in real time, making helpful suggestions for diagnosis based on analysis of the patient and the doctor's phoneme streams! And no tedious data entry, just an unobtrusive microphone.
I've been waiting for this.
You can never eat too much, only cycle too little.
The basic idea of using audio similarity to "grep" short sounds out of audio streams (as opposed to using ASR and text-matching) is quite old - some classic papers based on dynamic timewarping date back to 1977, and HMMs became popular for this application about ten years after that. Papers on this kind of thing appear in conferences like ICASSP - look for keywords like "keyword spotting" or "wordspotting." The phone company wanted to do this for obvious reasons.
Note that I'm not saying the GATech technology used by this company is derivative - I haven't looked at the specifics of this approach.
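For readers who haven't met dynamic time warping, here is a minimal Python sketch of the textbook recurrence. It is an illustration only: it compares two whole sequences of single numbers, whereas the wordspotting work mentioned above operates on acoustic feature vectors and has to locate the keyword inside a longer stream.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the best of: stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1, 2, 3]
print(dtw_distance(template, [1, 1, 2, 2, 3, 3]))  # same shape spoken slowly: 0.0
print(dtw_distance(template, [3, 2, 1]))           # different shape: 4.0
```

The point of the warping is that the same word said at half speed still scores as a perfect match, while a genuinely different pattern does not.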
Really, when would a terrorist's conversation actually include the word "terrorist" ? Maybe he would say something like, "Hey Abdul, we need another terrorist in on this bombing." Or maybe: "Terrorist Jim, meet terrorist Mike."
They say the first thing to go is your penis. Well, it's either that or your brain. I forget which...
Italys nose probes them et al. fingering our hose
the fee Cult to longer stained syrups and Hussein marmot pervert sucks eggs rat. Intact, eye amusing into dick tape his pest of flash snot.
- - -
It has no problems at all figuring out those difficult to understand lyrics and has an almost perfect success rate. In fact, I am using it to dictate this post to Slashdot.
"Everything is adjustable, provided you have the right tools"
After reading a posting that someone had probably typed in silence and submitted to slashdot, I posted this reply silently and now you're reading it. Chances are you aren't reading this out loud. Nobody said a word, or even had to hear one. Reading can be an astoundingly efficient way to transfer information.
:) Click that white button and turn it green.
I have yet to meet anyone in good health who prefers getting ten voice mails over ten emails.
What the world needs is fewer karma whores and more good friends.
Go ahead, friend.
http://tinyurl.com/4ny52
Well, I'd personally love to have an audio search tool to comb through all the mp3 files of talk radio programs such as *Loveline*, *Opie & Anthony*, and *The Greaseman*, which I have. Sometimes I think, "Now which show had that cool bit about..." and I have no hope of finding it.
For a professional rather than personal use, imagine how useful this could be to radio stations if they keep digital archives of their programs--if someone wanted to look up a particular program based on a vague memory of some of the text, a tool like this would be invaluable.
Chasing Amy
(We all chase Amy...)
"The more corrupt the state, the more numerous the laws"-Tacitus
there are 4 petabytes of files being shared, a search through 150 terabytes doesn't seem so big.
I saw a demonstration of essentially the same technology at Compaq's CRL about two and a half years ago (formerly DEC CRL, or Cambridge Research Lab, the guys who did research for AltaVista). It did exactly the same thing. It broke sound files down into phonemes, then searched based on the phoneme. It was mostly used for finding a clip on the web rather than a specific place in a long file, but it was the same idea. The nice thing was that it was OK for its application if it missed once or twice. If the audio file was relevant, the word or phrase was probably used multiple times in the clip. It was pretty good at finding NPR stories about certain events. In fact, you can try it out for yourself at an online demo.
I think there should be three tabs instead of one 'Audio' one:
Less is more !
Well if you're the govt, then the implications are huge. If they had a way of efficiently "keyword" scanning spoken conversations (esp. phone), this would help intelligence gathering tremendously. If this company can make their stuff work as advertised, they have huge upside potential with the likes of the NSA, DOD, CIA, etc.
Wanna bet?