Online Speech Indexing
Thomas Edwards from The Sync (where we host Geeks in Space) sent us an interesting site:
"Speechbot" is a Compaq Research project that is indexing online radio shows. Apparently it found terms like 'Red Hat' and 'Yahoo' in past episodes of GiS. Interesting technology. Imagine when it lets me ask my TV to find me every show that mentions Sarah Michelle Gellar.
Ohoh, I've stirred up a firestorm here :-)
:-)
Re: SQL -- it can be any SQL server, really. However, I will add that we are somewhat in bed with Microsoft on the visualization end, simply because IE5 does XML quite well (note to Mozilla people: get with the program).
Re: Open source. Unfortunately, not up to me. Much of the technology is "open source" in the sense that papers have been published about it (not what you were looking for, I know), but we've already licensed some of the core technology to another company, and being a phone company (GTE) we consider the speech rec somewhat of a competitive advantage (wipe those Echelon thoughts out of your mind! We use it for call center and directory assistance automation! Sheesh!)
As I posted probably about 6 months ago in a thread about speech recognition, there are some significant issues with open-sourcing beyond the recognizer code. The learning processes behind the recognition are based on a considerable amount of data for which licensing is an issue, such as CNN broadcasts. In fact, we use over 100 hours of broadcast news audio to train the system, and several million words of text for the language model. This comes to us through the Linguistic Data Consortium at the University of Pennsylvania (http://www.ldc.upenn.edu). This is an academic group set up to maintain these common train-and-test databases for researchers, and there's a fairly sizeable fee to join. They handle the intellectual property issues with the training data.
And, unfortunately, without the training data, it's kind of hard to use the system. At least, if you want to use it on something it's not already trained on (in our case: North American broadcast news).
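To give a rough feel for why the text data matters, here is a toy sketch of a bigram language model in Python. This is an illustration only: the comment above does not say what kind of language model the real system uses, and the tiny corpus, `train_bigrams`, and `bigram_prob` are all made up for this example. A real model would be estimated from millions of words and would use smoothing.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each word follows each other word in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Unsmoothed estimate of P(nxt | prev); 0.0 if prev was never seen."""
    followers = counts.get(prev)
    if not followers:
        return 0.0
    return followers[nxt] / sum(followers.values())

# Hypothetical stand-in for the licensed training text.
corpus = "the open source movement and the open source foundation and the open door"
model = train_bigrams(corpus)

p_source = bigram_prob(model, "open", "source")  # seen twice out of three "open"s
p_sores = bigram_prob(model, "open", "sores")    # never seen in the training text
```

With a model like this, a recognizer can prefer a word sequence it has seen in training over an acoustically similar one it has not. Without training text from the target domain, those counts are simply missing.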
What is particularly interesting to note is that the quality of these Internet Radio shows is generally fairly poor. The voice recognition and dictation software that I have toyed with before has always suggested using better microphones and higher sampling rates to achieve decent results. Some even claimed that low quality audio results in a severe accuracy penalty.
It is very remarkable that this thing can index these low quality streams with the accuracy that it does! I hope that searchable media (other than text) continues to get better like this. Companies like Virage and Compaq definitely deserve our support. I hope that standard interfaces appear soon.
~GoRK
It found no instances of the word Linux, which I found humorous.
:-)
However, with a little brain usage, search for "line" and you get this:
... there an a to think you're doing is making good news slash my next monday's announcement makes it you can use less leonard still want to which it's tilman of the of the open sores movement I have not part of the open sewers that but why in part of the priests out their foundations giving his last line next to the flashlight next with you a while and we can end of the various duckling and in the he's serving snacks that promptly opening the top of that there is god who will bomb and the crowd is bernie this is definitely the most exciting play a thing would have to have one of I mean you for a column about how ...
The words "end of the various duckling" and around there are in fact "Eric Raymond" in the clip, which I thought utterly hysterical. You can tell because they say "it's Eric Raymond, and he's serving snacks," which partway comes out correct.
Linux seems to have come out as "line next" a lot, and "line of" in some clips I've found.
Obviously, the technology is not quite there yet.
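Until then, the workaround above (searching for what the recognizer tends to hear instead of the real word) can be sketched in a few lines of Python. This is only an illustration run against a pasted transcript, not anything Speechbot itself offers: the `MISHEARINGS` table and `find_term` are hypothetical names, and the only entries in the table are the two variants mentioned above.

```python
import re

# Hypothetical table of known mis-transcriptions, keyed by the intended word.
# Only the two "linux" variants come from the clips discussed above.
MISHEARINGS = {
    "linux": ["line next", "line of"],
}

def find_term(transcript, term):
    """Return (offset, matched_text) for the term or any known mis-transcription of it."""
    variants = [term] + MISHEARINGS.get(term.lower(), [])
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.group(0)) for m in pattern.finditer(transcript)]

# A fragment of the pseudo-transcript quoted above.
snippet = "giving his last line next to the flashlight next with you a while"
hits = find_term(snippet, "linux")
print(hits)
```

A variant as common as "line of" will obviously drag in plenty of false hits, so each match still needs a listen.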
---
- Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
The press release has a little more information. We use workstations running NT to spider the sites; processing is done on a farm of Linux servers, and the UI runs on AlphaServer DS20 machines.
Dragon Systems (makers of NaturallySpeaking continuous VR) announced a similar product at Comdex. They call it audiomining.
-- Don't Tase me, bro!
Might as well use this as a chance to plug my project:
http://www.gte.com/AboutGTE/gto/bbnt/speech/research/extraction/roughn_ready/index.html
...which not only tells you what words were said, but who said them, and what topics were being talked about...
Did a search for "Mars Probe" in the Science Friday show, and got this snippet:
Err... yeah. That would explain a great many things about space probes. Actually, I'm sure the textified show would be a lot more interesting than the real show. And then, we could shove it through Babelfish for added enjoyment...
I recently installed the ViaVoice beta for Linux, and found its recognition not quite ready for prime time... at least for my needs. I'd be surprised if radio shows, which often have people on fairly crummy phone connections, would be ideal candidates for automated indexing.
"I want to die" turns up 6
"Grits" turns up 12
"Sex with animals" turns up 5.
"Your mother" turns up 200.
My conclusion: "Your mother is still almost ten times as important as suicide, sex with animals, and grits combined."
Remember that, always.
"If one is really a superior person, the fact is likely to leak out without too much assistance" -- John Andrew Holmes
Someone should take these pseudo-transcripts and run them thru babelfish. Think of the gibberish level we could achieve!
When the guys introduce themselves, the translator has a fun time with their names and nicknames.
Rob "CmdrTaco" Malda -- rob commander topple mall
Jeff "Hemos" Bates -- jeff in both states
Nate Ostendorf -- the husband or the smoke
I also searched for Linux, and I'll bet that it can't find any instances because it doesn't translate it right, with all the different pronunciation possibilities.
It's a cool idea, but has a ways to go. Go Compaq.
yay.