Data Mining And The CIA
Brotha Z writes "It seems that the CIA has developed a piece of software called 'Oasis' that can convert the audio from television and radio broadcasts into text. The software is said to determine the sex of the speaker and to detect when a different person starts speaking; if one of the speakers is named, it will keep placing that name next to the correct speaker from that point on. More information on this multi-faceted piece of software can be found here." Hmmm. Sounds like some nice speech recognition technology ("perfect demo" alert!), but as a taxpayer, something rings badly about it. If they're going to use my money to spy on me, can't they at least open source the code so I can dictate a letter?
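For the curious, the "keep the name next to the correct speaker" behaviour described in the summary could look roughly like the sketch below. Nothing here comes from Oasis itself: the segment format, the cluster IDs, and the `label_transcript` function are all hypothetical, and it assumes some earlier stage has already grouped segments by voice and spotted where a name was introduced.

```python
def label_transcript(segments):
    """Attach a speaker label to each transcript segment.

    Each segment is a tuple (cluster_id, text, introduced_name), where
    cluster_id is a hypothetical voice-cluster number from an earlier
    speaker-change detector and introduced_name is a name spotted for
    that voice in this segment, or None.
    """
    known_names = {}   # cluster_id -> name, once a name has been heard
    labelled = []
    for cluster_id, text, introduced_name in segments:
        if introduced_name is not None:
            known_names[cluster_id] = introduced_name
        # Unnamed voices get a generic label until a name turns up.
        label = known_names.get(cluster_id, f"Speaker {cluster_id}")
        labelled.append(f"{label}: {text}")
    return labelled


# Made-up input: voice 2 is named in the second segment.
SEGMENTS = [
    (1, "Good evening.", None),
    (2, "Thanks. This is Pat Doe reporting.", "Pat Doe"),
    (1, "What is the latest?", None),
    (2, "Talks resume tomorrow.", None),
]
```

Earlier lines from a voice stay generic; only segments from the point of naming onward carry the name, which matches the "from that point on" wording in the summary.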
Just this morning I was joking with my wife about buying an alarm clock that snoozed when I yelled 'shut up', 'piss off', 'go away', or 'it's saturday'... Maybe this technology will lend itself to alarm clocks in the future :)
Morning sarcasm. I'll get back to work.
LOAD "SIG",8,1
LOADING...
READY.
RUN
Beyond that, the TellMe service should also recognize the command "shut up" along with "stop" and "tell me more". I mean, if you're going to have a voice-activated phone portal, why not use "natural language" for commands? ("Shut the hell up you stupid bitch! I said 'stock quotes', not 'stock racing'!")
For those of you who have no idea what I'm talking about, dial 1-800-555-TELL. The service is free, for now.
- I don't care if they globalize against free speech. All my best free thoughts are done in my head.
and
I don't know about you, but I'm pretty damned impressed.
the article on this system
Without the pad, it's not Dance Dance Revolution, it's Listen
I don't understand why they specifically mentioned TV and radio. If the audio is digitised before being passed to the software, it doesn't really matter where it comes from. Maybe they are trying to draw attention away from the fact that it can also be used to make transcripts of phone calls and of normal conversations recorded with various listening devices?
About that feature that identifies the speaker, imagine a conversation that goes like this:
Speaker 1: You the Man.
Man: No, YOU the MAN.
Man: No no, you Da Bomb
Da Bomb: Hehe
Watch word: BOMB Alert! Alert!
As a final side note, I won-der... if... it... works... if... you... talk like... Cap-tain... K-irk... ;-)
====
Codeala - Just another mindless drone
They don't seem to have very accurate speech recognition technology. The article claims to reduce transcription time by a factor of about nine. That's a far more believable claim than good speech recognition would be.
My guess is that it's really fairly poor speaker-independent stuff. It probably runs a quick, low-quality word recognition pass (quite a few of those algorithms are around) and then uses some sort of Bayesian network to correct the transcription using lexical context. I know that ARPA was openly funding people doing exactly that a few years ago, and I'll bet their papers are on the web. It doesn't shock me greatly that someone has had some measure of success with it.
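To make that guess concrete, here is a minimal sketch of correcting a sloppy recognizer with lexical context. It swaps in a plain bigram language model with Viterbi decoding as a simple stand-in for a full Bayesian network, and every number and word in `CANDIDATES` and `BIGRAMS` is invented for illustration; it is not how Oasis is known to work.

```python
import math

# Hypothetical recognizer output: for each word position, the candidate
# words and an acoustic score P(audio | word). All values are made up.
CANDIDATES = [
    {"the": 0.9, "a": 0.1},
    {"prime": 0.4, "crime": 0.6},
    {"minister": 0.7, "sinister": 0.3},
]

# Hypothetical bigram probabilities P(word | previous word).
BIGRAMS = {
    ("<s>", "the"): 0.6, ("<s>", "a"): 0.4,
    ("the", "prime"): 0.3, ("the", "crime"): 0.3,
    ("a", "prime"): 0.1, ("a", "crime"): 0.3,
    ("prime", "minister"): 0.8, ("prime", "sinister"): 0.01,
    ("crime", "minister"): 0.01, ("crime", "sinister"): 0.01,
}
FLOOR = 1e-4  # probability assumed for bigrams missing from the table


def greedy_decode(candidates):
    """Pick the acoustically best word at each position, ignoring context."""
    return [max(options, key=options.get) for options in candidates]


def viterbi_decode(candidates, bigrams):
    """Pick the word sequence maximising acoustic score times bigram score."""
    # best[word] = (log score of the best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for options in candidates:
        next_best = {}
        for word, acoustic in options.items():
            scored = []
            for prev, (score, path) in best.items():
                lm = bigrams.get((prev, word), FLOOR)
                scored.append((score + math.log(lm) + math.log(acoustic),
                               path + [word]))
            next_best[word] = max(scored, key=lambda item: item[0])
        best = next_best
    return max(best.values(), key=lambda item: item[0])[1]
```

With these invented numbers, `greedy_decode(CANDIDATES)` gives "the crime minister", while `viterbi_decode(CANDIDATES, BIGRAMS)` gives "the prime minister": the context model overrules the slightly better acoustic score for "crime".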
If it was 100% accurate transcription, then I wouldn't believe it. But as a time saving device for transcribers... that I find credible.
DARPA also funds a lot of automatic topic spotting research. One of my ex-profs received grants from them under just such a rubric and her papers are publicly available on the web. I'll bet whatever technology they are using, it was developed by a prof at an open university who publishes freely.
As for multilingual text searching and summarisation, the best technology of its kind known to me is Latent Semantic Analysis, the brainchild of Thomas Landauer. It's a fairly recent, but hardly secret or obscure, indexing technique that's gaining ground commercially for data mining applications. It can certainly do the small number of things being claimed by this article. All the relevant papers are on the web.
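For anyone who hasn't seen LSA, here is a toy sketch of the core idea: take a term-document matrix, keep only its top few singular directions, and compare documents in that reduced space. To stay dependency-free it gets the truncated SVD by power iteration on the document Gram matrix instead of calling a library SVD, and it skips the term weighting real systems use. The corpus and all function names are made up for illustration.

```python
import math
import random

# Hypothetical corpus: three documents about rockets, three about budgets.
DOCS = [
    "missile launch",
    "launch rocket test",
    "rocket test",
    "budget tax",
    "tax spending deficit",
    "spending deficit",
]


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return [[doc.split().count(word) for doc in docs] for word in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_eigenpairs(sym, k, iterations=300, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration."""
    rng = random.Random(seed)
    n = len(sym)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            # Project out the eigenvectors already found.
            for _, u in pairs:
                overlap = sum(a * b for a, b in zip(v, u))
                v = [a - overlap * b for a, b in zip(v, u)]
            w = [sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        sym_v = [sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
        pairs.append((sum(a * b for a, b in zip(v, sym_v)), v))
    return pairs


def raw_document_vectors(docs):
    """Each document as its plain word-count column."""
    matrix = term_document_matrix(docs)
    return [[row[j] for row in matrix] for j in range(len(docs))]


def latent_document_vectors(docs, k=2):
    """Each document as k coordinates: singular value times right singular vector."""
    matrix = term_document_matrix(docs)
    n = len(docs)
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * vec[j] for lam, vec in pairs]
            for j in range(n)]
```

In this corpus "missile launch" and "rocket test" share no words, so their raw cosine similarity is zero, but in the two-dimensional latent space they land in the same direction because "launch rocket test" links their vocabularies. That indirect association is what makes the technique useful for searching text that never uses your exact query terms.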
In short, this doesn't sound like super-secret spy stuff. I'll give long odds the real work is in journals and webpages that are publicly available. Having a couple billion dollars to speed up testing and implementation probably helps, but none of this sounds revolutionary or years ahead of the curve.
Less well known is their Foreign Broadcast Information Service (FBIS), for which generations of linguists have listened to the hype output of governments worldwide. (FBIS refers to this as "open source" material.)
They've been hoping for years to automate some of this stuff, and apparently they've succeeded. It doesn't require particularly good speech recognition, since the basic goal is to pull out the interesting stuff from the endless drivel.
This sort of info is used to answer questions like "Is country X changing their policy on Y", and "Who is speaking for country X on subject Y?" This is basic political intelligence information.
Actually, what it sounds like the CIA is working on is mining data out of public sources. There's good reason to think that you can discover a lot of what governments want to keep hidden if you can just go through enough publicly available data and correlate it. For instance, you can probably get a good idea of a government's secret spending by adding up how much money it takes in through taxes and borrowing and subtracting its disclosed expenditures, provided that you can actually track both of those things. It looks hopeless because there's so much data to go through, but with good computers it should be possible, especially if the other guys have a lot of secret spending. Or you can figure out what the inner circle of the government really thinks by looking at all of the news leaks from highly placed government officials.
This stuff scares the crap out of governments that are both required to be open and interested in hiding things from other countries. You simply can't hide everything, especially not anything big enough to be really interesting, because it has to interface with the world somehow. The CIA obviously wants to get really good at this kind of thing, and monitoring vast quantities of mundane stuff like TV news programs, budgets, and corporate annual reports is part of the process. The best part is that if you can do this effectively, you don't need spies as much, but you do need a lot of drones to go through huge piles of paper and TV and enter the raw data into the computers for processing. There's probably some filtering of the interesting stuff out of intercepted videoconferences, too, but it's amazing how many paper-pushing drones wind up working in a sexy-sounding business like spying.
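The secret-spending arithmetic is simple enough to sketch. The figures below are entirely made up, the field names are hypothetical, and the estimate ignores things a real analyst would have to handle, such as changes in cash reserves or off-budget revenue.

```python
# Invented public figures, in billions of an unspecified currency.
PUBLIC_FIGURES = {
    1998: {"tax_revenue": 500, "net_borrowing": 40, "disclosed_spending": 510},
    1999: {"tax_revenue": 520, "net_borrowing": 55, "disclosed_spending": 530},
}


def unexplained_spending(year_figures):
    """Money taken in (taxes plus borrowing) minus spending that was disclosed."""
    money_in = year_figures["tax_revenue"] + year_figures["net_borrowing"]
    return money_in - year_figures["disclosed_spending"]


def unexplained_by_year(figures):
    """Apply the estimate to every year, in chronological order."""
    return {year: unexplained_spending(data)
            for year, data in sorted(figures.items())}
```

For these invented numbers the gap is 30 for 1998 and 45 for 1999. The arithmetic is the easy part; the hard part, as noted above, is actually tracking the inputs.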
There's no point in questioning authority if you aren't going to listen to the answers.