Data Mining And The CIA

It's been around... by Anonymous Coward · 2001-03-06 14:24 · Score: 2

This kind of technology has been around for quite some time now. I first heard about it a few years ago at a German university research institute, the DFKI. For those interested, one of the relevant projects is Olive, and some information can be found at http://www.dfki.de/pas/f2w.cgi?ltc/olive-e.

The bottom line of this kind of technology is that although the speech recognition itself is relatively poor, it is helped by the fact that most of the interesting words (names of people, places, etc.) occur many times in the same segment. So, it's all statistics. No accurate transcription needs to be made to achieve this kind of result. Therefore, applications such as automatic subtitling, which do need an accurate transcription, are not possible with such systems. And I think they are still quite far away too.
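A minimal sketch of that statistical idea, in Python. The sample transcript, the stopword list, and the name salient_terms are made up for illustration; this is not how Olive or any real system works. It only shows that words which recur within a segment can still surface when individual recognitions are wrong:

```python
from collections import Counter

def salient_terms(recognized_words, stopwords, min_count=2):
    """Return non-stopwords that occur at least min_count times in a segment."""
    counts = Counter(
        w.lower() for w in recognized_words if w.lower() not in stopwords
    )
    # most_common() orders by count, so the most repeated words come first.
    return [word for word, count in counts.most_common() if count >= min_count]

# Hypothetical recognizer output: "president" was misheard once as "present",
# but the place names recur often enough to stand out anyway.
segment = (
    "the president of france met the present of france "
    "in paris while paris watched"
).split()
STOPWORDS = {"the", "of", "in", "while", "met"}

print(salient_terms(segment, STOPWORDS))
```

Nothing here depends on any single word being recognized correctly, which is why a poor recognizer can still be good enough for indexing.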

As for the CIA claiming breakthroughs, well, I think other people can say wittier things about that.

Theo

Re:Misleading story, but looks who is talking by Olivier+Galibert · 2001-03-06 15:09 · Score: 2

: I don't understand why they specifically mentioned TV and radio.

Large (but somewhat predictable) vocabulary, speakers trained to overarticulate, no overlap between different speakers, slightly simpler language model (complete phrases, language close to written).

State-of-the-art recognizers have an error rate of ~10% on that test, which until last year was one of the evaluation tests at the speech group at NIST. Check http://www.nist.gov/speech/tests/index.htm for details.

Since the point of diminishing returns was reached, the test is going to be replaced with a new one: transcription of audio/video-recorded meetings. Much, much harder.

OG.

Re:Uhm, yes. by Zachary+Kessin · 2001-03-06 23:46 · Score: 2

I remember a bit in one of the Tom Clancy books in which Jack Ryan tells his wife that CNN often knows stuff faster than the CIA. I believe it's probably true, at least for some types of things. And I would bet heavily that in places like the CIA and the Pentagon, TVs tuned to CNN are rather common.

--
Erlang Developer and podcaster

Re:Not terribly new or surprising by Zachary+Kessin · 2001-03-06 22:49 · Score: 2

Probably a few tens of thousands are what the CIA is interested in. I'm sure that at the CIA or NSA there is a room of people listening in to the BBC World Service, Radio Moscow, CNN, etc. Let's face it, the CIA does not have agents everywhere, and much of what you can get off the public wire is good information.

Plus you can also find out what world leaders are thinking by reading the newspapers in a country and listening to the national radio station. I would imagine that something like a TiVo would make this much easier for them.

--
Erlang Developer and podcaster

Human element by ciurana · 2001-03-06 13:15 · Score: 2

From the last sentence in the article:

Another intelligence official, on condition of anonymity, said: "If they have this kind of technology to plumb the depths of open sources, you can imagine what kind of technologies they have to track down spies."

All this technology wasn't good enough to track down Aldrich Ames, Edward Lee Howard, or the FBI's Hanssen, who together are probably the biggest moles in the history of espionage. People forget that tools are useful/automatic, but they aren't intelligent. Someone must be at the controls to interpret and act on the data. This tool sounds great, and there could be potential civilian uses beyond CI, but people must remember it's only a tool.

Cheers!

E

--
http://eugeneciurana.com | http://ciurana.eu

Re:Actually.. by David+Gould · 2001-03-07 10:19 · Score: 2

Also without bothering to RTFA, I'll repeat Paradise_Pete's question: Do you know what a neural net is?

You see, as I assume was his point, "computer chip neurons" work differently from central processing units, but not from the "data structure neurons" that can be trivially implemented in a program running on a "regular computer" to simulate the exact same neural net. The fact that they did it in hardware is interesting in its own right to someone interested in neural net research (I'll probably go read it later), and perhaps the speed factor is so great that a software version couldn't run in real time (which I guess could be what you meant) or would require an astoundingly powerful and expensive conventional computer in order to do so, but there is nothing special about "computer chip neurons" that in principle prevents the same thing from being done in software on a "regular computer".

Maybe this truly "doesn't run on regular computers" simply because they haven't implemented such a software-based simulator, but that's very different from implying that it's based on some kind of exotic technology that a Von Neumann machine is fundamentally incapable of duplicating, which is what it sounded like you were claiming and which is probably what Paradise_Pete objected to (and was wrongly punished for).
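To make the "data structure neurons" point concrete, here is a rough sketch of a simulated neuron in ordinary Python. It assumes the textbook weighted-sum-plus-sigmoid model, which is almost certainly not the model the USC hardware uses; the names neuron and layer and all the numbers are invented for illustration:

```python
import math

def neuron(inputs, weights, bias):
    """One simulated neuron: a weighted sum squashed by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is just a list of neurons fed the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical two-neuron layer with made-up weights.
outputs = layer([0.5, -1.0], [[1.0, 2.0], [-0.5, 0.25]], [0.0, 0.1])
print(outputs)
```

A hardware version may be faster, but every step above is plain arithmetic that any conventional CPU can do.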

David Gould

--
David Gould
main(i){putchar(340056100>>(i-1)*5&31|!!(i<6)<< 6)&&main(++i);}

Re:If this has been around for a while by maggard · 2001-03-06 14:26 · Score: 2

Because Closed Captioning also includes contextual information and is often not a literal transcription but a synopsis slightly rewritten to contain the original elements but shorter so as to be more easily read on a TV screen while remaining synchronous with the action.

My English teacher would cringe at that run-on.

--
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.

Uhm, yes. by Black+Parrot · 2001-03-06 12:50 · Score: 2

The moral of this story is, if you're a spy, don't televise your meetings with your control.

I mean, like, really, now, dude. Are they going to start scanning soap operas for the sake of national security? Is Jay Leno broadcasting national secrets? Someone clue me in on the intelligence application here.

I suppose it might be handy for transcribing the numbers stations, though somehow I doubt that they'll seem quite so glamorous in ASCII:

12 2 9 78 16 1 289 8 6 89 9...

--
Sheesh, evil *and* a jerk. -- Jade

Re:Uhm, yes. by Malcontent · 2001-03-06 16:14 · Score: 2

Not a legal vote count, sorry. During the election every single Republican uttered the same mantra: "Any subsequent counts done by any press will be unreliable and false". You seem to be contradicting Howard Baker there.

--
War is necrophilia.

Re:Uhm, yes. by rgmoore · 2001-03-06 13:38 · Score: 5

Actually, it sounds like what the CIA is working on is mining data out of public sources. There's good reason to think that you can discover a lot of what governments want to keep hidden if you can just go through enough publicly available data and correlate it. For instance, you can probably get a good idea of a government's secret spending by figuring out how much money they're taking in through taxes and borrowing and subtracting out their published expenditures - provided that you can actually track both of those things. It looks hopeless because there's so much data to go through, but with good computers it should be possible, especially if the other guys have a lot of secret spending. Or you can figure out what the inner circle of the government really thinks by looking at all of the news leaks from highly placed government officials.

This stuff scares the crap out of governments that are required to be open but also interested in hiding things from other countries. You simply can't hide everything, especially not anything big enough to be really interesting, because it has to interface with the world somehow. The CIA obviously wants to get really good at this kind of thing, and monitoring vast quantities of mundane stuff like TV news programs, budgets, and corporate annual reports is part of the process. The best part is that if you can do this effectively, you don't need spies as much, but you do need a lot of drones to go through huge piles of paper and TV and enter the raw data into the computers for processing. There's probably some filtering of the interesting stuff out of listening in on videoconferences, too, but it's amazing how many paper-pushing drones wind up working in a sexy-sounding business like spying.
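The budget arithmetic above, as a back-of-the-envelope sketch. The function name and every figure are hypothetical, and a real estimate would have to deal with reserves, off-budget funds, and bad data; this only shows the subtraction:

```python
def estimated_hidden_spending(tax_revenue, net_borrowing, published_spending):
    """Money coming in minus money publicly accounted for."""
    return tax_revenue + net_borrowing - published_spending

# Made-up figures, in billions of some currency.
gap = estimated_hidden_spending(
    tax_revenue=500, net_borrowing=80, published_spending=540
)
print(gap)
```

A gap that stays large and positive year after year would be the hint that something is being spent off the books.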

--
There's no point in questioning authority if you aren't going to listen to the answers.

Open the code? Yeah, right. by Lotek · 2001-03-06 12:48 · Score: 2

Sadly, I think we have as much chance of seeing Microsoft open the source to Office as we do of seeing the CIA release this.

Then again, stranger things have happened. But I would bet the proverbial farm that the guts of the software are classified.

Not about speech recognition by harmonica · 2001-03-06 12:56 · Score: 2

The article is not so much about speech recognition (as some other comments have suggested). It deals with the possibilities of labeling speakers, storing data from all kinds of sources in a database, and detecting previous statements by a person. So, this is more about the intelligent combination of various existing techniques (including speech recognition and machine translation).
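A toy sketch of the "label speakers, store everything, look up earlier statements" combination. The class StatementArchive, its methods, and the sample entries are all invented here; the article says nothing about how the real database is organized:

```python
from collections import defaultdict

class StatementArchive:
    """Transcribed statements, indexed by the labeled speaker."""

    def __init__(self):
        self._by_speaker = defaultdict(list)

    def add(self, speaker, date, source, text):
        # Dates are ISO strings (YYYY-MM-DD) so they sort chronologically.
        self._by_speaker[speaker].append((date, source, text))

    def previous_statements(self, speaker, keyword):
        """Return a speaker's statements mentioning keyword, oldest first."""
        keyword = keyword.lower()
        return sorted(
            s for s in self._by_speaker.get(speaker, [])
            if keyword in s[2].lower()
        )

archive = StatementArchive()
archive.add("Minister X", "2001-02-10", "radio", "We will lower tariffs.")
archive.add("Minister X", "2000-11-03", "tv", "Tariffs must stay.")
archive.add("Minister X", "2001-01-15", "tv", "The weather is fine.")
for date, source, text in archive.previous_statements("Minister X", "tariff"):
    print(date, source, text)
```

The hard parts (speaker labeling, transcription, translation) happen before anything reaches a structure like this.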

Personally, I think this has been done before to a certain degree; the resources available to the CIA (and their counterparts) are just becoming incredibly huge. Given the increasing amount of traffic that is generated by Internet users, they're probably pretty happy about that.

On the terrorists who are being mentioned all the time in that article: they're probably using encryption technology anyway, so I'm not sure if the really dangerous people will be caught with that system.

Re:Actually.. by James+Lanfear · 2001-03-07 01:16 · Score: 2

"In benchmark testing using just a few spoken words"

And why did the benchmark only involve a few words? Because that's all it can recognize. This thing doesn't do speech recognition, it does sound recognition; IIRC, it can only handle single-syllable words, and only four or five at that, and no sound-alikes. (I think "yes" and "no" were half its vocabulary.) It might be a breakthrough for such a small ANN, but it's not that useful as a natlang system. I suppose something similar could make a good front-end to a more complete system, though.

Re:..Sounds good to me! by Jace+of+Fuse! · 2001-03-06 13:30 · Score: 2

I know a large group of paranoid people who like to start all of their unimportant phone conversations with "I'm going to kill the president" or some such gibberish because they are firmly convinced that all telephone conversations are being monitored by some Echelon-type system, and have been for 20 years. They believe that by throwing such "noise" out there, they're helping protect everyone's privacy.

What amuses the hell out of me though, is that this kind of works against them if their own theories hold true.

The way I see it, almost nobody else goes to such efforts no matter how paranoid they are, and even if some phone-listening machine was being put to use, all they're doing is ensuring that they will be listened to.

And it's not that I don't think this sort of thing goes on or anything, it's just that I don't bother fighting it anymore now that they're able to read (and control) all of our minds anyway.

"Everything you know is wrong. (And stupid.)"

--

"Everything you know is wrong. (And stupid.)"

Moderation Totals: Wrong=2, Stupid=3, Total=5.

okay, who will file the FOIA request? by kevinank · 2001-03-06 12:54 · Score: 2

Unless it is classified it is supposedly public data, so you should be able to get a copy of the source code through a Freedom of Information Act request.

Anyone got a couple of spare lawyers looking for a fun afternoon or twenty?

--
LibBT: BitTorrent for C - small - fast - clean (Now Versio

Oasis and Foreign Broadcast Information Service by pease1 · 2001-03-06 19:15 · Score: 2

Wanna see the results?

I suspect the most common use of this sort of software is to monitor foreign broadcasts - something the CIA/OSS has been doing for more than 50 years. Traditionally, this has been done through a group (mentioned in the article) called the Foreign Broadcast Information Service (FBIS). FBIS monitors newspapers/broadcasts of many, many non-US media sources and makes this information available to US Government agencies.

For many years, FBIS made available to the public a daily paper copy product via the US Dept of Commerce's National Technical Information Service (NTIS) that was fedex'ed daily to hundreds of subscribers around the country/world. There were several issues, broken down by regions. For many years, it was one of the best public ways to track what was happening in the Soviet space program.

It's widely known that FBIS/CIA has been developing and using technology to aid the translation process for many years.

A few years ago, they dropped the paper product and moved to an electronic version.

The FBIS server to distribute the information to US Government users can be seen at http://199.221.15.211/ and can be found via a simple Google search on "FBIS".

The public can access this information via NTIS's World News Connection system (http://wnc.fedworld.gov). Yes, there is a charge to use WNC, because NTIS has to pay copyright (gasp!!!!!) to the foreign sources (just because you steal the data stream doesn't mean you own it!) as well as operate the system. It's pretty well known that foreign sources who complain loudly enough also get paid by the Govt for the US govt's use of the data.

Re:Not terribly new or surprising by peccary · 2001-03-06 19:57 · Score: 2

My guess is that it's really fairly poor speaker independent stuff. It probably does a quick, low quality word recognition algorithm

It doesn't always have to be speaker-independent. Since it doesn't have to be real-time, all you need to do is identify the speaker, and then start over. If we're really talking about TV and radio sources, then there are going to be a large number of regularly-appearing speakers. Just a SWAG, but I'll bet that under a million people account for 80% of all the TV and radio minutes worldwide.

Re:well... by grammar+nazi · 2001-03-06 14:25 · Score: 2

Here are some fitting OASIS acronym definitions from acronymfinder.com:

Observation At Several Interacting Scales
Operational Application of Special Intelligence Systems
Oracle Application Software Implementation Strategy

"My one oasis in the dust and drouth Of city life."--Tennyson

--

Keeping /. free of grammatical errors for ~5 years.

Re:Misleading story, but looks who is talking by raju1kabir · 2001-03-06 15:55 · Score: 2

I don't understand why they specifically mentioned TV and radio. If the audio is digitised before being passed to the software, it doesn't really matter where it comes from. Maybe they are trying to draw attention away from the fact that it can be used on things like making transcripts of phone calls, normal conversations recorded with various listening devices?

I don't think you realize how boring and mundane most intelligence work is. Thousands of extremely junior people sit all day long translating newspapers and transcribing radio/TV broadcasts. Much of this stuff is made available through FBIS (pronounced "fibis") to further bore people slightly higher up the ladder throughout the government and contracting agencies.

However, it is useful once in a while. Especially when looking back and saying "Now how didn't we catch that?" If it could be brought online cheaper and more quickly, I can see how this would be well worth the money - without being particularly draconian (except insofar as the concentration of enough otherwise innocuous information can be quite powerful).

Sometimes, just sometimes, they mean what they say.

--
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS

When it becomes public... by quasipalm · 2001-03-06 13:31 · Score: 2

If Google can find the space to archive the internet, don't you think the CIA could find enough space to archive all of these broadcasts in ASCII format?

I personally would LOVE to see a huge searchable, on-line database of everything ever said by anyone that was broadcast. Imagine the implications. I'd search for all of my local politicians to see if they ever said anything stupid in their previous life as a coked-out-Miami-televangelist. I'd also search for my own name to see if I missed a song dedication or an NPR sponsorship in my name.

I guess a notable drawback is that the CIA could pretty easily scan cell-phone bandwidths as well... documenting any 'notable' private conversations. Perhaps we should all start talking in Pig Latin to avoid the CIA's attention, à la Napster?

Spy On Me? by Regolith · 2001-03-06 14:33 · Score: 2

"If they're going to use my money to spy on me, can't they at least open source the code so I can dictate a letter?"

I may be wrong, but doesn't the CIA's charter say that they cannot conduct operations on native soil?

--

Bow before my sig, for it is good.

That's funny... by eric2hill · 2001-03-06 20:48 · Score: 3

Just this morning I was joking with my wife about buying an alarm clock that snoozed when I yelled 'shut up', 'piss off', 'go away', or 'it's Saturday'... Maybe this technology will lend itself to alarm clocks in the future :)

Morning sarcasm. I'll get back to work.

--
LOAD "SIG",8,1
LOADING...
READY.
RUN

TellMe by Fervent · 2001-03-06 12:48 · Score: 3

I just want the damn TellMe service to work. How many times do I have to say "New Jersey Devils" before the sports program on that thing recognizes "Hey, he's talking about a hockey team!"

Beyond that, the TellMe service should also recognize the command "shut up" along with "stop" and "tell me more". I mean, if you're going to have a voice-activated phone portal, why not use "natural language" for commands? ("Shut the hell up you stupid bitch! I said 'stock quotes' not 'stock racing'!")

For those of you who have no idea what I'm talking about, dial 1-800-555-TELL. The service is free, for now.

--

- I don't care if they globalize against free speech. All my best free thoughts are done in my head.

Actually.. by dietcrack · 2001-03-06 18:18 · Score: 3

Speech recognition does exist that, if not 100% accurate, has been demonstrated to be significantly more accurate than human speech recognition. It was being developed at USC while I was there. Something about a neural network of some sort, so it doesn't run on regular computers, but, and I quote,

In benchmark testing using just a few spoken words, USC's Berger-Liaw Neural Network Speaker Independent Speech Recognition System not only bested all existing computer speech recognition systems but outperformed the keenest human ears.

and

The system can distinguish words in vast amounts of random "white" noise -- noise with amplitude 1,000 times the strength of the target auditory signal.

I don't know about you, but I'm pretty damned impressed.

the article on this system

--

Without the pad, it's not Dance Dance Revolution, it's Listen

Misleading story, but looks who is talking by Codeala · 2001-03-06 14:01 · Score: 3

...that can convert the audio from television and radio broadcasts into text.

I don't understand why they specifically mentioned TV and radio. If the audio is digitised before being passed to the software, it doesn't really matter where it comes from. Maybe they are trying to draw attention away from the fact that it can be used on things like making transcripts of phone calls, normal conversations recorded with various listening devices?

About that feature that IDs the speaker, imagine a conversation that goes like this:

Speaker 1: You the Man.
Man: No, YOU the MAN.
Man: No no, you Da Bomb
Da Bomb: Hehe

Watch word: BOMB Alert! Alert!

As a final side note, I won-der... if... it... works... if... you... talk like... Cap-tain... K-irk... ;-)

====

--

Codeala - Just another mindless drone

Not terribly new or surprising by vlax · 2001-03-06 13:44 · Score: 4

They don't seem to have very accurate speech recognition technology. The article claims to reduce transcription time by a factor of about nine. That's a lot less unreasonable than believing in good speech recognition technology.

My guess is that it's really fairly poor speaker-independent stuff. It probably runs a quick, low-quality word recognition algorithm - quite a few of those are around - and then some sort of Bayesian network to correct the transcription using lexical context. I know that ARPA was openly funding people doing exactly that a few years ago, and I'll bet their papers are on the web. It doesn't shock me greatly that someone has had some measure of success with it.
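Here is a rough Python sketch of what "correct the transcription using lexical context" can look like. It is a simpler stand-in for a Bayesian network: a bigram model rescoring the recognizer's word candidates with a Viterbi search. The function best_transcript, the candidate lists, and all the probabilities are made up for illustration, and I have no idea what the CIA's system really does:

```python
import math

def best_transcript(candidates, bigram_prob, floor=1e-4):
    """Pick one word per position, maximizing acoustic score * bigram score.

    candidates: list of {word: acoustic probability} dicts, one per position.
    bigram_prob: {(previous_word, word): probability}; unseen pairs get floor.
    """
    # best[w] = (log score of the best path ending in w, that path)
    best = {w: (math.log(p), [w]) for w, p in candidates[0].items()}
    for position in candidates[1:]:
        new_best = {}
        for word, acoustic in position.items():
            score, path = max(
                (prev_score + math.log(bigram_prob.get((prev, word), floor)),
                 prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score + math.log(acoustic), path + [word])
        best = new_best
    return max(best.values())[1]

# Hypothetical recognizer output: it slightly prefers "smoke" at the end.
candidates = [
    {"prime": 0.6, "crime": 0.4},
    {"minister": 0.5, "mister": 0.5},
    {"spoke": 0.3, "smoke": 0.7},
]
bigrams = {
    ("prime", "minister"): 0.5,
    ("minister", "spoke"): 0.2,
    ("minister", "smoke"): 0.01,
    ("crime", "mister"): 0.001,
}
print(best_transcript(candidates, bigrams))
```

With these numbers the context model overrules the recognizer's preference for "smoke", which is the whole point: a cheap recognizer plus word statistics can beat the recognizer alone.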

If it were 100% accurate transcription, then I wouldn't believe it. But as a time-saving device for transcribers... that I find credible.

DARPA also funds a lot of automatic topic spotting research. One of my ex-profs received grants from them under just such a rubric and her papers are publicly available on the web. I'll bet whatever technology they are using, it was developed by a prof at an open university who publishes freely.

As for multilingual text searching and summarisation, the best technology of its kind known to me is Latent Semantic Analysis - the brainchild of Thomas Landauer. It's a fairly recent, but hardly secret or obscure, indexing technique that's gaining ground commercially for data mining applications. It can certainly do the small number of things being claimed by this article. All the relevant papers are on the web.

In short, this doesn't sound like super-secret spy stuff. I'll give long odds the real work is in journals and webpages that are publicly available. Having a couple billion dollars to speed up testing and implementation probably helps, but none of this sounds revolutionary or years ahead of the curve.

Listening to public broadcasts by Animats · 2001-03-06 14:02 · Score: 4

Much of what the CIA does consists of collecting publicly available information. Some of this they now distribute to the public. The CIA World Factbook is the best known example.

Less well known is their Foreign Broadcast Information Service (FBIS), for which generations of linguists have listened to the hype output of governments worldwide. (FBIS refers to this as "open source" material.)

They've been hoping for years to automate some of this stuff, and apparently they've succeeded. It doesn't require particularly good speech recognition, since the basic goal is to pull out the interesting stuff from the endless drivel.

This sort of info is used to answer questions like "Is country X changing their policy on Y", and "Who is speaking for country X on subject Y?" This is basic political intelligence information.
