Finding a Needle in a Haystack of Data
Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]
Does Google have the technology to do this kind of scientific search yet?
If it does, it sure can save these researchers a lot of time; if it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.
Virtual Betting on Facebook for non-geeks.
It just refused to load for me.
All you have to do is index it properly, and lots of data can be searched really fast.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
... but be advised that the links do not point to his site, but to the actual article (are the editors doing this, or has he given it up?).
I see this as being a boon to SETI. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?
GetOuttaMySpace - The Anti-Social Network
It would be a lot easier to find data if we tagged it with things like the Evil Bit and the Broadcast Flag, or like Technorati's blog tags that users associate with their data so a search can find it.
What does God use to tag a galaxy with though?
Saskboy's blog is good. 9 out of 10 dentists agree.
82.67% of all statistics are made up anyway...
I can't even find my keys some days.
A good strategy for finding useful information in oceans of data is to reduce the data set by sloughing off large chunks of irrelevant data. For example, if you want to find useful info in /., you would want to start by excluding all stories submitted by Roland Piquepaille.
I see all of these IT breakthroughs and it's just giving me the creeps!
My first post got modd'ed down. Help me.
"But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people."
When asked about more advanced applications for the technology, researchers replied it will probably be "quite a while" before the technology could be used for extremely high noise environments. Said one, "I mean, it's going to be a long time before we're up to finding useful comments on Slashdot or something."
2. ?? ----- New statistical techniques 3. Profit!
Sounds like they've been watching Numb3rs ;-)
"What does god want with a starship?" -Spock
Zhrodague.net - I do projects and stuff too.
Could surely use this technology on slashdot to find those very few intelligent &/| funny comments.
The Case team discovered a technique that is built on the principle of comparing a set of summary characteristics for any sub region of the observations with the background variation. From these characteristics, attempts are made to find small regions that appear significantly different from the background--a difference that cannot simply be attributed to random chance
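For anyone who wants something concrete, here is a rough sketch of the general idea in that quote: slide a window over binned counts and compare each window's total with what the background alone would give. This is not the paper's score process. The flat Poisson background, the function name `scan_windows` and the window width are all assumptions made up for illustration.

```python
import math

def scan_windows(counts, background_rate, width):
    """Return (start, z) for the window whose total count most exceeds
    what a flat Poisson background (assumed here) would predict."""
    expected = background_rate * width
    best_start, best_z = 0, float("-inf")
    window_sum = sum(counts[:width])
    for start in range(len(counts) - width + 1):
        if start > 0:
            # Slide the window one bin to the right.
            window_sum += counts[start + width - 1] - counts[start - 1]
        # For Poisson counts the variance equals the expected total.
        z = (window_sum - expected) / math.sqrt(expected)
        if z > best_z:
            best_start, best_z = start, z
    return best_start, best_z

# Toy data: a flat background of 10 per bin with a bump in bins 40..44.
counts = [10] * 100
counts[40:45] = [20] * 5
best_start, best_z = scan_windows(counts, background_rate=10, width=5)
```

The catch is that the largest z over many windows is not distributed like a single z, so the "cannot simply be attributed to random chance" part needs a threshold for the maximum, which is where the hard statistics comes in.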
So, basically it's the one search engine that can only find the words "horny teen nekkid" if it is NOT on a pr0n-page. I can see uses for that. Not for me, but I'm sure SOMEONE is interested in finding other kinds of pages once in a while.
Why do we need to be warned that it's a PDF? I can understand an "MS Word Warning" but PDF is platform independent. What's wrong with PDF?
Stay tuned for the upcoming Google release of this technology (Beta of course) - To be followed by MSN and Time Warner combining powers in an attempt to compete.
I'd use it to find some decent pr0n among the oceans of crap out there. But seriously, 'searching' is a very interesting subject and a lot deeper than most people realize. Try to explain how Google works to a non-geek.
"...a difference that cannot simply be attributed to random chance..." If it's random, how do you know?
Also the first "useful" application for this kind of technique which popped up in my head. Actually, the process in my head that made this one item pop up is maybe useful too (-: Lots of random data, and this one is being associated with the article.
My wife's sketchblog Blob[p]: Gastrono-me
FYI: Its abbreviation is not "CWRU" anymore. As of about 2 years ago, they changed it to simply "Case" and gave it the silly new logo of 2 paperclips stuck together.
Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked a bunch of alumni (from the original Western Reserve University) too.
Knowing is half the battle.
Karma: NaN
Someone asked me to give ten different ways to find a needle in a haystack, these are my thoughts:
1) INDUSTRIAL MAGNENT
2) BLIND LUCK
3) BURN THE HAY, PICK UP THE NEEDLE
4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10000 OF WORKERS.
6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
7) SPREADOUT THE HAY, HIRE BAREFOOT HAY WALKERS
8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
10) TRIAL AND ERROR - ONE PERSON
"This isn't a study in computer science, its a study in human behavior"
Perhaps this technology can make Usenet useful once again.
At least he links to the original articles now. However, it still seems like he is trying to better his pagerank (just like that Beatles-Beatles guy). This doesn't irritate me as much as what he used to do, but I still think it's pretty lame. I for one applaud all the people who submit stories without the need to link to their personal or business websites (aka "slashvertisements"). We need more people like them.
"They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index."
And somehow Google manages with Slashdot.
End
Of
Message
You can have your god back when you are old enough to handle the responsibility.
1) Decide what signal you want
2) Generate a large enough random dataset that it is inevitable your desired signal appears
3) Profit!
"It just refused to load for me."
Maybe your interest in the story was deemed statistically insignificant.
Beauty is in the eye of the beerholder.
End of Message
Mythbusters actually did an ep where they built two different needle-in-haystack finding machines, one actually did quite well...
-everphilski-
...all you need is a good comb :)
So... I'm just another piece of hay... :(
I know the warranty will be void if you shave off the pubic hair yourself (intentional damage to the product), but you might want to try it anyway. Buy the hairless variety next time and you should be in good shape.
Would this be useful to reduce the computations needed for the SETI@Home folks too? Seems they have a bit of data to sort through... Hell, genetic engineering too. Look for useful patterns in hundreds of DNA strands.
today is spelling optional day.
1) INDUSTRIAL MAGNET
DBAs everywhere are cringing and covering their data.
Believe me, I have.
.. looking for stuff on your [real world] desktop?
I have, have actually had my arm and fingers twitching for the keyboard...
I think I need a major vacation soon, somewhere with no IT-devices whatsoever.
a.c.
sig? Oh, that sig...
Mythbusters did this one already. They built two machines/processes to find needles in haystacks. One used a process to burn away the hay leaving the needles and the other used magnets and gravity to separate the needles from the hay.
Oh, wait. They're talking about data. Never mind.
from the moment you posted that comment, the value you gave increased just a little bit more ...
This looks like something related to straw... weird, huh?
THE SINGULARITY
Throughout history, we championed the content creator. Only a tiny fraction of the population could write or understand math or science. Only a tiny fraction could dedicate themselves to the arts.
Most individuals' time was consumed by being agrarian generalists: they owned a farm, and they were constantly occupied by all the repairs and maintenance of their property. It wasn't a job, it was a way of life. But now, more and more, our economy makes us all incredible specialists. We're confined not only to a literal cubicle, but to a cubicle of tasks, often only seeing one tiny part of our contribution to social welfare. But as a result, we end up with leisure time. (Cf. Judge Skelly Wright's opinion in Javins v. First National Realty Corporation). While those reading /. while at work might quibble, the fact is that we all now have meaningful leisure time in some sense; we're not dedicated 100% to our livelihood.
In addition, current technology is allowing us to collaborate and share information as a global community as never before.
What does all this mean? For one, it means that techies can have bands, and even get national coverage, without giving up their day jobs. In fact, if MySpace is any evidence, anyone can have a band... and a lot of us already do. Also, given that 80,000 blogs are created each day (though 40,000 are probably also abandoned each day), huge throngs of people have something to say and are able to say it to huge, unrelated throngs of people.
The singularity is similar to the way other areas of economics have evolved. It used to be that 90% of the population made 100% of the food, and now only 10% of the population provides 100% of the food. It's the opposite for art and science (naturally, as we're freed from producing necessities, we can devote more time to producing luxuries, improving general quality of life, and solving more complex problems). Traditionally, 1% of the population made all the cultural content. The singularity? Soon, 99% of the population will be making 100% of the content.
For the first time in history, we are the captains not only of our personal destiny, but of our cultural destiny. However, as cultural creativity becomes so democratized, our contribution will become less and less controlling. Like Warhol said, it's not that we're all going to be famous, it's that we each only get 15 minutes.
THE DOWNSIDE OF A CULTURE OF CREATIVES, AND A SILVER LINING FOR SEARCH
A professor once said to me, "No one cares how much you know anymore, that's why we have the Internet. The important thing is creating new ideas." The formidable aspect of the new society of cultural creatives is that soon, no one will really need you to create ideas anymore either. Your drop in the cultural bucket is less and less meaningful every day. Content is easier and easier to make and share, and everyone wants to play, so as a corollary, it will become harder and harder to find compensation as a cultural creative.
So what's the new valuable thing, in this storm of data/content? Maybe not making worthwhile contributions to the arts, science, knowledge, (which is important, but self sustaining). However, finding the worthwhile signal amidst all cultural noise is becoming more and more valuable. Someone needs to be a sieve for all the content being thrown around right now. Technologies of search and sort are the ways to do it. Google is not prospering because it learned something about advertising. Google is prospering because it precociously encapsulates the spirit of the dawning age, while most of us are still trying to figure out just what the hell I'm talking about.
Mmmmmmmmm bush.... better than stubble...
It's better to have an a priori hypothesis and look for one specific, pre-defined pattern in a mound of data than to see if any pattern at all is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.
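To put rough numbers on "correspondingly higher", here is a small sketch using the Bonferroni correction, which is one standard (and conservative) way to raise the bar. It assumes the tests are independent and that there is no real signal anywhere; the function names are just for illustration.

```python
def familywise_error(alpha_per_test, num_tests):
    """Chance of at least one false positive across independent tests
    when nothing real is present in the data."""
    return 1.0 - (1.0 - alpha_per_test) ** num_tests

def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / num_tests

# Looking for 100 different patterns at the usual 5% level:
naive = familywise_error(0.05, 100)
corrected = familywise_error(bonferroni_threshold(0.05, 100), 100)
```

With 100 looks at the data, the naive approach is nearly guaranteed to "find" something, while the corrected per-test level of 0.0005 keeps the overall false-alarm rate under 5%.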
Two wrongs don't make a right, but three lefts do.
SETI?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Current fraud detection systems in use in the financial industry are based on two primary knowledge bases:
1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.
Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?
2. A comparison of recent transactions. For example:
A sudden wave of big-ticket purchases very close together in time, such as hitting a Best Buy the same day as buying jewelry.
A single card making multiple high-value transactions (3 or more) within an hour.
A pattern of unattended-auth transaction (think pay-at-the-pump) to big-ticket purchase to unattended-auth and back.
Using geometric statistical analysis could only complement pattern analysis in any case, and I fail to see how it's superior to the existing behavior scoring algorithms, which are based on an individual's past history and weight each new transaction to determine if it's "out of profile", and if so, by what margin. Sometimes the fraud is only revealed by several transactions scoring progressively higher on the fraud-o-meter, and I suspect the geometric statistical analysis would fail to trigger on that as an event, as it would be a continuation of the pattern.
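As an illustration of the "multiple high-value transactions (3 or more) within an hour" rule above, here is a minimal sketch. It is not how any real card issuer scores transactions; the 500.00 cutoff for "high-value", the function name `flags_burst` and the data layout are assumptions invented for the example.

```python
from datetime import datetime, timedelta

def flags_burst(transactions, min_amount=500.0, min_count=3,
                window=timedelta(hours=1)):
    """transactions is a list of (timestamp, amount) pairs for one card.
    Returns True if at least min_count transactions of at least min_amount
    fall inside any single window of the given length."""
    times = sorted(t for t, amount in transactions if amount >= min_amount)
    for i in range(len(times) - min_count + 1):
        # times is sorted, so this is the tightest span holding min_count items.
        if times[i + min_count - 1] - times[i] <= window:
            return True
    return False

noon = datetime(2005, 12, 6, 12, 0)
spree = [(noon, 900.0),
         (noon + timedelta(minutes=20), 1200.0),
         (noon + timedelta(minutes=40), 750.0)]
normal = [(noon, 900.0),
          (noon + timedelta(hours=2), 1200.0),
          (noon + timedelta(hours=4), 750.0)]
```

A fixed rule like this catches the burst but not the slow ramp described above, which is why profile-based scoring is layered on top of it.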
My ability to read statistics papers is sadly out of date. Anyone want to give a shot at translating this into non-doctoral English?
"What does God use to tag a galaxy with though?"
Are you telling us there's such a thing as Intelligent Tagging?
This is why we have grep.
An article posted by Roland Piquepaille with no links back to his site???
WTF? Roland? You feeling OK?
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
"This is not the signal you're looking for..." Hope the signals they're looking for don't come accompanied with Jedis. Or maybe all the chaos is because of Jedi obfuscation of real signals?
it's been a while since i last did much perl, but shouldn't the last line of your sig be:
($world = $world) =~ s/bad/good/g;
otherwise you're making your world better but not ever doing anything with it...
The paper by Jeremy Stribling, Daniel Aguayo and Maxwell Krohn:
Rooter: A Methodology for the Typical Unification of Access Points and Redundancy
Definitely some interesting parallels between the two. Maybe someone who understands this stuff better could elaborate?
Here... it's like this.. are ya ready?: just pick a chunk of data and call it useful! It's no different than say, ummm, naming a name, which are then sold... worshipped.. reported... entire charity fund raisers setup for... and even typed about on fine forums like this one! See the post below about cyber terror that's selling the credibility... they'll show ya how 'the system' works!
Don't try time this is at light home, but.
This paper basically describes an improvement of existing, simple and classical methods to handle the case where it is hard to formulate a model for the background noise (or "normal" behaviour in the case of a fraud detection system), or where it is hard to estimate a model of the background noise because data points are expensive to collect. This method might be useful in the case of particle physics problems, where gathering data is costly, where measurements are very noisy, or where it is impossible to have prior knowledge about the background noise / the detected signal.
However, for most of the common applications, the simple, classical method (Likelihood ratio tests) works perfectly!
Likelihood ratio tests (the standard thing) work as follows. Compute:
ô1 : the set of parameters of a noise/perturbations model that fits the data.
ô2 : the set of parameters of a noise+signal model that fits the data.
L(ô1|x) = how well your data is explained by the best fitting noise/perturbations model.
L(ô2|x) = how well your data is explained by the best fitting perturbation + signal model.
If L(ô1|x)/L(ô2|x) < threshold, there is a signal, and some of its properties can be inferred from the set of parameters ô2.
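Here is a toy version of that recipe, assuming the simplest possible setup: Gaussian noise with known width, a noise-only model with mean zero, and a noise+signal model whose mean is fitted to the data. These models are my own stand-ins, not anything from the paper. It works with -2 log of the ratio, so "ratio below a threshold" becomes "statistic above a cutoff"; by Wilks' theorem the cutoff for one extra fitted parameter at the 5% level is about 3.84.

```python
import math

def log_likelihood(data, mean, sigma=1.0):
    """Log-likelihood of the data under a Gaussian with the given mean."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mean) ** 2 / (2 * sigma ** 2) for x in data)

def likelihood_ratio_statistic(data, sigma=1.0):
    """-2 * log( L(noise only) / L(noise + signal) )."""
    noise_only = log_likelihood(data, 0.0, sigma)
    fitted_mean = sum(data) / len(data)  # best-fit parameter of the signal model
    noise_plus_signal = log_likelihood(data, fitted_mean, sigma)
    return -2.0 * (noise_only - noise_plus_signal)

CUTOFF = 3.84  # chi-square, 1 degree of freedom, 5% level

shifted = [0.5] * 20         # every point pushed up by a small "signal"
balanced = [0.1, -0.1] * 10  # scatter around zero, no net shift
signal_found = likelihood_ratio_statistic(shifted) > CUTOFF
```

For this particular setup the statistic reduces to n times the squared sample mean divided by sigma squared, so the shifted data gives 20 * 0.25 = 5, just over the cutoff.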
Don't we already have regular expressions for this kind of stuff?
(ducks)
"It was hell!" recalls former child.
From TFA (emphasis added): "We propose a new test statistic based on a score process for determining the statistical significance of a putative signal that may be a small perturbation to a noisy experimental background.... We illustrate the technique in the context of a model problem from high-energy particle physics. Monte Carlo experimental results confirm that the score test results in a significantly improved rate of signal detection." Monte Carlo experimental results? So much for the betterment of mankind! These guys are just out to make a killing at the roulette table!
shut your pie hole
><));>
Why not ask the paper clip?
So, are these guys basically saying that to find "the needle", just "turn up the noise"? Hence, look at the noise patterns, then mask them out to get the key value(s)?
Sort of a dilettante question, but I've been researching using entropy and information gain here at work and some of what they're talking about in the article and the paper seems familiar, though I'm not skilled enough in stats yet to make much out of it. It seems to me to be fairly similar to how you get an information gain score. If you can classify the background as such, you should be able to sift through data with however many parameters you want and find the parameters that cause the greatest difference in how "un-random" that sample is.
So, just so I can get a foothold on this new stuff technically, is the idea that the data they have isn't able to be classified yet? Am I getting ahead in the analysis in thinking about information gain by assuming an existing classification that differentiates signal and noise? Producing IG scores is more about WHY classified data points are different and not WHICH data points are significantly different from the background, right? Maybe I'm thinking too much in terms of data mining and producing a decision tree. Maybe I have it exactly backwards: assuming you already know which parameters (and at what thresholds) are significant, does the Case-Western process produce the classification of data?
Sorry, I'm sort of thinking out loud here. Just wondering if there's a geek who can set me straight on this--my grasp of information theory is cobbled together from a bunch of google searches and wikipedia pages.
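For reference, this is the usual information gain calculation being described, as a minimal sketch. Note that it needs every point to carry a "signal" or "noise" label up front, which is exactly the assumption being questioned above. The function names and the toy data are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Drop in label entropy from splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

values = [1, 2, 3, 4]
labels = ["noise", "noise", "signal", "signal"]
clean_split = information_gain(values, labels, threshold=2)
poor_split = information_gain(values, labels, threshold=1)
```

The split at 2 separates the classes perfectly and recovers the full 1 bit of label entropy; the split at 1 recovers only part of it.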
I also know that these sorts of algorithms are created all of the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play". They must be manually adjusted by a team with in-depth knowledge of the system in order to be useful.
So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.
I often have the same feeling about Slashdot. It's like a big haystack, but the needles are larger and easier to find. I have noticed that the Roland Piquepaille needles happen to be the most worthless. The obvious solution for finding the proverbial needle in the haystack of data is to make it up. It's not like there are any real-world examples.
"You'll get nothing, and you'll like it!"
Two types of biomedical research that have this "needle in a haystack" problem are functional magnetic resonance imaging (fMRI) and computational neuroanatomy. In fMRI, very basically, you image the brain while the test subject is performing a task (looking at something, actively listening, tapping a finger, etc.) and when they are not, and use the change in local blood oxygenation to infer brain activity. Since this is a tiny signal, you repeat lots of times. The simplest way to determine where the activity is would be just to do a t-test against the background or against an assumption of no change. However, given many tens or hundreds of thousands or millions of pixels, you'll have lots of false positives, or have to use a really, really low p-value threshold. Through the magic of spatial correlations and fancy math tricks, one can do reasonable interpretations of the data, but again, it's that sort of "needle in a haystack" problem.
In computational neuroanatomy, you scan lots of brains of normal folks and lots of brains of folks with neurodegenerative diseases, say, Alzheimer's (or younger old people and older old people, that sort of thing). You perform some complex warping to map these brains onto a template brain (a real person, the younger version of the person, or some synthetic template ... all are done), then study the warpings that are needed. What you want to see is how the various lobes of the brain are basically eroding with time as the disease progresses. Again, we can do standard statistics, but we are hurt by the massive number of data points we are dealing with (again, it's pixel by pixel), so we have to use more fancy math to get around it. In this case, theories of Hotelling and Adler (referenced in the original article from the original post) are very useful.
As the amount of scientific data we have grows, we are starting to draw on what was once pure abstract mathematics to get meaningful statistics out. I can't pretend to even begin to understand the PDF article, but it's neat to see the same problems in lots of very different fields!
From the title of TFA, "Case researchers discover methods to find 'needles in haystack' in data". Pet peeve of mine: new techniques are not "discovered", they are "developed" (or something similar). Henry Ford did not discover the Model T by peering through a microscope, and CowboyNeal did not discover SlashCode by analyzing reams of code observations. It may be semantic nit-picking, but I think saying that the researchers just discovered this (surely insanely complex) bit of mathematical analysis takes away from their creativity - it all came from their heads, not from under a rock.
I know a data set that will bring this technique or any other to its knees, the S&P 500 for the past 10 years (or any other time frame.) No matter what instrument of torture you bring to bear, stock market data will not yield!
As a new graduate of the physics program at CWRU, I was quite surprised at Prof. Taylor's research. He's so caught up in his physics entrepreneurship program that I had no idea he was actually doing REAL research :)
I hope this leads to some more funding for him. He is a great teacher.
Are You Mad? Did you read the paper?
Step 1) Open the PDF
Step 2) Look at all the pretty greek letters
Step 3) Realize that infinite monkeys with infinite typewriters have a better chance of composing the collective works of Shakespeare than 99% of the population has of understanding even the EQUATIONS used in that article, much less the theory behind why the equations work the way they do. I'm not being arrogant, I don't have a flying fart of a clue why they're integrating from c^2 to infinity in order to refine a permutation of theta. I probably never will, but that's because this isn't my field of study.
Those researchers are freaking awesome for managing to translate probability into a geometry problem. Give credit where credit is due.
On the other hand, if you're talking about the ability to disseminate works, then yes, average joe can crank out all the content he wants, but I'm not holding my breath for huge world-shattering medical innovations to come out of a blog on Fashions of Utah.
If you've got a treatise, fine, many people post large rambling works on slashdot, but for the love of pete, don't post it on an article that disproves your idea!
..this? http://science.slashdot.org/article.pl?sid=05/12/05/1912216&tid=160&tid=126
Purple, because ice cream has no bones.
"As a particle physicist I know exactly the kind of challenge that this is. The SNR is horrible, you've got tons of data, and the data may be distorted by all sorts of sources (background, misalignment, the wrong reaction, etc)."
And yet the human nervous system manages the feat every day without benefit of experts.
I'm almost positive they're trying to find a decent-looking single female on that campus. Powerful application of mathematics indeed.
They seem to be using linear statistical techniques on non-linear data. Maybe use R/S analysis or Lyapunov exponents etc.
is the overwhelming size of the literature. It is getting harder and harder to find the information that you need among a sea of near misses. Even to stay on top of one's subfield would require reading at least five journal papers a day, which is a significant undertaking even before you have to spend large amounts of time hunting for papers. For example, I am a chemist. It is generally not too difficult to find papers about a specific molecule - each molecule is assigned a specific ID number, which can of course be searched, and then the results further whittled down by using relevant keywords. However, it can still be ridiculously hard to find such trivial information as "what is the best known method for making this molecule" or "what is this molecule soluble in?". Finding information on processes, however, has become a huge chore. If you think you have found a new way to make a class of molecules, you are in for days of sorting through papers hoping that no one has already had your idea - or worse yet, tried it, found that it didn't work, and never reported this information.
This information overload is pushing back the age at which scientists become productive. Back in the 1920s, many of the famous people you learn about made their huge discoveries in their 20s. Now, most Nobel-prize winning work is done in people's 40s and 50s. It simply takes that long to climb up the backs of all the giants that came before. At the rate it is going, in fifty years, scientists will die of old age before they can make it to the top.
We really need better ways to sort and condense this mass of information.
Finding a needle in a haystack is relatively easy - you just look for anything that isn't hay.
But in this case, since you don't know what "hay" is (i.e. it's often hard to define "normal"), it's more like searching through a garbage heap for something when you don't know what it looks like.
NAME
       memmem - locate a substring
SYNOPSIS
       #define _GNU_SOURCE
       #include <string.h>

       void *memmem(const void *haystack, size_t haystacklen,
                    const void *needle, size_t needlelen);
DESCRIPTION
       The memmem() function finds the start of the first occurrence of the
       substring needle of length needlelen in the memory area haystack of
       length haystacklen.
If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.
I think their derivation is quite possibly wrong.
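A quick way to see the "no expectation" property for yourself is the sketch below: averages of Cauchy samples are as scattered as single Cauchy draws, while averages of Gaussian samples tighten up like 1/sqrt(n). This only illustrates the behaviour of the Breit-Wigner tails; it says nothing either way about whether the paper's score process is affected. The helper names, sample sizes and seed are arbitrary choices.

```python
import math
import random

def cauchy_sample(rng):
    """Standard Cauchy (Breit-Wigner) draw via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def normal_sample(rng):
    return rng.gauss(0.0, 1.0)

def spread_of_sample_means(draw, sample_size, trials, rng):
    """Interquartile range of the sample mean over repeated experiments."""
    means = sorted(sum(draw(rng) for _ in range(sample_size)) / sample_size
                   for _ in range(trials))
    return means[(3 * trials) // 4] - means[trials // 4]

rng = random.Random(0)
cauchy_spread = spread_of_sample_means(cauchy_sample, 1000, 200, rng)
normal_spread = spread_of_sample_means(normal_sample, 1000, 200, rng)
```

Averaging 1000 Gaussian draws shrinks the spread to a few hundredths, but the mean of 1000 Cauchy draws is itself standard Cauchy, so its interquartile range stays near 2 no matter how much data you collect.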
Do you have a source for that quote?
It's a great quote, I'd love to be able to use it and attribute it properly.
Jim
Q: So how do I use this method to analyze stock market data and make money?
A: You can't, or any other statistical method for that matter. You'll go broke trying to find a mathematical technique to model the stock market.
Well, he's making HIS world better. Apparently, he can give a crap about the rest of us.
A professor once said to me, "No one cares how much you know anymore, that's why we have the Internet. The important thing is creating new ideas."
How sad. If a professor said that to me, I'd ask them to justify the cost of their courses when anyone could just use the Internet to learn what they teach. The brain is still many orders of magnitude faster and more accurate than the Internet in terms of knowledge retrieval. Given a PhD and a freshman student with the Internet, it's quite likely that the PhD will run circles around the student. Give the PhD 20 years experience, and they will beat graduate students and recent PhD graduates.
As soon as the REAL singularity (whether it's strong AI, computer-brain interfaces, or something else) occurs, knowledge truly will become commoditized. Then anything goes, but my guess is that people with a large existing knowledge base will learn and grow exponentially faster than those who don't.
In Slashdot, the dupe to original article ratio is so high, it's the original articles that need finding, not the dupes. Funny, though, from what I've seen, it seems like this particular algorithm would be quite efficient in doing that (e.g. it specializes in finding the data that is different, versus categorizing existing data).
We all know what to do, but we don't know how to get re-elected once we have done it
I tried to understand their paper, and I must say it's exceedingly hard to understand exactly what they did. I suppose this is often the case with PRLs (due to 4 page limit), but this one seemed particularly opaque and unimpressive. If I was going to write a paper I'd want to make it crystal clear, spell everything out (you can call me on that at the arxiv).
e.g. in the paper there is a quantity Z that is introduced first without definition, then a page later defined in terms of some vectors which are never defined. I guess you have to read their (unpublished) reference, but ugh. And the "geometry of the manifold"? What manifold? Wha? Are you a statistician or a wannabe-differential geometer?
Often it seems academics delight in trying to impress their peers with their terrible sophistication for some reason, to the point where it's really unnecessarily tough to understand something (and the high-falutin ideas in these papers usually turn out to be pretty simple and obvious or otherwise wrong, in my experience). Good job getting this one published indeed.
The world is everything that is the case
Actually, it is two paperclips put together: http://wiki.case.edu/Case_logo
Oh come on! It's not that hard!
public static Object find(Object needle, Object[] haystack) {
    for (int i = 0; i < haystack.length; i++)
        if (haystack[i].equals(needle))
            return needle;
    return null;
}
Damn! Why do the substitutions on a _copy_ of $world_as_it_is_now ? Your sig is useless, or at least incomplete.
There are disputed reports that this sort of data mining was used to identify the terrorists who attacked the USS Cole and flew airplanes into the World Trade Center (the official 9/11 commission's findings notwithstanding). The project is well documented on the right side of the web and was called "Able Danger." According to rumor, the project was shut down after identifying Mohammed Atta but also pointing to Condoleezza Rice and Hillary Clinton as potential foreign spies.
This raises the issue of false alarms in any data mining operation. Rigorous secondary testing must be in place to weed out false positive signals. I heard Richard Feynman say that (in nuclear physics) it is painfully easy to fool yourself.