Finding a Needle in a Haystack of Data

Posted by ScuttleMonkey on Wednesday December 7, 2005 @08:44AM from the mathematical-sieve dept.

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]

Google by biocute · 2005-12-07 08:45 · Score: 4, Interesting

Does Google have the technology to do this kind of scientific search yet?

If it does, it sure can save these researchers a lot of time; if it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.

--
Virtual Betting on Facebook for non-geeks.
The most obvious application by Billosaur · 2005-12-07 08:48 · Score: 5, Interesting

I see this as being a boon to SETI. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?

--
GetOuttaMySpace - The Anti-Social Network
Re:Numb3rs by Shadow Wrought · 2005-12-07 08:58 · Score: 4, Funny

A favorite quote, "Physicists see equations as a reflection of reality, Engineers see reality as a reflection of equations; Mathematicians have never made the connection."

--
If brevity is the soul of wit, then how does one explain Twitter?
Speaking of needle in a haystack ... by airrage · 2005-12-07 08:59 · Score: 5, Funny

Someone asked me to give ten different ways to find a needle in a haystack. These are my thoughts:

1) INDUSTRIAL MAGNET
2) BLIND LUCK
3) BURN THE HAY, PICK UP THE NEEDLE
4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10,000 WORKERS.
6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
7) SPREAD OUT THE HAY, HIRE BAREFOOT HAY WALKERS
8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
10) TRIAL AND ERROR - ONE PERSON

--
"This isn't a study in computer science, it's a study in human behavior"
Re:9...9...9...9... by flynt · 2005-12-07 09:04 · Score: 4, Insightful

Whether you "know" or not is always up for debate, but that's usually for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data, and then calculate the probability of seeing data at least as extreme as what you observed (the p-value), given your initial assumption. If this probability is very low (how low is also a matter of interpretation), you conclude your initial distributional assumption was incorrect. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a reductio ad absurdum in logic.
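
To make that recipe concrete, here is a minimal sketch that is not taken from the thread or the paper. It assumes a made-up experiment (16 heads in 20 coin flips), a fair-coin starting assumption, and the conventional but arbitrary 5% cutoff; the function name `binomial_p_value` is hypothetical.

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """One-sided p-value: probability of seeing at least `heads` heads
    in `flips` flips if the coin really has P(heads) = p_null."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


ALPHA = 0.05  # assumed cutoff for "very low"; this is a convention, not a law

# Hypothetical observation: 16 heads out of 20 flips.
p = binomial_p_value(heads=16, flips=20)
print(f"p-value under the fair-coin assumption: {p:.4f}")

if p < ALPHA:
    print("The data would be surprising for a fair coin, so reject that assumption.")
else:
    print("The data are compatible with a fair coin.")
```

The structure mirrors the reductio: assume the coin is fair, work out how improbable the observed result would be under that assumption, and drop the assumption only if the result is too improbable to accept.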
Re:Significant % of patterns in randomness by zex · 2005-12-07 09:54 · Score: 5, Informative

If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical level of 5% (even if the data is 100% random).

If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern.
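
A rough simulation of both halves of that claim, offered as a sketch under stated assumptions rather than anything from the paper: each of 1000 "patterns" is tested on pure Gaussian noise with a simple z-test, and the correction used is Bonferroni (dividing the 5% level by the number of tests), which is only one of several possible corrections. The seed, sample size, and function names are arbitrary choices for illustration.

```python
import math
import random


def two_sided_p_value(sample):
    """p-value for the hypothesis 'the true mean is 0',
    assuming the data are normal with unit variance (z-test)."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def count_hits(p_values, threshold):
    """Number of tests declared significant at the given threshold."""
    return sum(p < threshold for p in p_values)


rng = random.Random(2005)  # arbitrary seed so the run is repeatable
NUM_TESTS = 1000
SAMPLE_SIZE = 30
ALPHA = 0.05

# Every dataset is pure noise, so any "pattern" found is a false positive.
p_values = [
    two_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_TESTS)
]

uncorrected = count_hits(p_values, ALPHA)
bonferroni = count_hits(p_values, ALPHA / NUM_TESTS)

print(f"significant at 5% without correction: {uncorrected} of {NUM_TESTS}")
print(f"significant after Bonferroni correction: {bonferroni} of {NUM_TESTS}")
```

Without correction the count should land somewhere near 50, as the parent post says; with the corrected threshold of 0.00005 it should almost always be zero.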

You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to .... are more susceptible to brain cancer", then you're probably right. I too see those as examples of restricting the domain of samples until you find a pattern - but the pattern nonetheless exists.
As a particle physicist by Lord Byron II · 2005-12-07 10:48 · Score: 5, Interesting

As a particle physicist, I know exactly the kind of challenge that this is. The SNR is horrible, you've got tons of data, and the data may be distorted by all sorts of sources (background, misalignment, the wrong reaction, etc.).
I also know that these sorts of algorithms are created all of the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play". To be useful, they must be manually adjusted by a team with deep knowledge of the system.
So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.