Slashdot Mirror


Finding a Needle in a Haystack of Data

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]

29 of 173 comments (clear)

  1. Google by biocute · · Score: 4, Interesting

    Does Google have the technology to do this kind of scientific searches yet?

    If it does, it sure can save these researchers a lot of time; If it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.

    1. Re:Google by garcia · · Score: 2, Funny

      Does Google have the technology to do this kind of scientific searches yet?

      It's only in Beta thus it's not useful ;-)

    2. Re:Google by sapped · · Score: 2, Funny

      But can it find potential girlfriends for Slashdotters?

      Wow. There really are't any out there. Check it out on google yourselves.

      The same results come back in images, groups, news, etc. Man. What a sad bunch.

    3. Re:Google by zopf · · Score: 2, Funny

      Bah! Not even in Froogle... ;)

      --
      Did you see the pool? They flipped the bitch!
  2. The most obvious application by Billosaur · · Score: 5, Interesting

    I see this as being a boon to SETI. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?

    --
    GetOuttaMySpace - The Anti-Social Network
  3. Ya' know... by jacobcaz · · Score: 3, Funny

    82.67% of all statistics are made up anyway...

    1. Re:Ya' know... by saskboy · · Score: 2, Funny

      "82.67% of all statistics are made up anyway..."

      Well yeah, 50% of all statisticians finished in the bottom half of their class.

      --
      Saskboy's blog is good. 9 out of 10 dentists agree.
    2. Re:Ya' know... by $RANDOMLUSER · · Score: 2, Funny

      Jeez. How anal. You should take some time and count the flowers.

      --
      No folly is more costly than the folly of intolerant idealism. - Winston Churchill
  4. The Real Challenge is Further Off by AthenianGadfly · · Score: 3, Funny

    "But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people."

    When asked about more advanced applications for the technology, researchers replied it will probably be "quite a while" before the technology could be used for extremely high noise environments. Said one, "I mean, it's going to be a long time before we're up to finding finding useful comments on Slashdot or something."

  5. Now that's a change... by Havenwar · · Score: 2, Funny

    The Case team discovered a technique that is built on the principle of comparing a set of summary characteristics for any sub region of the observations with the background variation. From these characteristics, attempts are made to find small regions that appear significantly different from the background--a difference that cannot simply be attributed to random chance

    So, basically its the one search engine that can only find the words "horny teen nekkid" if it is NOT on a pr0n-page. I can see uses for that. Not for me, but I'm sure SOMEONE is interested in finding other kinds of pages once in a while.

  6. Re:Indexes by Husgaard · · Score: 2, Informative

    They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index.

  7. Case Western Reserve University by tomzyk · · Score: 3, Interesting

    FYI: Its abbreviation is not "CWRU" anymore. As of about 2 years ago, they changed it to simply "Case" and gave it the silly new logo of 2 paperclips stuck together.

    Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked a bunch of alumni (from the original Western Reserve University) too.

    Knowing is half the battle.

    --
    Karma: NaN
    1. Re:Case Western Reserve University by Anonymous Coward · · Score: 2, Funny

      Actually, its not two paper clips together. It's a fat man holding a surf board. Look for yourself

  8. Re:Numb3rs by Shadow+Wrought · · Score: 4, Funny

    A favorite quote, "Physicists see equations as a reflection of reality, Engineers see reality as a reflection of equations; Mathematicians have never made the connection."

    --
    If brevity is the soul of wit, then how does one explain Twitter?
  9. Speaking of needle in a haystack ... by airrage · · Score: 5, Funny

    Someone asked me to give ten different ways to find a needle in a haystack, these are my thoughts:

    1) INDUSTRIAL MAGNENT
    2) BLIND LUCK
    3) BURN THE HAY, PICK UP THE NEEDLE
    4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
    5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10000 OF WORKERS.
    6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
    7) SPREADOUT THE HAY, HIRE BAREFOOT HAY WALKERS
    8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
    9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
    10) TRIAL AND ERROR - ONE PERSON

    --
    "This isn't a study in computer science, its a study in human behavior"
  10. Re:9...9...9...9... by flynt · · Score: 4, Insightful

    Whether you "know" or not is always up for debate, but that's usually for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data, and then calculate a probability from the data you observed (the p-value) given your initial assumption. If this probability is very low (also an interpretation), you assume your initial distributional assumption was incorrect. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a reductio ad absurdem in logic.

  11. Re:Was it just me or was this story broken at firs by MarkGriz · · Score: 3, Funny

    "It just refused to load for me."

    Maybe your interest in the story was deemed statistically insignificant.

    --
    Beauty is in the eye of the beerholder.
  12. Mythbusters by everphilski · · Score: 2, Informative

    Mythbusters actually did an ep where they built two different needle-in-haystack finding machines, one actually did quite well...

    -everphilski-

  13. Re:1) INDUSTRIAL MAGNENT by Anne_Nonymous · · Score: 2, Funny

    1) INDUSTRIAL MAGNET

    DBAs everwhere are cringing and covering their data.

  14. Re:9...9...9...9... by Stonehand · · Score: 2, Insightful

    Not really.

    The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.

    To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text of Rissanen's first paper on MML, and... ?

    What the OP might want to look into is the notion of uncompressability, and perhaps Kolmogorov complexity. Of course, the latter is incomputable, but that's life.

    --
    Only the dead have seen the end of war.
  15. Significant % of patterns in randomness by G4from128k · · Score: 2, Informative
    Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried. If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical significance level of 5% (even if the data is 100% random). "Cancer clusters" are an excellent example of this -- if you slice a dice a population enough different ways you are bound to find some geographic/demographic/ethnographic subgroup with a very high chance of some cancer.


    Its better to either have a a priori hypothesis to look for one specific, pre-defined pattern in a mound data than to see if any pattern is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.

    --
    Two wrongs don't make a right, but three lefts do.
    1. Re:Significant % of patterns in randomness by zex · · Score: 5, Informative
      If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical level of 5% (even if the data is 100% random).


      If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern.

      You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to .... are more susceptible to brain cancer", then you're probably right. I too see those as examples of restricting the domain of samples until you find a pattern - but the pattern nonetheless exists.
  16. Regarding fraudulent transactions... by ahmusch · · Score: 2, Interesting

    Current fraud detection systems in use in the financial industry are based on two primary knowledge bases:

    1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.

    Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?

    2. A comparison of recent transactions. For example:

    A sudden wave of big-ticket purchases very close together in time, such as hitting a Best Buy the same day as buying jewelry.

    A single card making multiple high-value transactions (3 or more) within an hour.

    A pattern of unattended-auth-transaction (think pay-at-the pump) to big ticket purchase to unattended-auth and back.

    Using geometric statistical analysis could only complement pattern analysis in any case, and I fail to see how it's superior to the existing behavior scoring algorithms which are based on an individual's past history, weighting each new transaction to determine if it's "out of profile", and if so, by what margin. Sometimes the fraud is only revealed by several transactions scoring progressively higher on the fraud-o-meter, and I suspect the geometric statistic analysis would fail to trigger that as an event, as it would be a continuation of the pattern.

    My ability to read statistics papers is sadly out of date. Anyone want to give a shot at translating this into non-doctoral English?

  17. Hey, wait a minute! by $RANDOMLUSER · · Score: 3, Funny

    An article posted by Roland Piquepaille with no links back to his site???
    WTF? Roland? You feeling OK?

    --
    No folly is more costly than the folly of intolerant idealism. - Winston Churchill
  18. As a particle physicist by Lord+Byron+II · · Score: 5, Interesting
    As a particle physicist I know exactly the kind of challenge that this is. The SNR is horrible, you've got tons of data, and the data may be distorted by all sorts of sources (background, misalignment, the wrong reaction, etc).

    I also know that these sorts of algorithms are created all of the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play". They must be manually adjusted, by a team with a deep in-depth knowledge of the system in order to be useful.

    So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.

  19. Re:As a nervous system. by mako1138 · · Score: 2, Funny

    Here's a couple TB of data. Find me all the top quark candidates by tomorrow.

  20. I don't want to rain on the parade by martin-boundary · · Score: 3, Interesting
    I don't want to rain on the parade, but the result is quite possibly wrong.

    If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.

    I think their derivation is quite possibly wrong.

    1. Re:I don't want to rain on the parade by martin-boundary · · Score: 2, Insightful
      That's a good point. In the paper, the formula (2) is finite only if the tails of f dominate the tails of psi, so that means that f would have to be at least as fat tailed as the Cauchy. However, the paper doesn't attempt to state any assumptions, so it's hard to see which parts are solid and where there might be handwaving.

      Funnily enough, the density f they use in the monte carlo simulation appears to be truncated to be in the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in practice, they really do everything on the interval [0,2], and the psi they present isn't really a Cauchy in the first place.

      Oh well, rigour still isn't a strong point of physics ;)

    2. Re:I don't want to rain on the parade by jmtpi · · Score: 2, Insightful
      martin-boundary wrote:
      If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.

      They use a Breit-Wigner because that's often a realistic model of the signal distribution, when one is talking about resonance production in a particle physics experiment. (My copy is at work, but I know this is discussed, for example, in Sakurai's Modern Quantum Mechanics.) I don't think this paper nearly lived up to the press release, and certainly isn't germane to Slashdot, but I don't think the use of a BW has anything to do with it.

      On the other hand, I'm merely a particle physics grad student, and I didn't even attempt to read the center of the paper. If they really did come up with something that has more power than chi^2 (at least for an extremely simple fit) then that is notable. What would be really interesting would be for someone to come up with a real goodness-of-fit statistic for unbinned fits.