Finding a Needle in a Haystack of Data

Google by biocute · 2005-12-07 08:45 · Score: 4, Interesting

Does Google have the technology to do this kind of scientific search yet?

If it does, it could sure save these researchers a lot of time; if it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.

--
Virtual Betting on Facebook for non-geeks.

Re:Google by paulsgre · 2005-12-07 09:00 · Score: 1

But can it find potential girlfriends for Slashdotters? Now that's what I would call isolating a useful (and rare) signal buried in a large dataset. When I see THOSE results, I will be impressed. And if so, I've got a useful signal that could use some burying...
Re:Google by garcia · 2005-12-07 09:02 · Score: 2, Funny

Does Google have the technology to do this kind of scientific search yet?

It's only in Beta thus it's not useful ;-)
Re:Google by sapped · 2005-12-07 09:13 · Score: 2, Funny

But can it find potential girlfriends for Slashdotters?

Wow. There really aren't any out there. Check it out on Google yourselves.

The same results come back in images, groups, news, etc. Man. What a sad bunch.
Re:Google by X0563511 · 2005-12-07 12:57 · Score: 1

It's not that hard. Go outside. Talk to people. Listen (that's the important part).

Give it a few months and you will be surprised!

(Of course, get yourself in shape - not too hard. An hour 5 days out of the week on a cycle or treadmill will do the trick. Lay off the sugary and fatty snacks.)

I've even had propositions! THEY came to ME!

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Google by zopf · 2005-12-07 17:22 · Score: 2, Funny

Bah! Not even in Froogle... ;)

--
Did you see the pool? They flipped the bitch!
Re:Google by altan · 2005-12-08 02:13 · Score: 1

The kinds of signals being worked with are very different than what Google has built its technologies around. It has nothing to do with a "Google search" except on an absurdly metaphorical layer, and the fact that it uses elements of statistical science.

Here is a link to the paper: http://arxiv.org/pdf/physics/0505200

Was it just me or was this story broken at first? by Wisgary · 2005-12-07 08:45 · Score: 1

It just refused to load for me.

The most obvious application by Billosaur · 2005-12-07 08:48 · Score: 5, Interesting

I see this as being a boon to SETI. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?

--
GetOuttaMySpace - The Anti-Social Network

Re:The most obvious application by Life700MB · 2005-12-07 10:20 · Score: 1

Somebody please mod parent funny, as it is clearly a joke to expect the SETI team to do something to speed up calcs.

I don't know about the new BOINC-based client, but back in the day they were widely known for sending the same data again and again just to keep clients busy, and for accepting incredibly simple cheats as good results. Just do a quick Google search on the topics. I did, and I left the project very sad about the lost CPU hours.

--
Superb hosting 2400MB Storage, 120GB bandwidth, ssh, $7.95
Re:The most obvious application by SETIGuy · 2005-12-07 14:44 · Score: 1

I see this as being a boon to SETI.
I'll read the article tonight and find out if it's applicable and whether it's better than what we are using. In the SETI@home client processing we already take into account the anticipated form of the signal, so I'm not sure this buys us anything. In fact, other than the exact mathematical description as a multidimensional manifold, the text makes it appear that we're already using this technique in our searches for repeated pulses and signals matching the Gaussian profile of a signal drifting through the field of view. It wouldn't be the first time things like this were developed independently; however, I would have thought the technique was fairly obvious. I'd guess a search of early 20th century papers would show that this is an old technique.
In candidate determination our background isn't purely statistical, which complicates matters. (In other words, we see many real signals every day. They just happen to be very local and not at all interesting.)
It might be more applicable to Astropulse (our search for evaporating black holes where we will be looking for a deviation from a fairly uniform statistical background), but then again, our current method may be mathematically equivalent.
I'll check it out anyway.

--
Support SETI@home
Re:The most obvious application by gibodean · 2005-12-07 14:50 · Score: 1

"Not to say there isn't ET life out there, but until some evidence points to it, we may as well assume that the universe is empty except for us."

Hmm, well how would you go about testing that hypothesis that there's no life out there ? Or that there is life out there ?

Well, you could look into space !!!!!

Gee, wouldn't it be good if someone was doing that to provide the evidence one way or another ?

I've got no problem with SETI. I think they're going to fail to detect intelligent life, but as long as their aim is to determine if there's life and not "prove there is life", then it's all scientific.

Ya' know... by jacobcaz · 2005-12-07 08:49 · Score: 3, Funny

82.67% of all statistics are made up anyway...

Re:Ya' know... by saskboy · 2005-12-07 08:51 · Score: 2, Funny

"82.67% of all statistics are made up anyway..."

Well yeah, 50% of all statisticians finished in the bottom half of their class.

--
Saskboy's blog is good. 9 out of 10 dentists agree.
Re:Ya' know... by Funakoshi · 2005-12-07 09:26 · Score: 1

Very true. Also interesting is that 95% of men like to use statistics to seem more intelligent...
Re:Ya' know... by Tony+Hoyle · 2005-12-07 09:28 · Score: 1

Not necessarily... only works if there are an even number of statisticians, and if nobody scored the mean score.

e.g. if there are 100 statisticians, the mean score is 37 and 10 statisticians scored that, only 45% of statisticians are technically in the bottom half (and 45% in the top half). 10% are exactly in the middle.

You could say that the 10% are in both the bottom and top half... in which case 55% are in the bottom half and 55% are in the top half!!
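
The arithmetic in that example can be checked with a short Python sketch. The scores below are made up (45 people at 30, 10 at 37, 45 at 44) purely so that the mean comes out to 37 as in the post, and `split_shares` is a hypothetical helper name, not anything from the thread.

```python
from statistics import mean

def split_shares(scores):
    """Return the fractions of scores strictly below, exactly at,
    and strictly above the mean score."""
    center = mean(scores)
    n = len(scores)
    below = sum(s < center for s in scores) / n
    at = sum(s == center for s in scores) / n
    above = sum(s > center for s in scores) / n
    return below, at, above

# Hypothetical class of 100, constructed so the mean is exactly 37.
scores = [30] * 45 + [37] * 10 + [44] * 45
below, at, above = split_shares(scores)
print(f"below: {below:.0%}, at the mean: {at:.0%}, above: {above:.0%}")
```

Strictly speaking, "bottom half" is a statement about the median rather than the mean; in this symmetric made-up class the two coincide, which is why the 45/10/45 split works out.
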
Re:Ya' know... by $RANDOMLUSER · 2005-12-07 09:33 · Score: 2, Funny

Jeez. How anal. You should take some time and count the flowers.

--
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
Re:Ya' know... by saskboy · 2005-12-07 09:34 · Score: 1

"You could say that the 10% are in both the bottom and top half..."

I'm not sure now if a comment like that puts you in the top half, or bottom half. :-P

--
Saskboy's blog is good. 9 out of 10 dentists agree.

Sounds useful. by RandoX · 2005-12-07 08:49 · Score: 1, Funny

I can't even find my keys some days.

Re:Sounds useful. by TheComputerMutt.ca · 2005-12-07 09:06 · Score: 1

And how would this help you with your inability to locate them?

It wouldn't, that's how.
Re:Sounds useful. by tzot · 2005-12-07 09:39 · Score: 1

I can't even find my keys some days.
Really?

--
I speak England very best

The Real Challenge is Further Off by AthenianGadfly · 2005-12-07 08:50 · Score: 3, Funny

"But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people."

When asked about more advanced applications for the technology, researchers replied it will probably be "quite a while" before the technology could be used for extremely high noise environments. Said one, "I mean, it's going to be a long time before we're up to finding useful comments on Slashdot or something."

Numb3rs by vanyel · 2005-12-07 08:51 · Score: 1, Funny

Sounds like they've been watching Numb3rs ;-)

Re:Numb3rs by Shadow+Wrought · 2005-12-07 08:58 · Score: 4, Funny

A favorite quote, "Physicists see equations as a reflection of reality, Engineers see reality as a reflection of equations; Mathematicians have never made the connection."

--
If brevity is the soul of wit, then how does one explain Twitter?

Obligatory by drewzhrodague · 2005-12-07 08:52 · Score: 1

"What does god want with a starship?" -Spock

--
Zhrodague.net - I do projects and stuff too.

Re:Obligatory by RetroGeek · 2005-12-07 08:59 · Score: 1

Kirk said this, not Spock

--

- - - - - - - - - - -
I am a programmer. I am paid to produce syntax not grammar. Deal with it.
Re:Obligatory by drewzhrodague · 2005-12-07 09:17 · Score: 1

Oops! And here I thought I was Spock -- er, spot-on. =_)

--
Zhrodague.net - I do projects and stuff too.
Re:Obligatory by RadioD00d · 2005-12-07 09:37 · Score: 1

Nope - it was McCoy
Re:Obligatory by RetroGeek · 2005-12-07 10:52 · Score: 1

Um, I hate to argue with you, but it was Kirk

--

- - - - - - - - - - -
I am a programmer. I am paid to produce syntax not grammar. Deal with it.
Re:Obligatory by fiannaFailMan · 2005-12-07 11:24 · Score: 1

Kirk said it, then McCoy asked him what he was doing and said "you don't ask the almighty for his ID!"

--
Drill baby drill - on Mars

Now that's a change... by Havenwar · 2005-12-07 08:53 · Score: 2, Funny

The Case team discovered a technique that is built on the principle of comparing a set of summary characteristics for any sub region of the observations with the background variation. From these characteristics, attempts are made to find small regions that appear significantly different from the background--a difference that cannot simply be attributed to random chance

So, basically it's the one search engine that can only find the words "horny teen nekkid" if it is NOT on a pr0n-page. I can see uses for that. Not for me, but I'm sure SOMEONE is interested in finding other kinds of pages once in a while.
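
For anyone wondering what "comparing summary characteristics of a sub region with the background variation" might look like in practice, here is a minimal Python sketch. It is NOT the Case team's method (the quoted summary gives no detail); it assumes a 1-D series, uses the window mean as the only summary characteristic, and the names `flag_windows`, `width` and `threshold` are invented for illustration.

```python
import random
from statistics import mean, stdev

def flag_windows(samples, width, threshold=5.0):
    """Return start indices of windows whose mean differs from the overall
    background mean by more than `threshold` standard errors."""
    background_mean = mean(samples)
    background_sd = stdev(samples)
    # Standard error of the mean of `width` independent background samples.
    standard_error = background_sd / width ** 0.5
    flagged = []
    for start in range(len(samples) - width + 1):
        window_mean = mean(samples[start:start + width])
        z_score = (window_mean - background_mean) / standard_error
        if abs(z_score) > threshold:
            flagged.append(start)
    return flagged

# Assumed toy data: Gaussian noise with a small bump hidden at 1000..1019.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]
for i in range(1000, 1020):
    samples[i] += 3.0
print(flag_windows(samples, width=20))
```

The threshold is deliberately strict because thousands of overlapping windows get tested, and a loose cutoff would flag plenty of windows by chance alone.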

Re:Now that's a change... by Havenwar · 2005-12-07 09:18 · Score: 1

So the first job of this marvelous search engine statistical method... is to find the one person in a bunch of perverts that would appreciate the results.

Jack Thompson.

See, didn't need a search engine. And why, pray tell, would he want to find the one page that said "nekkid teen sexy" and wasn't a pr0n page? Oh, to press charges of course. Pages like that corrupt our young! Never mind the real pr0n, that's so... out there. It's the one page that mentions a single naughty word that's in for the trouble!

PDF Warning? by Anonymous Coward · 2005-12-07 08:54 · Score: 1

Why do we need to be warned that it's a PDF? I can understand an "MS Word Warning" but PDF is platform independent. What's wrong with PDF?

Re:PDF Warning? by tzot · 2005-12-07 09:46 · Score: 1

PDF documents are not handled directly by your browser.

--
I speak England very best
Re:PDF Warning? by rco3 · 2005-12-07 11:59 · Score: 1

"PDF documents are not handled directly by your browser."

They most certainly are. Of course, I'm not using IE or FF. In any case, that doesn't really justify the warning.

--

Ce n'est pas un vrai mouvement de robot!
Re:PDF Warning? by flood6 · 2005-12-08 05:29 · Score: 1

My problem with them is that one of my work PCs is very old but still fine for browsing the internet. Clicking on a link on this machine that I did not realize was a PDF sets off a long and tedious series of about 3 minutes where FF locks up until the Acrobat Reader plugin loads, then it downloads and displays the PDF, then scrolling through the file itself is really jumpy, then I have to close it which is slow and sometimes crashes FF.
Even on my faster PCs, reading a large PDF feels slower than it should.

--
SEO Firefox Extension
Re:PDF Warning? by bogado · 2005-12-09 07:10 · Score: 1

Why don't you uninstall the plugin on this machine then? It seems to me that the plugin is useless since you're never going to want to use it anyway. Keep the reader as a standalone app, so you can still view the PDFs if you want to.

--
[]'s Victor Bogado da Silva Lins
^[:wq

Re:Indexes by Husgaard · 2005-12-07 08:55 · Score: 2, Informative

They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index.

9...9...9...9... by r3adah3ad · 2005-12-07 08:56 · Score: 1

"...a difference that cannot simply be attributed to random chance..." If it's random, how do you know?

Re:9...9...9...9... by chill · 2005-12-07 09:03 · Score: 1

Random has NO pattern whatsoever. Detecting a pattern, however small, implies non-random data. QED

-Charles

--
Learning HOW to think is more important than learning WHAT to think.
Re:9...9...9...9... by flynt · 2005-12-07 09:04 · Score: 4, Insightful

Whether you "know" or not is always up for debate, but that's usually for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data, and then calculate the probability (the p-value) of seeing data at least as extreme as what you observed, given your initial assumption. If this probability is very low (what counts as "low" is also a matter of interpretation), you conclude your initial distributional assumption was incorrect. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a reductio ad absurdum in logic.
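
A small worked example of that recipe, as a hedged Python sketch: the distributional assumption is "the coin is fair" (binomial with p = 0.5), and the p-value is taken as the total probability of every outcome no more likely than the observed one. That is one common convention for a two-sided exact test, not the only one, and `binomial_p_value` is a name made up for this post.

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """Two-sided exact p-value for observing `heads` in `flips` tosses,
    assuming each toss lands heads with probability `p`."""
    def pmf(k):
        return comb(flips, k) * p ** k * (1 - p) ** (flips - k)

    observed = pmf(heads)
    # Sum the probability of every outcome at most as likely as the observed
    # one; the tiny tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(flips + 1)
               if pmf(k) <= observed * (1 + 1e-9))

# 60 heads in 100 flips: is "fair coin" still a tenable assumption?
print(binomial_p_value(60, 100))
```

Whether the number printed is "low enough" to throw out the fair-coin assumption is exactly the interpretation step described above.
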
Re:9...9...9...9... by Stonehand · 2005-12-07 09:27 · Score: 2, Insightful

Not really.

The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.

To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text of Rissanen's first paper on MDL, and... ?

What the OP might want to look into is the notion of incompressibility, and perhaps Kolmogorov complexity. Of course, the latter is incomputable, but that's life.

--
Only the dead have seen the end of war.
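
Since Kolmogorov complexity itself cannot be computed, a rough and admittedly crude stand-in is the output size of an ordinary compressor, which only ever gives an upper bound. The Python sketch below uses the standard `zlib` module that way; `compression_ratio` and the two sample inputs are illustrative choices, not anything from the paper.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more structure)."""
    return len(zlib.compress(data, 9)) / len(data)

patterned = b"HT" * 5000          # a coin that exactly alternates
random_ish = os.urandom(10000)    # bytes from the OS random source

print("alternating:", compression_ratio(patterned))
print("os.urandom: ", compression_ratio(random_ish))
```

The alternating sequence shrinks to a tiny fraction of its size, while the random bytes do not shrink at all (the ratio lands slightly above 1 because of format overhead). A compressor failing to shrink something does not prove it is random; it only means this particular compressor found no pattern.
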
Re:9...9...9...9... by CoolVibe · 2005-12-07 09:35 · Score: 1

If you have an infinite amount of random data, every pattern will be in there somewhere. At least, that's what I was led to believe.
Re:9...9...9...9... by chill · 2005-12-07 10:10 · Score: 1

If you have an infinite amount of random data, every pattern will be in there somewhere. At least, that's what I was led to believe.

Yes, but only if you look at smaller segments, which changes your dataset. For example, if you spot the first 30 digits of Pi in an infinitely random set, the question becomes is your random set Pi? If not, the pattern only applies to those 30 digits and thus your set changes and is no longer the infinite set of random data.

And they aren't dealing with an "infinite" set, but a smaller subset. Thus, the odds of finding the collected works of Shakespeare are significantly smaller.

-Charles

--
Learning HOW to think is more important than learning WHAT to think.
Re:9...9...9...9... by vux984 · 2005-12-07 10:12 · Score: 1

What about infinitely long patterns?
Re:9...9...9...9... by CoolVibe · 2005-12-07 11:05 · Score: 1

They're in there too. Infinity is tricky.

If you have an infinite amount of hay, and throw in an infinite amount of needles, you'll still be spending a lot of time finding the needles. :)
Re:9...9...9...9... by vux984 · 2005-12-07 13:22 · Score: 1

While I agree things do get tricky with infinity, I disagree with your conclusion that they will be in there too.

As the random string grows to infinity the probability of any particular finite string being found within the random one grows to 100%.

However the probability of a particular infinitely long string being found does not.

You'll have to forgive my rusty mathematics...

Let R be a single infinitely long string of random numbers. Let R[i] be the ith character of R, let the interval R[i..j] be a finite substring of R, and let R[i..infinity] be an infinitely long substring beginning at R[i] (let's shorthand this as R[i*]).

Clearly there are infinitely many infinite substrings within R, but that infinity is countable; in other words, for any infinitely long substring in R there exists some integer x such that R[x*] identifies it.

Let's also define L as the set of all infinitely long strings.

The question: is L a subset of R[i*]?

The answer is no. Although they are both infinite, |L| > |R[i*]|

The cardinality of L is greater than R[i*], L is uncountably infinite while R[i*] is countably infinite therefore L can't be a subset of R[i*].

I loosely demonstrated the countability of R[i*] by showing how you could enumerate them. I think it's pretty clear that R[i*] is a countable infinity.

Proving the set of infinitely long strings is uncountable is also fairly easy, and can be done with Cantor's diagonal method.

That is (in brief): Asserting that they are countable, enumerating them (based on the assumption that they are countable), and then constructing a string that isn't in the enumeration by traversing the enumeration on the diagonal. Since the construction isn't in the set you have a contradiction and can conclude that the set isn't countable. (For more info, see the Diagonal Method.)

I could be mistaken, and my math is rusty... but I think I'm on solid ground.
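
The diagonal construction is easy to see on a finite prefix. This Python sketch only illustrates the idea on a small square of bits (a real proof needs the infinite enumeration), and `diagonal_escape` is a hypothetical name.

```python
def diagonal_escape(rows):
    """Given n binary strings of length >= n, build an n-bit string that
    differs from the i-th row in position i."""
    return "".join("1" if row[i] == "0" else "0" for i, row in enumerate(rows))

# A made-up "enumeration": the first four bits of four strings.
rows = ["0000", "0101", "1111", "1010"]
escaped = diagonal_escape(rows)
print(escaped, escaped in rows)
```

Whatever list you supply, the constructed string disagrees with row i at position i, so it can never appear in the list. Extending the same rule down an infinite enumeration gives the contradiction described above.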

Seti by jurt1235 · 2005-12-07 08:57 · Score: 1

Also the first "useful" application for this kind of technique that popped into my head. Actually, the process in my head that made this one item pop up is maybe useful too (-: Lots of random data, and this one is being associated with the article.

--

My wife's sketchblog Blob[p]: Gastrono-me

Case Western Reserve University by tomzyk · 2005-12-07 08:57 · Score: 3, Interesting

FYI: Its abbreviation is not "CWRU" anymore. As of about 2 years ago, they changed it to simply "Case" and gave it the silly new logo of 2 paperclips stuck together.

Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked off a bunch of alumni (from the original Western Reserve University) too.

Knowing is half the battle.

--
Karma: NaN

Re:Case Western Reserve University by Manhigh · 2005-12-07 09:02 · Score: 1

The name of the school is still Case Western Reserve University.

Despite the fact that it's OK to officially call it 'Case' now (it wasn't OK to do so in '97), CWRU is still a valid abbreviation. Plus I paid so much money to that place that I'll call it whatever I damn well please.

- '02

--
"Open the pod by doors, Hal" > "I'm afraid I can't do that, Dave" sudo "Open the pod bay doors, Hal" > alright
Re:Case Western Reserve University by Anonymous Coward · 2005-12-07 09:03 · Score: 2, Funny

Actually, it's not two paper clips together. It's a fat man holding a surf board. Look for yourself.
Re:Case Western Reserve University by ThosLives · 2005-12-07 09:54 · Score: 1

I have to say, I'm glad that my alma mater (Case School of Engineering, 2000) is actually still doing real science. I'm kind of disappointed at all the folks above who posted about "finding useful information in the noise of internet information" though; that type of information gathering is not quite the same as discerning between special-cause and random-cause fluctuations in a signal (mostly because the Internet consists mostly of special-cause variation: i.e., things people have written or created). Distinguishing between two different pieces of non-random data is vastly different than picking up non-random from random.
Incidentally, I don't mind the switch to Case from CWRU (you would not believe how many people asked if Case Western Reserve University was a military school - I guess they forgot to teach people about the Western Reserve Territory in elementary school). The name change is nothing compared to the Peter B. Lewis building...

--
"There are a dozen opinions on a matter until you know the truth. Then there is only one." - CS Lewis (paraphrase)
Re:Case Western Reserve University by Dachannien · 2005-12-07 10:39 · Score: 1

More on the logo.

The true offense in the OP was calling it "Case Western". It's not a "reserve university", whatever that means.

I've always just called it "Case" since I started there as an undergraduate in 1994, while my e-mail address still contains cwru.edu. Both of those are used now - "Case" just validates the fact that most people really get tired of saying the whole name over and over again.
Re:Case Western Reserve University by regen · 2005-12-08 07:48 · Score: 1

You don't happen to be Tom Zak do you?

--
The Economics of Website Security

Speaking of needle in a haystack ... by airrage · 2005-12-07 08:59 · Score: 5, Funny

Someone asked me to give ten different ways to find a needle in a haystack, these are my thoughts:

1) INDUSTRIAL MAGNET
2) BLIND LUCK
3) BURN THE HAY, PICK UP THE NEEDLE
4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10,000 WORKERS.
6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
7) SPREAD OUT THE HAY, HIRE BAREFOOT HAY WALKERS
8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
10) TRIAL AND ERROR - ONE PERSON

--
"This isn't a study in computer science, its a study in human behavior"

Re:Speaking of needle in a haystack ... by yapplejax · 2005-12-07 09:48 · Score: 1

The needle won't necessarily sink. I recall experiments floating a needle on top of water because it wasn't heavy enough to break the surface tension.
Re:Speaking of needle in a haystack ... by pherthyl · 2005-12-07 09:55 · Score: 1

I've got another one..

11) LET COWS EAT THE HAY, DISSECT DEAD COW

lameness filter blah
Re:Speaking of needle in a haystack ... by Lehk228 · 2005-12-07 10:16 · Score: 1

That only works when you gently place the needle on the surface. When you plop it in with an assload of hay it will sink.

The problem is you have to churn the hay so the needle won't get stuck to the floating hay.

--
Snowden and Manning are heroes.
Re:Speaking of needle in a haystack ... by iamlucky13 · 2005-12-07 13:09 · Score: 1

Ok, time for another one of iamlucky13's little-known redneck nerd facts

Category: Cattle
Entry: 1097
Ranchers will commonly intentionally force feed a smooth magnet to calves. Because of its weight, it will remain in the rumen or reticulum (the 1st and 2nd stomach compartments, respectively) for the life of the cow. Fields often have stray bits of metal small enough to be accidentally ingested while grazing, such as barbed wire bits, fence staples, screws, etc. When stuck on the magnet, the pieces are effectively immobilized and prevented from damaging the stomach and possibly killing the cow. Eating a needle probably won't kill the cow, but after it dies of old age (or bullets), you can retrieve the magnet and the needle.

Now y'all know.

Re:Indexes by CastrTroy · 2005-12-07 08:59 · Score: 1

But that's the trick. Finding a good way to index the data.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.

Re:Indexes by Marko+DeBeeste · 2005-12-07 09:00 · Score: 1

If we had ham, we could have ham and eggs. If we had eggs.

--
Faith: n. -- That human impulse that drives them to steal appliances when the power goes out

To find a signal in a sea of noise... by San+Francisco · 2005-12-07 09:01 · Score: 1

Perhaps this technology can make Usenet useful once again.

Maybe Slashdot can use it to find dupes by kk49 · 2005-12-07 09:04 · Score: 1

End
Of
Message

--
You can have your god back when you are old enough to handle the responsibility.

Re:Maybe Slashdot can use it to find dupes by UMEngin · 2005-12-07 09:13 · Score: 1

That would be like finding the hay in the haystack.

Re:Was it just me or was this story broken at firs by MarkGriz · 2005-12-07 09:07 · Score: 3, Funny

"It just refused to load for me."

Maybe your interest in the story was deemed statistically insignificant.

--
Beauty is in the eye of the beerholder.

Re:Why is technology starting to... by sonofagunn · 2005-12-07 09:08 · Score: 1

It means you're getting old!

Maybe Slashdot can use it to find dupes by Bloggins · 2005-12-07 09:09 · Score: 1

End of Message

Mythbusters by everphilski · 2005-12-07 09:10 · Score: 2, Informative

Mythbusters actually did an ep where they built two different needle-in-haystack finding machines, one actually did quite well...

-everphilski-

Re:Mythbusters by Tony+Hoyle · 2005-12-07 09:34 · Score: 1

Their solutions were kinda destructive though.

I'd like to see a way of finding a needle in a haystack that left you with a (largely) intact haystack afterwards, not a pile of ash or a wet sludge.

Huge inductive coils would be a good start... probably wouldn't find the bone one though - maybe some kind of MRI?
Re:Mythbusters by pipingguy · 2005-12-09 08:57 · Score: 1

Was that the one where Kari(sp?) mugged and did things for the camera in the cutesy, girly way? Stupid me, that pretty much describes every episode she's been in.

Scotty was a no-bullshit welder (and very attractive, to boot). Bring her back, she's a *real* babe.

Re:Was it just me or was this story broken at firs by Wisgary · 2005-12-07 09:12 · Score: 1

So... I'm just another piece of hay... :(

Re:But will it help me... by stinerman · 2005-12-07 09:13 · Score: 1

I know the warranty will be void if you shave off the pubic hair yourself (intentional damage to the product), but you might want to try it anyway. Buy the hairless variety next time and you should be in good shape.

Re:Roland Alert by Anonymous Coward · 2005-12-07 09:14 · Score: 1, Funny

Where have you been?
Don't you know the editors are in cahoots with the Beatles Beatles guy now?
Please, try to keep up with the conspiracy theories, mkay? Jeez!

SETI? by ruiner13 · 2005-12-07 09:15 · Score: 1

Would this be useful to reduce the computations needed for the SETI@Home folks too? Seems they have a bit of data to sort through... Hell, genetic engineering too. Look for useful patterns in hundreds of DNA strands.

--

today is spelling optional day.

Re:1) INDUSTRIAL MAGNENT by Anne_Nonymous · 2005-12-07 09:18 · Score: 2, Funny

1) INDUSTRIAL MAGNET

DBAs everywhere are cringing and covering their data.

You ever thought [command]-[F] while... by atrocious+cowpat · 2005-12-07 09:21 · Score: 1

.. looking for stuff on your [real world] desktop?

I have, have actually had my arm and fingers twitching for the keyboard...

I think I need a major vacation soon, somewhere with no IT-devices whatsoever.

a.c.

--
sig? Oh, that sig...

Mythbusters did this... by slashname3 · 2005-12-07 09:28 · Score: 1

Mythbusters did this one already. They built two machines/processes to find needles in haystacks. One used a process to burn away the hay leaving the needles and the other used magnets and gravity to separate the needles from the hay.

Oh, wait. They're talking about data. Never mind.

I hope YOU know that ... by dazey · 2005-12-07 09:29 · Score: 1

from the moment you posted that comment, the value you gave increased just a little bit more ...

We are at the horizon of a cultural singularity... by Errandboy+of+Doom · 2005-12-07 09:30 · Score: 1

THE SINGULARITY

Throughout history, we championed the content creator. Only a tiny fraction of the population could write or understand math or science. Only a tiny fraction could dedicate themselves to the arts.

Most individuals' time was consumed by being agrarian generalists: they owned a farm, and they were constantly occupied by all the repairs and maintenance of their property. It wasn't a job, it was a way of life. But now, more and more, our economy makes us all incredible specialists. We're confined not only to a literal cubicle, but to a cubicle of tasks, often only seeing one tiny part of our contribution to social welfare. But as a result, we end up with leisure time. (Cf. Judge Skelly Wright's opinion in Javins v. First National Realty Corporation). While those reading /. while at work might quibble, the fact is that we all now have meaningful leisure time in some sense, we're not dedicated 100% to our livelihood.

In addition, current technology is allowing us to collaborate and share information as a global community like it never has before.

What does all this mean? For one, it means that techies can have bands, and even get national coverage, without giving up their day jobs. In fact, if MySpace is any evidence, anyone can have a band... and a lot of us already do. Also, given that 80,000 blogs are created each day (though 40,000 are probably also abandoned each day), huge throngs of people have something to say and are able to say it to huge, unrelated throngs of people.

The singularity is similar to the way other areas of economics have evolved. It used to be that 90% of the population made 100% of the food, and now only 10% of the population provides 100% of the food. It's the opposite for art and science (naturally, as we're freed from producing necessities, we can devote more time to producing luxuries, improving general quality of life, and solving more complex problems). Traditionally, 1% of the population made all the cultural content. The singularity? Soon, 99% of the population will be making 100% of the content.

For the first time in history, we are the captains not only of our personal destiny, but of our cultural destiny. However, as cultural creativity becomes so democratized, our contribution will become less and less controlling. Like Warhol said, it's not that we're all going to be famous, it's that we each only get 15 minutes.

THE DOWNSIDE OF A CULTURE OF CREATIVES, AND A SILVER LINING FOR SEARCH

A professor once said to me, "No one cares how much you know anymore, that's why we have the Internet. The important thing is creating new ideas." The formidable aspect of the new society of cultural creatives is that soon, no one will really need you to create ideas anymore either. Your drop in the cultural bucket is less and less meaningful every day. Content is easier and easier to make and share, and everyone wants to play, so as a corollary, it will become harder and harder to find compensation as a cultural creative.

So what's the new valuable thing, in this storm of data/content? Maybe not making worthwhile contributions to the arts, science, knowledge, (which is important, but self sustaining). However, finding the worthwhile signal amidst all cultural noise is becoming more and more valuable. Someone needs to be a sieve for all the content being thrown around right now. Technologies of search and sort are the ways to do it. Google is not prospering because it learned something about advertising. Google is prospering because it precociously encapsulates the spirit of the dawning age, while most of us are still trying to figure out just what the hell I'm talking about.

Significant % of patterns in randomness by G4from128k · 2005-12-07 09:32 · Score: 2, Informative

Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried. If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical significance level of 5% (even if the data is 100% random). "Cancer clusters" are an excellent example of this -- if you slice and dice a population enough different ways you are bound to find some geographic/demographic/ethnographic subgroup with a very high chance of some cancer.

It's better to have an a priori hypothesis and look for one specific, pre-defined pattern in a mound of data than to see if any pattern is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.
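
To put rough numbers on that, here is a minimal Python sketch (my own illustration, not anything from the paper). It assumes each of the thousand tests yields a p-value that is uniform on [0, 1] when the data is pure noise, and it uses the Bonferroni correction as one simple way of raising the bar; the function name and defaults are made up for the example.

```python
import random

def count_false_positives(num_tests=1000, alpha=0.05, seed=0):
    """Simulate p-values for pure-noise data and count spurious 'discoveries'."""
    rng = random.Random(seed)
    # Under the null hypothesis a p-value is uniformly distributed on [0, 1].
    p_values = [rng.random() for _ in range(num_tests)]
    # Naive approach: call anything below alpha a significant pattern.
    uncorrected = sum(p < alpha for p in p_values)
    # Bonferroni: divide the significance level by the number of tests tried.
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return uncorrected, bonferroni

if __name__ == "__main__":
    uncorrected, bonferroni = count_false_positives()
    print("uncorrected hits:", uncorrected)
    print("Bonferroni hits: ", bonferroni)
```

The uncorrected count should land somewhere around 50, while the corrected count will usually be zero. Bonferroni is conservative; it is only one of several ways to make the threshold "correspondingly higher".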

--
Two wrongs don't make a right, but three lefts do.

Re:Significant % of patterns in randomness by zex · 2005-12-07 09:54 · Score: 5, Informative

If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical level of 5% (even if the data is 100% random).

If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern.

You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to .... are more susceptible to brain cancer", then you're probably right. I too see those as examples of restricting the domain of samples until you find a pattern - but the pattern nonetheless exists.
Re:Significant % of patterns in randomness by hackstraw · 2005-12-07 10:00 · Score: 1

Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried.

No, God put the figure of Jesus in the sky, but made it not look too much like Jesus just to test the difference between the believers and non-believers. Trust me, it was not easy to do all that with nobody looking.

SETI? by Nom+du+Keyboard · 2005-12-07 09:33 · Score: 1

SETI?

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."

Regarding fraudulent transactions... by ahmusch · 2005-12-07 09:33 · Score: 2, Interesting

Current fraud detection systems in use in the financial industry are based on two primary knowledge bases:

1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.

Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?

2. A comparison of recent transactions. For example:

A sudden wave of big-ticket purchases very close together in time, such as hitting a Best Buy the same day as buying jewelry.

A single card making multiple high-value transactions (3 or more) within an hour.

A pattern of unattended-auth transaction (think pay-at-the-pump) to big-ticket purchase to unattended-auth and back.

Using geometric statistical analysis could only complement pattern analysis in any case, and I fail to see how it's superior to the existing behavior scoring algorithms, which are based on an individual's past history and weight each new transaction to determine if it's "out of profile", and if so, by what margin. Sometimes the fraud is only revealed by several transactions scoring progressively higher on the fraud-o-meter, and I suspect the geometric statistical analysis would fail to trigger on that as an event, as it would be a continuation of the pattern.
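
As a toy illustration of the "3 or more high-value transactions within an hour" rule mentioned above, here is a hedged Python sketch. It is not how any real fraud system works; the function name, the 500.0 "high value" cutoff, and the (timestamp, amount) tuple layout are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

def flag_velocity(transactions, high_value=500.0, max_count=3,
                  window=timedelta(hours=1)):
    """Return True if `max_count` or more high-value purchases fall within `window`.

    `transactions` is a list of (timestamp, amount) tuples for a single card.
    The 500.0 threshold is a hypothetical value, not an industry figure.
    """
    times = sorted(t for t, amount in transactions if amount >= high_value)
    # Check every run of `max_count` consecutive high-value purchases.
    for i in range(len(times) - max_count + 1):
        if times[i + max_count - 1] - times[i] <= window:
            return True
    return False

if __name__ == "__main__":
    base = datetime(2005, 12, 7, 9, 0)
    burst = [(base, 900.0),
             (base + timedelta(minutes=20), 1200.0),
             (base + timedelta(minutes=45), 650.0)]
    spread = [(base, 900.0),
              (base + timedelta(hours=5), 1200.0),
              (base + timedelta(hours=10), 650.0)]
    print(flag_velocity(burst))   # True: three big purchases in 45 minutes
    print(flag_velocity(spread))  # False: same purchases spread over 10 hours
```

A real behavior-scoring system would presumably weight this against the cardholder's own history rather than use a fixed cutoff, which is the point being made above.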

My ability to read statistics papers is sadly out of date. Anyone want to give a shot at translating this into non-doctoral English?

ITagging by Anonymous Coward · 2005-12-07 09:33 · Score: 1, Funny

"What does God use to tag a galaxy with though?"

Are you telling us there's such a thing as Intelligent Tagging?

Hey, wait a minute! by $RANDOMLUSER · 2005-12-07 09:40 · Score: 3, Funny

An article posted by Roland Piquepaille with no links back to his site???
WTF? Roland? You feeling OK?

--
No folly is more costly than the folly of intolerant idealism. - Winston Churchill

Re:Hey, wait a minute! by Dachannien · 2005-12-07 10:44 · Score: 1

His name links to his website, so he still gets the pagerank boost. Beatles-Beatles does the same thing, and ScuttleMonkey the Sock Puppet posts his stories, too.

Hope there are no Jedis around by LostBurner · 2005-12-07 09:46 · Score: 1

"This is not the signal you're looking for..." Hope the signals they're looking for don't come accompanied with Jedis. Or maybe all the chaos is because of Jedi obfuscation of real signals?

Re:The most obvious application (way OT) by blackcoot · 2005-12-07 09:49 · Score: 1

it's been a while since i last did much perl, but shouldn't the last line of your sig be:

($world = $world) =~ s/bad/good/g;

otherwise you're making your world better but not ever doing anything with it...

I wonder if they were inspired by... by sciscitor · 2005-12-07 09:59 · Score: 1

The paper by Jeremy Stribling, Daniel Aguayo and Maxwell Krohn:

Rooter: A Methodology for the Typical Unification of Access Points and Redundancy

Definitely some interesting parallels between the two. Maybe someone who understands this stuff better could elaborate?

I think the best solution might already exist by JeffHunt · 2005-12-07 10:18 · Score: 1

Don't we already have regular expressions for this kind of stuff?

(ducks)

--

"It was hell!" recalls former child.

Re:Indexes by SatanicPuppy · 2005-12-07 10:21 · Score: 1

I don't see that there would be any point in indexing it... In an index you're atomizing it down to its individual meaningless parts. Each and every part is therefore solitary in an index, and cannot be related to any other part of the index in a meaningful way, because all the other parts are equally unrelated to anything and meaningless as well.

It would be more useful to transform the apparently random data in some way so as to make signals or discrepancies buried in it obvious. There are all kinds of funky methods you can apply to random-seeming data to pull out interesting factoids.

--
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.

Monte Carlo experimental results? by nycbicyclist · 2005-12-07 10:23 · Score: 1

From TFA (emphasis added): "We propose a new test statistic based on a score process for determining the statistical significance of a putative signal that may be a small perturbation to a noisy experimental background.... We illustrate the technique in the context of a model problem from high-energy particle physics. Monte Carlo experimental results confirm that the score test results in a significantly improved rate of signal detection." Monte Carlo experimental results? So much for the betterment of mankind! These guys are just out to make a killing at the roulette table!

Re:Monte Carlo experimental results? by jim_deane · 2005-12-07 13:40 · Score: 1

If it isn't monte carlo, it's the "random" (drunkard's) walk.

You know, you GO into computational physics thinking it's all casinos and drinking, and this is what you find...

Re:We are at the horizon of a cultural singularity by fishybell · 2005-12-07 10:23 · Score: 1

shut your pie hole

--
><));>

in the PDF by recharged95 · 2005-12-07 10:30 · Score: 1

So, are these guys basically saying that to find "the needle", just "turn up the noise"? Hence, look at the noise patterns, then mask them out to get the key value(s)?

Relationship to other information theory concepts? by Chilltowner · 2005-12-07 10:45 · Score: 1

Sort of a dilettante question, but I've been researching using entropy and information gain here at work and some of what they're talking about in the article and the paper seems familiar, though I'm not skilled enough in stats yet to make much out of it. It seems to me to be fairly similar to how you get an information gain score. If you can classify the background as such, you should be able to sift through data with however many parameters you want and find the parameters that cause the greatest difference in how "un-random" that sample is.

So, just so I can get a foothold on this new stuff technically, is the idea that the data they have isn't able to be classified yet? Am I getting ahead in the analysis in thinking about information gain by assuming an existing classification that differentiates signal and noise? Producing IG scores is more about WHY classified data points are different and not WHICH data points are significantly different from the background, right? Maybe I'm thinking too much in terms of data mining and producing a decision tree. Maybe I have it exactly backwards: assuming you already know which parameters (and at what thresholds) are significant, does the Case Western process produce the classification of data?
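
For anyone following along, this is roughly what I mean by an information gain score, as a minimal Python sketch. It assumes the data points are already labeled "signal" or "noise" (which is exactly the assumption I'm unsure about above) and has nothing to do with the method in the paper; the function names and toy data are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting one parameter at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing, so nothing is gained
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

if __name__ == "__main__":
    values = [1, 2, 3, 4]                          # hypothetical parameter readings
    labels = ["noise", "noise", "signal", "signal"]
    for threshold in (1, 2, 3):
        print(threshold, information_gain(values, labels, threshold))
```

In this toy data, the split at 2 separates the two classes perfectly and gets the full 1 bit of gain; the other thresholds get less. That tells you which parameter and cutoff explain an existing labeling, not which points are unusual in the first place.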

Sorry, I'm sort of thinking out loud here. Just wondering if there's a geek who can set me straight on this--my grasp of information theory is cobbled together from a bunch of google searches and wikipedia pages.

As a particle physicist by Lord+Byron+II · 2005-12-07 10:48 · Score: 5, Interesting

As a particle physicist I know exactly the kind of challenge that this is. The SNR is horrible, you've got tons of data, and the data may be distorted by all sorts of sources (background, misalignment, the wrong reaction, etc).

I also know that these sorts of algorithms are created all of the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play". They must be manually adjusted by a team with deep knowledge of the system in order to be useful.

So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.

Yarbles by Ranger · 2005-12-07 10:49 · Score: 1

I often have the same feeling about Slashdot. It's like a big haystack, but the needles are larger and easier to find. I have noticed that the Roland Piquepaille needles happen to be the most worthless. The obvious solution for finding the proverbial needle in the haystack of data is to make it up. It's not like there are any real-world examples.

--
"You'll get nothing, and you'll like it!"

Medical applications by Kontinuum · 2005-12-07 10:49 · Score: 1

Two types of biomedical research that have this "needle in a haystack" problem are functional magnetic resonance imaging (fMRI) and computational neuroanatomy.

In fMRI, very basically, you image the brain while the test subject is performing a task (looking at something, actively listening, tapping a finger, etc.) and while they are not, and use the change in local blood oxygenation to infer brain activity. Since this is a tiny signal, you repeat lots of times. The simplest way to determine where the activity is would be just to do a t-test against the background or against an assumption of no change. However, given many tens or hundreds of thousands or millions of pixels, you'll have lots of false positives, or have to use a really, really low p-value. Through the magic of spatial correlations and fancy math tricks, one can do reasonable interpretations of the data, but again, it's that sort of "needle in a haystack" problem.

In computational neuroanatomy, you scan lots of brains of normal folks and lots of brains of folks with neurodegenerative diseases, say, Alzheimer's (or younger old people and older old people, that sort of thing). You perform some complex warping to map these brains onto a template brain (a real person, the younger version of the person, or some synthetic template ... all are done), then study the warpings that are needed. What you want to see is how the various lobes of the brain are basically eroding with time as the disease progresses. Again, we can do standard statistics, but we are hurt by the massive number of data points we are dealing with (again, it's pixel by pixel), so we have to use more fancy math to get around it. In this case, theories of Hotelling and Adler (referenced in the original article from the original post) are very useful. As the amount of scientific data we have grows, we are starting to draw on what was once pure abstract mathematics to get meaningful statistics out.

I can't pretend to even begin to understand the PDF article, but it's neat to see the same problems in lots of very different fields!

Develop, not Discover by John+Newman · 2005-12-07 10:50 · Score: 1

From the title of TFA, "Case researchers discover methods to find 'needles in haystack' in data". Pet peeve of mine: new techniques are not "discovered", they are "developed" (or something similar). Henry Ford did not discover the Model T by peering through a microscope, and CowboyNeal did not discover SlashCode by analyzing reams of code observations. It may be semantic nit-picking, but I think saying that the researchers just discovered this (surely insanely complex) bit of mathematical analysis takes away from their creativity - it all came from their heads, not from under a rock.

Re:Develop, not Discover by dreamer-of-rules · 2005-12-07 11:04 · Score: 1

Debatable. When a mathematician works for years to determine the theoretical limit of compression, is it discovery? or development? This is not the same, but seems similar enough that I wouldn't get my panties in a twist over it. (Huh? Do I even have panties?)

In any case, props and honors to those who researched the problem to an improved solution.

--
Everyone is entitled to his own opinions, but not his own facts.

Unexpected Research by LtDrebin · 2005-12-07 11:07 · Score: 1

As a new graduate of the physics program at CWRU, I was quite surprised at Prof. Taylor's research. He's so caught up in his physics entrepreneurship program that I had no idea he was actually doing REAL research :) I hope this leads to some more funding for him. He is a great teacher.

Couldn't they just use... by jollyroger1210 · 2005-12-07 11:22 · Score: 1

..this? http://science.slashdot.org/article.pl?sid=05/12/05/1912216&tid=160&tid=126

--
Purple, because ice cream has no bones.

I know what they're looking for by dflipse · 2005-12-07 11:34 · Score: 1

I went to Case undergrad.

I'm almost positive they're trying to find a decent-looking single female on that campus. Powerful application of mathematics indeed.

Re:As a nervous system. by mako1138 · 2005-12-07 11:48 · Score: 2, Funny

Here's a couple TB of data. Find me all the top quark candidates by tomorrow.

What I perceive as a bigger problem... by Ogemaniac · 2005-12-07 12:14 · Score: 1

is the overwhelming size of the literature. It is getting harder and harder to find the information that you need among a sea of near misses. Even to stay on top of one's subfield would require reading at least five journal papers a day, which is a significant undertaking even before you have to spend large amounts of time hunting for papers. For example, I am a chemist. It is generally not too difficult to find papers about a specific molecule - each molecule is assigned a specific ID number, which can of course be searched, and then the results further whittled down by using relevant keywords. However, it can still be ridiculously hard to find such trivial information as "what is the best known method for making this molecule" or "what is this molecule soluble in?". Finding information on processes, however, has become a huge chore. If you think you have found a new way to make a class of molecules, you are in for days of sorting through papers hoping that no one has already had your idea - or worse yet, tried it, found that it didn't work, and never reported this information.

This information overload is pushing back the age at which scientists become productive. Back in the 1920s, many of the famous people you learn about made their huge discoveries in their 20s. Now, most Nobel-prize winning work is done in people's 40s and 50s. It simply takes that long to climb up the backs of all the giants that came before. At the rate it is going, in fifty years, scientists will die of old age before they can make it to the top.

We really need better ways to sort and condense this mass of information.

Finding a needle in a haystack is easy by Resseguie · 2005-12-07 12:22 · Score: 1

Comparing this type of data mining to "finding a needle in a haystack" isn't a good analogy.

Finding a needle in a haystack is relatively easy - you just look for anything that isn't hay.

But in this case, since you don't know what "hay" is (i.e. it's often hard to define "normal"), it's more like searching through a garbage heap hunting for something when you don't know what it looks like.

man memmem by ShadowFlyP · 2005-12-07 12:51 · Score: 1

NAME
    memmem - locate a substring

SYNOPSIS
    #define _GNU_SOURCE
    #include <string.h>

    void *memmem(const void *haystack, size_t haystacklen,
                 const void *needle, size_t needlelen);

DESCRIPTION
    The memmem() function finds the start of the first occurrence of the
    substring needle of length needlelen in the memory area haystack of
    length haystacklen.

I don't want to rain on the parade by martin-boundary · 2005-12-07 13:27 · Score: 3, Interesting

I don't want to rain on the parade, but the result is quite possibly wrong.

If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.

I think their derivation is quite possibly wrong.

Re:I don't want to rain on the parade by Anonymous Coward · 2005-12-07 15:30 · Score: 1, Insightful

I'm not a physicist, and I haven't had enough time to really look over the paper thoroughly, but I am a statistician.

My reading of the paper is that the Cauchy distribution is mentioned only to partially define a distribution that is used in an example. That is, there is nothing about the Cauchy distribution that is necessary for their results to hold. The Cauchy distribution is only relevant in an example, and only to partly define a density. Note, furthermore, that nowhere in the paper do they discuss the expectation of a Cauchy density, only the expectation of a score statistic. They do mention in the example that the Cauchy density is "centered" at a point E_0, but that's possible, as the central tendency of a Cauchy can be defined by the median of the distribution.

So you may be right, but I think that their discussion of the Cauchy doesn't detract from the rest of the paper.
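
To illustrate the mean-versus-median point for anyone who hasn't played with a Cauchy before, here is a small Python sketch. It is my own illustration and has nothing to do with the paper's score statistic; the function name and parameters are made up, and the sampling uses the standard inverse-CDF trick.

```python
import math
import random
import statistics

def cauchy_sample(n, center=0.0, width=1.0, seed=0):
    """Draw n Cauchy (Breit-Wigner) variates by inverting the CDF."""
    rng = random.Random(seed)
    # If U is uniform on [0, 1), then center + width * tan(pi * (U - 0.5))
    # follows a Cauchy distribution with that center and width.
    return [center + width * math.tan(math.pi * (rng.random() - 0.5))
            for _ in range(n)]

if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        xs = cauchy_sample(n, center=2.0)
        print(n, "mean:", sum(xs) / n, "median:", statistics.median(xs))
```

The sample median should settle near the center as n grows, while the sample mean keeps getting thrown around by the occasional huge draw from the tails. That is the sense in which the distribution can be "centered" at E_0 without having an expectation.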
Re:I don't want to rain on the parade by burns210 · 2005-12-07 15:37 · Score: 1

"Now, everyone knows that..."
You keep using that word. I do not think it means what you think it means.
Re:I don't want to rain on the parade by martin-boundary · 2005-12-07 16:34 · Score: 2, Insightful

That's a good point. In the paper, the formula (2) is finite only if the tails of f dominate the tails of psi, so that means that f would have to be at least as fat tailed as the Cauchy. However, the paper doesn't attempt to state any assumptions, so it's hard to see which parts are solid and where there might be handwaving.
Funnily enough, the density f they use in the monte carlo simulation appears to be truncated to be in the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in practice, they really do everything on the interval [0,2], and the psi they present isn't really a Cauchy in the first place.
Oh well, rigour still isn't a strong point of physics ;)
Re:I don't want to rain on the parade by jmtpi · 2005-12-07 19:17 · Score: 2, Insightful

martin-boundary wrote:
If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.

They use a Breit-Wigner because that's often a realistic model of the signal distribution, when one is talking about resonance production in a particle physics experiment. (My copy is at work, but I know this is discussed, for example, in Sakurai's Modern Quantum Mechanics.) I don't think this paper nearly lived up to the press release, and certainly isn't germane to Slashdot, but I don't think the use of a BW has anything to do with it.
On the other hand, I'm merely a particle physics grad student, and I didn't even attempt to read the center of the paper. If they really did come up with something that has more power than chi^2 (at least for an extremely simple fit) then that is notable. What would be really interesting would be for someone to come up with a real goodness-of-fit statistic for unbinned fits.

Nice quote by jim_deane · 2005-12-07 13:30 · Score: 1

Do you have a source for that quote?

It's a great quote, I'd love to be able to use it and attribute it properly.

Jim

Re:Nice quote by Shadow+Wrought · 2005-12-08 04:37 · Score: 1

Alas no. I remember hearing it in college (ten years ago) but exactly where, when, and from whom I heard it have long since dropped out of my memory banks :-(

--
If brevity is the soul of wit, then how does one explain Twitter?

Re:The most obvious application (way OT) by geekd · 2005-12-07 13:44 · Score: 1

Well, he's making HIS world better. Apparently, he can give a crap about the rest of us.

Got your ratio reversed by quanticle · 2005-12-07 14:20 · Score: 1

On Slashdot, the dupe-to-original article ratio is so high, it's the original articles that need finding, not the dupes. Funny, though, from what I've seen, it seems like this particular algorithm would be quite efficient at doing that (e.g. it specializes in finding the data that is different, versus categorizing existing data).

--
We all know what to do, but we don't know how to get re-elected once we have done it

Opaque paper by geordieboy · 2005-12-07 14:38 · Score: 1

I tried to understand their paper, and I must say it's exceedingly hard to understand exactly what they did. I suppose this is often the case with PRLs (due to the 4-page limit), but this one seemed particularly opaque and unimpressive. If I was going to write a paper I'd want to make it crystal clear, spell everything out (you can call me on that at the arxiv).

e.g. in the paper there is a quantity Z that is introduced first without definition, then a page later defined in terms of some vectors which are never defined. I guess you have to read their (unpublished) reference, but ugh. And the "geometry of the manifold"? What manifold? Wha? Are you a statistician or a wannabe-differential geometer?

Often it seems academics delight in trying to impress their peers with their terrible sophistication for some reason, to the point where it's really unnecessarily tough to understand something (and the high-falutin ideas in these papers usually turn out to be pretty simple and obvious or otherwise wrong, in my experience). Good job getting this one published indeed.

--
The world is everything that is the case

Au contraire by fireklar · 2005-12-07 16:05 · Score: 1

Actually, it is two paperclips put together: http://wiki.case.edu/Case_logo

Needle/Haystack by pdjohe · 2005-12-07 18:52 · Score: 1

Oh come on! It's not that hard!

public static Object find(Object needle, Object[] haystack) {
    for (int i = 0; i < haystack.length; i++)
        if (haystack[i].equals(needle))
            return needle;
    return null;
}

Able Danger by technoCon · 2005-12-08 04:07 · Score: 1

There are disputed reports that this sort of data mining was used to identify the terrorists who attacked the USS Cole and flew airplanes into the World Trade Center (the official 9/11 commission's findings notwithstanding). The project is well documented on the right side of the web and was called "Able Danger." According to rumor, the project was shut down after identifying Mohammed Atta but also pointing to Condoleezza Rice and Hillary Clinton as potential foreign spies.

This raises the issue of false alarms in any data mining operation. Rigorous secondary testing must be in place to weed out false positive signals. I heard Richard Feynman say that (in nuclear physics) it is painfully easy to fool yourself.
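
A quick back-of-the-envelope sketch of why false alarms dominate when the thing you are hunting for is rare. This is just Bayes' rule in Python; the function name is made up and every number in it is hypothetical, picked only to show the effect.

```python
def posterior_true_alarm(base_rate, hit_rate, false_alarm_rate):
    """Probability that a flagged item is a real signal, by Bayes' rule."""
    true_alarms = base_rate * hit_rate
    false_alarms = (1.0 - base_rate) * false_alarm_rate
    return true_alarms / (true_alarms + false_alarms)

if __name__ == "__main__":
    # Hypothetical numbers: 1 in a million records is a real signal, the
    # detector catches 99% of those, and wrongly flags 0.1% of everything else.
    print(posterior_true_alarm(1e-6, 0.99, 0.001))
```

With those made-up numbers, only about one flag in a thousand is real, which is why the secondary testing matters so much.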
