Text Mining the Multiverse


Posted by michael on Friday October 17, 2003 @08:41AM from the mother-lode dept.

The NYT has a decent piece about text-mining, skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

12 of 137 comments

I didn't read the article by Mattwolf7 · 2003-10-17 08:43 · Score: 2, Insightful

Why does Slashdot keep linking to articles that require NYT registration? Isn't there some sort of Google News link out there?
(Yes I am a lazy /. reader)
1. Re:I didn't read the article by Rick the Red · 2003-10-17 12:14 · Score: 2, Funny
  
  I feel really sorry for luser@aol.com, because I've signed him up for all sorts of things...
  
  --
  If all this should have a reason, we would be the last to know.
RTFA by devphaeton · 2003-10-17 08:44 · Score: 2, Funny

skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

Like those ppl who actually RTFA and try to get "FORST PIST!!!"?

--

do() || do_not(); // try();
Support non-whoring reg-free linkage! by Anonymous Coward · 2003-10-17 08:46 · Score: 5, Informative

Brought to you by your favorite anonymous non-whoring poster: the Google link.
The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page format for those that don't like to keep hitting "next".
Brute forcing the problem by metlin · 2003-10-17 08:49 · Score: 2, Interesting

To make sense of what it is reading, the software uses algorithms to examine the context behind words.

They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.

That's again brute-forcing the problem -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from data.

The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.

You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.

If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
Fun with numbers by ajs · 2003-10-17 08:50 · Score: 2, Interesting

Here's some fun you can have with numbers. Take this Perl one-liner:
perl -ne '$x{$1}++ while /(\d)/g;END{print map {"$_ occurred $x{$_} times\n"} sort {$a<=>$b} keys %x}' xxxx
and run it with "xxxx" replaced by the name of some large text file that you create by saving email messages, web pages, log files, what have you.

The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
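
For anyone who would rather not squint at the Perl, here is a rough Python sketch of the same count. It tallies every ASCII digit character in a piece of text (every digit, not just leading digits) and prints one line per digit in ascending order. The function names `digit_counts` and `report` are made up for this illustration, and restricting the match to `[0-9]` is an assumption about what the one-liner is meant to count.

```python
import re
from collections import Counter


def digit_counts(text: str) -> Counter:
    """Tally each ASCII digit character (0-9) found anywhere in the text."""
    return Counter(re.findall(r"[0-9]", text))


def report(counts: Counter) -> str:
    """Format the tally one digit per line, in ascending digit order."""
    # Keys are single characters '0'..'9', so a plain sort is also numeric order.
    return "\n".join(f"{d} occurred {counts[d]} times" for d in sorted(counts))
```

To try it on a file, read the file into a string and pass it in, e.g. `print(report(digit_counts(open("xxxx", errors="replace").read())))`, with "xxxx" replaced as described above.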
Red Necks by k_stamour · 2003-10-17 08:56 · Score: 2, Funny

"to extract some sort of refined knowledge from it." hum....
If you have an infinite number of rednecks .... an infinite number of shotguns & shotgun shells.... and an infinite number of stop signs, you will eventually get Shakespeare in Braille.....

--
Julius Caesar - Act I, Scene i: "What mean'st thou by that? Mend me, thou saucy fellow!"
Could do us a big favor by Strange Ranger · 2003-10-17 08:57 · Score: 2, Funny

...skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

Dear Text Miners,

Please start here: http://slashdot.org

Thanks so much.

--

Operator, give me the number for 911!
Well, DUH! by djeaux · 2003-10-17 09:13 · Score: 2, Insightful

How well computers truly make sense of what they are reading is, of course, highly questionable, and most of those who use text-mining software say that it works best when guided by smart people with knowledge of the particular subject.

May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.
Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...

--
"Obviously, I'm not an IBM computer any more than I'm an ashtray" (Bob Dylan)
but what about the data itself? by koekepeer · 2003-10-17 09:35 · Score: 2, Insightful

i always wondered about this

all right, you can take huge amounts of text and apply some smart tricks to extract patterns from it.

but how can you determine whether the original data was trustworthy?

take the example of genome annotation (description of gene function), which would be helped greatly by including more functional descriptions from scientific literature. how do you determine whether the original publication was backed by solid experimental research?

by the reviewers of the articles? i don't think so, peer review is a snakepit filled with politics. by the number of people who cited it? hmmmm... so hip subjects are more true?

personally, because i'm experienced, i can recognise bullshit articles when i see them. but how do you translate this into an algorithm... does anyone have any ideas about this? or even working solutions?

(of course this is an example from my field of expertise - biology, but it applies to any set of text data/articles IMO)
Some notes... by ekephart · 2003-10-17 09:36 · Score: 2, Interesting

(1) "Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is 'text mining,'"

Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim." ... Just a thought.

(2) As noted in the article, sarcasm is very hard to detect. If you think about it, even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don't even "get" it?

(3) "There is a need to make these technologies available for publicly available information," he wrote at his site.

Yes, of course. Anyone who has done research knows how frustrating it is to read through abstract after abstract, let alone the entire publication, to find what you are looking for. In research, when you are looking for facts or raw information, text mining seems highly promising. Yet for interpretive processes it grows increasingly difficult to envision a correct system. As noted, nuances are difficult to detect. In addition to sarcasm, words like "still" allow for multiple meanings for the bigrams, trigrams, etc. to which they belong. Natural language ambiguity is the most important problem to overcome in NLP. After all, how would you like to write a printf statement and not know whether you would get the intended output or some other arbitrary call?
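
To make the bigram point concrete, here is a minimal Python sketch that pulls out every n-gram containing a given word. The helper names `ngrams` and `contexts` and the sample sentence are invented for illustration and do not come from the article. The code only lists the contexts; it does not attempt to decide which sense of "still" is meant, which is the hard part.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams over a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contexts(tokens, word, n=2):
    """Count the n-grams that contain `word` at any position."""
    return Counter(gram for gram in ngrams(tokens, n) if word in gram)


# Made-up sentence using "still" as an adjective, an adverb, and a noun.
sample = "the water is still but we are still waiting by the still".split()
bigrams_with_still = contexts(sample, "still")
```

The same token shows up in several distinct bigrams, and nothing in the counts themselves tells a program which meaning applies in each one.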

--
sig
KDD Cup by apsmith · 2003-10-17 16:18 · Score: 3, Informative

The knowledge discovery and data mining cup challenge this year was looking at the arxiv.org papers for this sort of analysis - some very interesting results. The Task 4 winner looked at the structure of the papers as a sort of relational database and uncovered a lot of statistical patterns and metrics that could be quite useful for scientists.

--
Energy: time to change the picture.