Text Mining the Multiverse

Posted by michael on Friday October 17, 2003 @08:41AM from the mother-lode dept.

The NYT has a decent piece about text-mining, skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

3 of 137 comments

Brute forcing the problem by metlin · 2003-10-17 08:49 · Score: 2, Interesting

To make sense of what it is reading, the software uses algorithms to examine the context behind words.

They make it sound as if semantic and contextual modeling is done on the fly -- the way I see this system, it works from a preset lexicon or database.

That's brute-forcing the problem again -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from the data itself.

The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.
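
To make the point about an evolving lexicon concrete, here is a rough sketch in Python. Everything in it is assumed for illustration: the lexicon contents, the two sample strings, and the function name lexicon_coverage are invented, and it says nothing about how the products in the article actually work. It only shows that a fixed word list recognizes less and less as the vocabulary drifts.

```python
# Hypothetical preset lexicon, frozen at one point in time (invented for illustration).
PRESET_LEXICON = {"email", "server", "modem", "newsgroup", "download"}


def lexicon_coverage(text, lexicon):
    """Return the fraction of words in `text` that the lexicon recognizes."""
    # Crude tokenization: split on whitespace, trim punctuation, lowercase.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in lexicon)
    return known / len(words)


# Invented samples: one uses the lexicon's vocabulary, one uses newer terms.
older_text = "download email server modem"
newer_text = "podcast email blog wiki"

print(lexicon_coverage(older_text, PRESET_LEXICON))
print(lexicon_coverage(newer_text, PRESET_LEXICON))
```

The second sample scores lower only because three of its four words are not in the frozen list, which is the failure mode described above.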

You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.

If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.

Fun with numbers by ajs · 2003-10-17 08:50 · Score: 2, Interesting

Here's some fun you can have with numbers. Take this Perl one-liner:
perl -ne '$x{$1}++ while /(\d)/g;END{print map {"$_ occurred $x{$_} times\n"} sort {$a<=>$b} keys %x}' xxxx
and run it with "xxxx" replaced by the name of some large text file that you create by saving email messages, web pages, log files, what have you.
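
For anyone who would rather not decode the one-liner, here is roughly the same count sketched in Python rather than Perl. The function names digit_counts and digit_shares and the sample string are made up for this example. The relative shares are an addition that the one-liner does not print; they are there only because proportions are easier to compare across files of different sizes.

```python
import re
from collections import Counter


def digit_counts(text):
    """Count how often each ASCII digit 0-9 occurs anywhere in `text`."""
    return Counter(re.findall(r"[0-9]", text))


def digit_shares(text):
    """Return each digit's share of all digits seen (empty dict if none)."""
    counts = digit_counts(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts[d] / total for d in sorted(counts)}


# Invented sample; in practice you would read a large file instead, e.g.
#   text = open("xxxx", errors="replace").read()
sample = "Received 2003-10-17 08:50 from host 10.0.0.1, 137 comments"

for digit, count in sorted(digit_counts(sample).items()):
    print(f"{digit} occurred {count} times")
```

Running digit_shares on two different files and comparing the results is one way to check the claim about the distribution for yourself.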

The scary part (which took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....

Some notes... by ekephart · 2003-10-17 09:36 · Score: 2, Interesting

(1) "Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is 'text mining,'"

Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim." ... Just a thought.

(2) As noted in the article, sarcasm is very hard to detect. If you think about it, even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don't even "get" it?

(3) "There is a need to make these technologies available for publicly available information," he wrote at his site.

Yes, of course. Anyone who has done research knows how frustrating it is to read through abstract after abstract, let alone the entire publication, to find what you are looking for. In research, when you are looking for facts or raw information, text mining seems highly promising. Yet for interpretive processes it grows increasingly difficult to envision a correct system. As noted, nuances are difficult to detect. In addition to sarcasm, words like "still" allow for multiple meanings for the bigrams, trigrams, etc. to which they belong. Natural language ambiguity is the most important problem to overcome in NLP. After all, how would you like to write a printf statement and not know whether you would get the intended output or some other arbitrary call?
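
To illustrate the point about "still", here is a small Python sketch that pulls out the n-grams containing a given word. The two sample sentences and the helper names ngrams and ngrams_containing are invented for the example, and the tokenization is deliberately crude. The sketch only shows that the same word turns up in n-grams with different senses; it makes no attempt to tell those senses apart.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_containing(text, word, n):
    """Return the n-grams of `text` that include `word` (case-insensitive)."""
    # Crude tokenization: split on whitespace, trim punctuation, lowercase.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return [g for g in ngrams(tokens, n) if word in g]


# Invented sentences: "still" meaning "continuing to" vs. "motionless".
sentences = [
    "He is still working on it",
    "Please hold still for a moment",
]

for sentence in sentences:
    print(ngrams_containing(sentence, "still", 2))
```

A system that only sees the bigrams has nothing beyond the neighboring words to decide which sense is meant, which is why the ambiguity is hard.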
