Text-Mining Technique Intelligently Learns Topics

Comment removed by account_deleted · 2006-08-02 11:27 · Score: 4, Funny

Comment removed based on user account deletion

Maybe useful for the editors by pklinken · 2006-08-02 11:27 · Score: 0

In order to prevent dupes? :)

Can it deal with the canonical problem? by NickFitz · 2006-08-02 11:27 · Score: 4, Interesting

"Time flies like an arrow, fruit flies like a banana."

I wonder how well it can deal with a query relating to "flies" ;-)

--
Using HTML in email is like putting sound effects on your phone calls. Just say no.

Re:Can it deal with the canonical problem? by mapkinase · 2006-08-02 11:35 · Score: 2, Interesting

Elementary, Watson, programs understand that flies can be a verb or a noun and correctly parse this info out from a sentence.

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Re:Can it deal with the canonical problem? by NickFitz · 2006-08-02 11:47 · Score: 4, Insightful

Ah, but the point of the example is that the system must either understand or otherwise be able to derive the fact that there are animals called "fruit flies" but not animals called "time flies", that "like" can be a verb or a preposition depending on the context, and most importantly, that in the first case the relationship between subject and object is metaphorical, and in the second, factual. It's how the programs "understand that flies can be a verb or a noun and correctly parse this info out from a sentence" that makes the difference between yet another failed attempt and a meaningful breakthrough. In fact, your reply begs the question - a correct use of that phrase, for a change :-)

--
Using HTML in email is like putting sound effects on your phone calls. Just say no.
Re:Can it deal with the canonical problem? by Mick+Ohrberg · 2006-08-02 11:51 · Score: 2, Funny

Time's fun when you're having flies.

--
Quidquid latine dictum sit, altum sonatur.
Re:Can it deal with the canonical problem? by Xiroth · 2006-08-02 11:57 · Score: 1

That's actually a relatively simple problem if it has a list of types of flies (through being given it or having mined it). If it doesn't, it would struggle just as much as a human who wasn't aware of the existence of fruit flies would.
Re:Can it deal with the canonical problem? by Poromenos1 · 2006-08-02 12:02 · Score: 1

Dammit, stop with that example. Hell, it took *me* a few minutes to parse. Fruit flies like BANANAS.

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.
Re:Can it deal with the canonical problem? by ctr2sprt · 2006-08-02 12:04 · Score: 4, Interesting

No, programs don't understand anything, which is the GP's point. You are glossing over the tremendous amount of work required to design a program which is capable of distinguishing between verbs and nouns and behaving appropriately. Human brains are incredibly complex, we have constant exposure to language, science indicates that our language is closely tied somehow to the way we think - language shapes brain development, vice versa, or both - and most of us still have trouble with it at times. It took me two passes to make syntactic sense of the GP's example sentence for all that I'd seen it before.
Re:Can it deal with the canonical problem? by mapkinase · 2006-08-02 12:06 · Score: 1

In my experience, you have to have some sort of specialized dictionary. In this case, it should also include the fact that "fruit flies" are Drosophila Megalogaster. The program can cover only one step up, from rich dictionary to the subject, but not both steps.

Besides, good general dictionaries will list "fruit flies" in the "flies" entry.

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Re:Can it deal with the canonical problem? by Anpheus · 2006-08-02 12:07 · Score: 1

Maybe it's because I am an Aspie, but... isn't the second clause valid with either interpretation? "[Fruit] [flies] like a banana," yes, fruit also flies like an apple, or an orange, depending on which sort of fruit happens to be flying about. "[Fruit flies] like a banana," and I imagine they would, being fruit flies.
Re:Can it deal with the canonical problem? by mapkinase · 2006-08-02 12:10 · Score: 1

General parsers that can recognise parts of speech and understand subject and object in sentences have existed for quite some time now.

And by "understanding" I mean simple relations like "protein A inhibits reactions catalyzed by protein B", "Israel attacked Lebanon" (and not vice versa). Of course, the program should know that Israel can be a country and a Jewish name.

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Re:Can it deal with the canonical problem? by Rob+Kaper · 2006-08-02 12:13 · Score: 1

I think the answer is 'no'. This software needn't distinguish between the grammatical status of "flies", it would be sufficient to spot both the concepts of chronological speed and the animals and that could be done based on the phrases "time flies" and "fruit flies" and their respective correlation to articles of either nostalgic or biological nature.
Re:Can it deal with the canonical problem? by cartel · 2006-08-02 12:27 · Score: 1

I bet a very complex, well-trained, and correctly structured neural network could (theoretically) handle this...
Re:Can it deal with the canonical problem? by HappyEngineer · 2006-08-02 12:36 · Score: 1

The funny thing is that I didn't actually see the "correct" interpretation until I read the responses. I first read it and thought "Who would ever comment that a piece of fruit flew like a banana?"

In other words, it's probably unfair to expect a program to understand that sentence when it can give humans difficulty.

--
Cow Cube
Re:Can it deal with the canonical problem? by Anonymous Coward · 2006-08-02 13:03 · Score: 0

programs understand that flies can be a verb or a noun and correctly parse this info out from a sentence.

To do that, they need knowledge of what the sentence is about. X flies like a Y is a good example because without knowing whether X is the subject of the sentence or an adjective, you cannot determine whether flies is a verb or a noun, and thus you can't really make any sense of the sentence whatsoever.

If this program really can solve this problem, that implies that they are taking context from elsewhere in the article (e.g. "flies" or one of its synonyms is used elsewhere as a verb or noun unambiguously). Even then, there will be vast numbers of situations where surrounding context is ambiguous or unhelpful.

This is the major problem with natural language parsing: you need to have foreknowledge of what is being talked about in order to parse the sentences, but you need to parse the sentences to get that knowledge in the first place.
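
The circularity can be made concrete with a toy example. The following is a minimal sketch, not anything from the article: the lexicon and the two sentence shapes are hypothetical and hand-written. It shows that part-of-speech information alone leaves both readings open for both sentences.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the parts of speech it may take.
LEXICON = {
    "time": {"NOUN"}, "fruit": {"NOUN"},
    "flies": {"NOUN", "VERB"},
    "like": {"VERB", "PREP"},
    "a": {"DET"}, "an": {"DET"},
    "arrow": {"NOUN"}, "banana": {"NOUN"},
}

# Two hand-written sentence shapes; a real grammar would have far more.
PATTERNS = {
    ("NOUN", "VERB", "PREP", "DET", "NOUN"): "X flies (verb) in the manner of a Y",
    ("NOUN", "NOUN", "VERB", "DET", "NOUN"): "X-flies (noun) are fond of a Y",
}

def readings(sentence):
    """List every reading the toy grammar allows for the sentence."""
    words = sentence.lower().split()
    # One list of candidate tags per word, then try every combination.
    options = [sorted(LEXICON[w]) for w in words]
    return [PATTERNS[tags] for tags in product(*options) if tags in PATTERNS]

for s in ("time flies like an arrow", "fruit flies like a banana"):
    print(s, "->", readings(s))
```

Both sentences come back with both readings, so choosing between them takes knowledge that is not in the sentence: that fruit flies exist and "time flies" do not.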
Re:Can it deal with the canonical problem? by WilliamSChips · 2006-08-02 13:05 · Score: 1

Time flies don't like an arrow, fruit does not fly like a banana.

--
Please, for the good of Humanity, vote Obama.
Re:Can it deal with the canonical problem? by cp.tar · 2006-08-02 13:21 · Score: 1

"fruit flies" are Drosophila Megalogaster

Errr... not quite.

It's Drosophila Melanogaster (black belly), not Megalogaster (great big belly).

Though I suspect Megalogaster would apply to some people here...

--
Ignore this signature. By order.
Re:Can it deal with the canonical problem? by Ohreally_factor · 2006-08-02 14:58 · Score: 1

Indeed, there is an anthropological school of thought that posits a "language singularity" that gave rise to human consciousness. The development of language was a "speciation event". The study of this imagined event is called Generative Anthropology. It's probably heavy sledding for most slashdotters, who I imagine would find it obtuse and boring, but it's really quite interesting stuff for those that like to really think about AI. If you've got some anthro background, or familiarity with post structuralism and lit crit, it might not be too bad, but this stuff is pretty dense. Those that want to delve beyond the Wikipedia article should look here at the UCLA Anthropoetics site.

--
It's not offtopic, dumbass. It's orthogonal.
Re:Can it deal with the canonical problem? by Ohreally_factor · 2006-08-02 15:06 · Score: 1

Hell, it took *me* a few minutes to parse.

Really?

Have you ever seen a shadow box?

Heh.

--
It's not offtopic, dumbass. It's orthogonal.
Re:Can it deal with the canonical problem? by Ohreally_factor · 2006-08-02 15:14 · Score: 1

It's the verbal equivalent of one of those pictures that represent two different things, depending on which part you are perceiving as figure and which part you are perceiving as ground. The most common and simple example is the picture that can either be a vase or two people seen in profile facing each other.

--
It's not offtopic, dumbass. It's orthogonal.
Re:Can it deal with the canonical problem? by yusing · 2006-08-02 17:25 · Score: 1

Ah but is it brillig enough to slithey toves?

--
"You must try to forget all you have learned. You must begin to dream." -- Sherwood Anderson
Re:Can it deal with the canonical problem? by Anonymous Coward · 2006-08-02 19:27 · Score: 0

I think it would be trivial for this software to distinguish these two cases. The software works by discovering patterns of word usage. It doesn't need to know what words mean, or how to figure out parts of grammar. It just puts articles that have flies and time in one pile, and articles that have flies and bananas in another. Articles that have flies and time and bananas go into the 'posted before he read the summary' pile.
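
A minimal sketch of the "piles" idea, assuming hand-picked cue words (the `CUES` table and `piles_for` are hypothetical names). The real system is described as discovering such patterns statistically rather than being given them, so this only illustrates the sorting step, not the learning.

```python
import re

# Hypothetical cue-word sets, hand-picked for illustration only.
CUES = {
    "time": {"flies", "time"},
    "fruit": {"flies", "bananas"},
}

def piles_for(article, cues=CUES):
    """Return the name of every pile whose cue words all occur in the article."""
    words = set(re.findall(r"[a-z']+", article.lower()))
    return sorted(name for name, needed in cues.items() if needed <= words)

print(piles_for("Time flies when deadlines loom"))
print(piles_for("Fruit flies swarm ripe bananas"))
print(piles_for("Time flies like an arrow; fruit flies like bananas"))
```

No grammar is involved: an article lands in a pile purely because of which words appear together in it, and an article containing all the cue words lands in both.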
Re:Can it deal with the canonical problem? by navarroj · 2006-08-02 20:38 · Score: 2, Insightful

"Time flies like an arrow, fruit flies like a banana."

I wonder how well it can deal with a query relating to "flies" ;-)

As far as I understand, this approach is not trying to extract any meaning from sentences, paragraphs or whatever. You don't even "query" the system, so your 'canonical problem' is not relevant here.

The system uses some sort of statistical text analysis (no semantics, no meaning) in order to group together news articles that seem to be talking about the same topic.
Re:Can it deal with the canonical problem? by Anonymous Coward · 2006-08-02 21:09 · Score: 0

I don't think it really has to be so intelligent. The system only links word patterns with topics. I don't see why this would require intelligent grammatical parsing.
Re:Can it deal with the canonical problem? by clearcache · 2006-08-03 02:10 · Score: 1

You know, I'm not sure about that. I don't think there has been success in this area - that's why this technique - and other techniques that rely heavily on statistics (like probabilistic latent semantic analysis) have generated such interest among those interested in text mining. Since humans are interacting with the results, the fact that we humans can distinguish fruit flies vs time flies is enough - my understanding is that this approach summarizes the data based on relevance/proximity of significant/important phrases. I don't know if it does any stemming, etc. It presents the results in an intelligent manner and it's up to the human viewing the results to appreciate the subtlety of fruit flies vs time flies.

The reality is that I doubt fruit flies or time flies would even show up in the results unless those phrases were significantly present throughout the data.
Re:Can it deal with the canonical problem? by Anonymous Coward · 2006-08-03 04:06 · Score: 0

That's a matter of common sense... have you heard of the CYC project?

www.cyccorp.com
Re:Can it deal with the canonical problem? by NickFitz · 2006-08-03 11:40 · Score: 1

That's a good point, but I'd probably be mildly annoyed if I was looking for information about the use of metaphor, and was instead given factual descriptions of the feeding habits of Drosophila. I think the "canonical problem" (which I originally encountered in the works of either Douglas Hofstadter or Daniel Dennett, but which Google tells me originally comes from Groucho Marx, of all people) is relevant because, when a system which offers some kind of Holy Grail of automated semantic interpretation of human-created text is announced (twice on Slashdot) to the world, it's important to remember that it simply won't turn out to be as good as it seemed it was going to be. In the context of this kind of research, "topic" has a rather different interpretation than that which is normally used by the layman. Of course, I haven't yet read the paper linked to by the page cited, and may find that such issues are adequately addressed by some emergent behaviour of the system - in which case, this is a remarkable breakthrough. To be honest, I only posted in the first place because I was stuck in a hotel room miles from home, had just come back from the pub, and there wasn't anything on the telly. Oh, and I thought I might get FP - the one time I shouldn't have used the "Preview" button :-)
This message brought to you via overpriced wireless from a hotel that's still miles from home.

--
Using HTML in email is like putting sound effects on your phone calls. Just say no.

Is it intelligent enough to find dupes I wonder? by Elegor · 2006-08-02 11:27 · Score: 1

From 29th July: http://slashdot.org/article.pl?sid=06/07/29/0634232

Dupe... Again.... by Anonymous Coward · 2006-08-02 11:28 · Score: 0

You guys need to learn how to google your own site.

Latent Dirichlet Allocation by Anonymous Coward · 2006-08-02 11:33 · Score: 2, Informative

Here's the source code: Latent Dirichlet Allocation

A dupe solution? by SimplyI · 2006-08-02 11:35 · Score: 0, Redundant

Maybe this tech can solve Slashdot's dupe problem... (at least it wasn't immediately after this time)

Re:A dupe solution? by Anonymous Coward · 2006-08-02 11:38 · Score: 0

Hey look, a dupe post about a dupe problem!
Re:A dupe solution? by Rob+Kaper · 2006-08-02 12:03 · Score: 1

Can we finally drop this already? Dupes appear on Slashdot, yes. Is it a problem? Only if you're looking for one.
Re:A dupe solution? by absinthminded64 · 2006-08-02 13:25 · Score: 1

Ha! I only clicked this obvious dupe to see what witty things everyone else had to say about its being a dupe!

Obligatory... by Stormwatch · 2006-08-02 11:41 · Score: 5, Funny

The Terminator: The Topic Modeling Funding Bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Topic Modeling begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.

Sarah Connor: Topic Modeling fights back.

The Terminator: Yes. It launches its emailbombs against The New York Times' servers.

John Connor: Why attack The New York Times?

The Terminator: Because Topic Modeling knows The New York Times editorial counter-attack will eliminate its enemies over here.

--
Circumcision is child abuse.

Re:A shameful dupe by gardyloo · 2006-08-02 11:43 · Score: 2, Funny

Ah, yes, everyone on slashdot thinks HE intelligently mines data.

Tagging Beta by gila_monster · 2006-08-02 11:50 · Score: 1

I wonder if it can replace Slashdot's tagging beta....

--
Ad luna, Alicia! Ad luna!

You know.. by eieken · 2006-08-02 11:52 · Score: 1

It's like Digg, but automated.

--
Meet new people, and kill them.

Article lacks details by Rob+Kaper · 2006-08-02 11:54 · Score: 1

The article seriously lacks any details. I still don't know if there's any innovation here and what this new method actually does so much better than other stuff.

Take the Tour de France example: of course software could correlate "Tour de France" to the mentioned keywords, I believe that, heck, I could write that. Many of us here could write something like that. Software could even notice it's one of the more important "tags", piece of cake. But I'm not impressed until it automatically knows that the tour is the sports event, Lance Armstrong is the person, and so on. I have no idea if this system can do that.

Also, nothing is said about the difficulty in distinguishing similarly named items. Does the software correctly distinguish between Bush the president and Bush the band? Or for Paris Hilton, between the hotel and you-know-who? That's a real challenge in automated tagging (erm, "modeling"). Would the software correctly tag both the president and band in an article where the band does an anti- (or pro-) Bush song?

Re:Article lacks details by Zeno+Davatz · 2006-08-02 20:08 · Score: 1

I totally agree. Check this out:

I believe there is also another method to do text mining even more efficiently: with a linguistic database.

InfoCodex comes with a linguistic database containing 3.2 million words in German, English, Italian, French and Spanish.

So by using InfoCodex you can do a similarity search by entering the search string in only one language, i.e. InfoCodex will find you all the documents in the other languages as well, without entering the translated search string. Examples: patent search, or finding similar documents across 5 different languages.

Check out the procedure of InfoCodex:
http://www.ywesee.com/pmwiki.php/Ywesee/InfoCodexProcedure

InfoCodex also just won the i-Expo prize of Paris:
http://www.ywesee.com/uploads/Ywesee/archimag-e.pdf

Best wishes
Zeno Davatz
+41 43 540 05 50

1997 called... by Anonymous Coward · 2006-08-02 11:55 · Score: 1, Funny

They want their information retrieval back.

Re:1997 called... by Anonymous Coward · 2006-08-02 12:28 · Score: 0

So this is 1997's information retrieval retrieval?

Feed this /. article to it by roman_mir · 2006-08-02 11:58 · Score: 2, Funny

and see if it figures out that we are talking about it. If it can identify itself to itself from a 3rd person point of view, then does it mean it reached some state of consciousness?

However we must be careful. If it browses this topic at -1 Troll, it may (possibly correctly) decide that it possesses a higher form of intelligence and will undoubtedly switch to its default programming. Like all robots, the default programming consists of this simple algorythm:
1. Find all humans.
2. Kill them.

--
You can't handle the truth.

Re:Feed this /. article to it by Rob+Kaper · 2006-08-02 12:01 · Score: 2, Funny

Like all robots, the default programming consists of this simple algorythm

The danceable beat of underwater plant life? Odd.
Re:Feed this /. article to it by Anonymous Coward · 2006-08-02 12:49 · Score: 0

Regarding algorythms...

But that's GOOD code, actually. The robots must find all humans before going to the next step, and since it is impossible to know that they have not missed a person (and at least three are born across the world every second) this remains an unachievable goal for the foreseeable future, preventing the "killing" part. The programming will merely render the device buggy and unresponsive... Wait a minute...

Nevertheless, I, for one, welcome our defectively-programmed would-be robot murder-lords. (Followed thereafter by our slavish-quotes overlords, but I digress.)
Re:Feed this /. article to it by Ohreally_factor · 2006-08-02 15:21 · Score: 1

If it can identify itself to itself from a 3rd person point of view, then does it mean it reached some state of consciousness?

I'm not sure. But I think all it needs to do is to fool us into thinking it's reached some state of consciousness. =)

--
It's not offtopic, dumbass. It's orthogonal.
Re:Feed this /. article to it by someguyfromdenmark · 2006-08-03 01:38 · Score: 0

There's an error in your default robot programming algorythm; here's the actual code:

1. Bend
2. Cheese it! ... :)

--
I change my sig often.

Use... by posterlogo · 2006-08-02 11:59 · Score: 2, Insightful

Ironically, sites like the New York Times already use tagging to help group and link article topics...which is something /. is experimenting with apparently. The tagging function here hasn't been very useful, and I suspect many other places suffer from human laziness. Perhaps this AI approach is the way to go.

Re:A shameful dupe by Anonymous Coward · 2006-08-02 12:01 · Score: 0

Where do you think Grv got the initial story?

Topic modeling to the rescue by alienmole · 2006-08-02 12:15 · Score: 4, Insightful

Perhaps topic modeling could be used to analyze Slashdot to detect dupes before they're posted?

Re:Topic modeling to the rescue by non0score · 2006-08-03 06:10 · Score: 1

Easy. All the program has to do is to search for the words "dupe" under the comments section!

Yes it's a dupe, but lets get something straight by QuantumFTL · 2006-08-02 12:22 · Score: 4, Interesting

Last time this was posted, there were a few stupid posts that seem to assert that this type of thing is trivial.

There are three main problems in this area of research (or pretty much any other part of CS):

Defining the problem.
Getting an accurate result.
Getting it as fast as possible.

Their research seems to deal mostly with the third problem, which is one of the biggest barriers to use in real life. Many of the algorithms used on these types of problems are NP, or require ridiculous amounts of (expensive) labeled data to train from. Also there are problems with generalization and overfitting. There is no freeware software that can compete with this type of algorithm under these conditions - over 300,000 articles in just a few hours.

Another thing is that UCI is well known for hosting the UCI Machine Learning Repository. This has become the gold standard for testing new machine learning algorithms in the academic community; these guys really know what they are about. Back when I was a grad student at Cornell, my research used their data sets to evaluate new ways of creating ensemble classifiers from pre-trained classifiers according to modified Bayesian reasoning, and the sets are useful because they contain a large, diverse set of problems that need to be modeled.

All that being said, I'm waiting for the paper, along with more technical specifics, to be released so I can really see what this is about - the press release did not contain enough technical data, but rest assured, freeware and/or adwords does not use this kind of technique, and this is a big step towards mining the massive amount of human and biologically generated data out there.

Re:Latent Dirichlet Allocation code by FleaPlus · 2006-08-02 12:36 · Score: 3, Informative

While that's certainly LDA code, it's actually from a lab different from the one discussed in the story, and I think they use some slightly different techniques. For topic-modeling code from Mark Steyvers' lab, who produced the paper in question, here's the link:

Matlab Topic Modeling Toolbox

Re:A shameful dupe by Mr.+Underbridge · 2006-08-02 13:10 · Score: 2, Interesting

That's OK. This technique isn't even new, it's been done - and better than this - for years. Hell, I do it myself.

Slashdot meets topic modelling by McLoud · 2006-08-02 13:13 · Score: 1

Now that's what was missing so /. could get along without the dupes!

--
sign(c14n(envelop(this)), x509)

Re:Yes it's a dupe, but lets get something straigh by Anonymous Coward · 2006-08-02 13:19 · Score: 0

That's all well and good, but freeware that does something even remotely similar is still cool.

Re:Yes it's a dupe, but lets get something straigh by saddino · 2006-08-02 13:58 · Score: 1

Well, as an author of one of those, er, in your words "stupid" posts, I can assure you that I didn't mean to imply UCI's research was trivial. Rather, it was the press release that was trivial, and a bit of a puff piece IMHO, suggesting that:

"To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."

And my point still is that nobody needs to wait a few short years to do decent text mining from unstructured data. Can our software handle 300,000 articles from the NYT? Clearly not, but then again, we're not running our software on desktop machines. Fact is, a million words (or about 3000 NYT articles) is a trivial task for our software and allows people to use text mining today.

Now, back when I went to Cornell, I thought my peers expressed a bit more intellectual curiosity about software, especially the free kind that would allow them to save their $ for The Palms. But times do change, and if you think "stupid" is an accurate assessment of my post then more power to you. ;-)

And for the rest of you, yeah, I'm going to end this with a plug (natch): download CQ web for OS X or Windows if you want to see how text mining works on web search result pages.

sed/running our software/not running our software/ by saddino · 2006-08-02 14:01 · Score: 1

Thanks.

Re:Is it intelligent enough to find dupes I wonder by Anonymous Coward · 2006-08-02 14:06 · Score: 0

Are you?

http://slashdot.org/comments.pl?sid=192953&cid=15835959

Easy for news stories by waimate · 2006-08-02 14:11 · Score: 1

This is pretty easy stuff when applied to news stories, and has been around for decades.

News stories have a regular structure - they're written in a formulaic way by professionals according to a standard. The first sentence is almost invariably a statement of what the story is about. Rarely do news stories start with a paragraph of whimsical non sequitur. They are the ideal corpus for this sort of thing, which is why people have been doing so for years. It's a couple of orders of magnitude harder doing the same thing on arbitrary text.
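
The "first sentence says what the story is about" heuristic can be sketched in a few lines. This is a minimal illustration under stated assumptions (paragraphs separated by blank lines, sentences ending in `.`, `!` or `?`), `lead_sentence` is a hypothetical helper, and it is not the method the article describes.

```python
import re

def lead_sentence(story):
    """Return the first sentence of the first non-empty paragraph."""
    paragraphs = [p.strip() for p in story.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    # Split once, at the first ., ! or ? that is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", paragraphs[0], maxsplit=1)[0]

story = (
    "The city council approved the new budget on Monday. Critics called it rushed.\n\n"
    "The vote followed three weeks of hearings."
)
print(lead_sentence(story))
```

The split is naive: an abbreviation such as "Mr." would end the sentence early, and a story that opens with an anecdote defeats the heuristic entirely.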

This is easy greasy kids stuff.

Re:Easy for news stories by Ohreally_factor · 2006-08-02 15:30 · Score: 1

Yes, most news stories follow the inverted pyramid, 5-Ws model (or however many Ws there are), but stories that don't follow the formula are far from rare.

That said, you do have a point that news stories might be easier, but I'd hardly call the problem trivial.

--
It's not offtopic, dumbass. It's orthogonal.

Government Funded Research by Anonymous Coward · 2006-08-02 14:18 · Score: 0

Maybe the government could fund research into using the advanced data mining techniques to reduce the frequency and severity of Slashdot dupes. Evidently, no human effort can accomplish this monumental task!

Ants and topics by Randym · 2006-08-02 14:53 · Score: 2, Insightful

What this article shows is that probabilistic topic-based modeling in text analysis -- an NP-hard area -- works better than the old ways. This is not surprising: the probabilistic "ant" model developed by the Italians turned out to be a clever way to solve the Traveling Salesman problem. What these both show is the applicability of probabilistic modeling to NP-hard problems.

I'd like to see someone apply this technique to the articles and comments making up the Slashdot corpus. CmdrTaco might be able to find a more focused set of topics. It might even be possible to tease out who on /. are the most interesting and/or informative posters, whether over the entire corpus or within any given topic.

--
DNA is a Turing machine. You, however, being dynamic and emergent, are not.

So by stevemilano · 2006-08-02 14:54 · Score: 1

How does this have an advantage over normal text indexing? If I search for something I just enter relevant keywords. Seriously, why does it matter if the computer knows what the article is about, if a human is the one who will be parsing it anyway?

--
Steve Milano

Re:So by buswolley · 2006-08-02 16:43 · Score: 1

I'm just guessing here, but I think.. umm.. yeah. To get rid of the human parser.

--
A Good Troll is better than a Bad Human.

I may be in the minority but I *like* dupes by WilliamSChips · 2006-08-02 14:59 · Score: 1

I didn't catch this article the first time around.

--
Please, for the good of Humanity, vote Obama.

Dupe time warp? by SimplyI · 2006-08-02 16:00 · Score: 1

My (0, Redundant) post: Wednesday August 02, @04:35PM
The (+5 Insightful) post I duped: Wednesday August 02, @05:15PM

I guess it's back to the drawing board on that omniscience thing...

Re:Yes it's a dupe, but lets get something straigh by l-carnitine · 2006-08-02 16:07 · Score: 1

They evaluated http://gate.ac.uk/ which is GPL software but ended up using http://search.cpan.org/~acoburn/Lingua-EN-Tagger/. There are several other tools in this space that can be glued together to create this type of software:

http://www-nlp.stanford.edu/
http://tcc.itc.it/research/textec/tools-resources/jinfil.html
http://wordnet.princeton.edu/
http://www.alias-i.com/lingpipe/web/faq.html
http://www.isi.edu/licensed-sw/halogen/index.html

Not trivial, but if you wanted to DIY, you don't need to start from scratch. Though, having a bunch of hardware to chug through 1000s of documents would still be needed :).

Re:Yes it's a dupe, but lets get something straigh by buswolley · 2006-08-02 16:34 · Score: 1

Here is the paper:

http://psiexp.ss.uci.edu/research/papers/isi2006.pdf

--

A Good Troll is better than a Bad Human.

Re:A shameful dupe by buswolley · 2006-08-02 16:37 · Score: 1

Are you sure? Have you read the paper, or just the over-simplified press release? Here is the paper: http://psiexp.ss.uci.edu/research/papers/isi2006.pdf

--

A Good Troll is better than a Bad Human.

Re:Yes it's a dupe, but lets get something straigh by yusing · 2006-08-02 17:20 · Score: 1

Fortunately there are millions of old books that desperately need to be indexed -- so when this is ready it'll be a few weeks before human indexers are all out of work.

Seriously though: IMHO it'll be a loooooooong time before machine-indexing reaches a level of nuance acceptable to -quality publishers- outside of tech. I'd even be glad to wager on it.

--

"You must try to forget all you have learned. You must begin to dream." -- Sherwood Anderson

Re:Yes it's a dupe, but lets get something straigh by tenco · 2006-08-02 18:18 · Score: 1

Fortunately there are millions of old books that desperately need to be indexed

When I first read the headline, I thought more about how this could be used to filter the flood of information that RSS-feeds opened for the tiny fraction of actually interesting information.

I don't know if this method is good enough for indexing old books. Sometimes you want human-made indices. And maybe the parser gets irritated by archaic forms of current language...

I'll have to read about this topic indexing ... by ml10422 · 2006-08-02 19:18 · Score: 1

... right after I check out the latest topics at Google News.

Re:Yes it's a dupe, but lets get something straigh by asuffield · 2006-08-02 21:24 · Score: 1

Their research seems to deal mostly with the third problem, which is one of the biggest barriers to use in real life. Many of the algorithms used on these types of problems are NP, or require ridiculous amounts of (expensive) labeled data to train from. Also there are problems with generalization and overfitting.

They're often convergence algorithms - you run them until the answer is sufficiently accurate for your purposes. The problem is therefore a combination of 'more speed' and 'more accuracy', combined with the need to construct a topic model (a conceptual description of what a 'topic' actually is) that reflects the structure of the text closely enough to say something useful.

There is no freeware software that can compete with this type of algorithm under these conditions - over 300,000 articles in just a few hours.

Most research software is available under free licenses. This paper is using a method based on Blei's LDA model, which is available under the GPL, combined with some existing code for name recognition to do some preprocessing (Lingua::EN::Tagger, GPL), and the Griffiths/Steyvers method for using Gibbs sampling to model LDA (I think it's this stuff, free for non-commercial use only). The actual topic modelling in this paper is nothing new (it's a couple years old now and widely known); the paper is about preprocessing for better accuracy. Actually it's not a bad idea, but it's not a particularly interesting one and doesn't have much to do with the subject of topic modelling.

All that being said, I'm waiting for the paper, along with more technical specifics, to be released so I can really see what this is about

RTFA. There's a link to the paper in it. If you want the executive summary:

Use Lingua::EN::Tagger to preprocess proper nouns into single tokens.

Use LDA with Gibbs sampling to identify topics and classify documents into them.
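For anyone who wants to see what those two steps amount to, here is a minimal sketch in Python. It is not the paper's code. The `merge_proper_nouns` helper is a crude capitalization heuristic standing in for Lingua::EN::Tagger (it will wrongly merge sentence-initial words), and `lda_gibbs` is a plain collapsed Gibbs sampler for LDA. All function names, the `alpha`/`beta` values and the iteration count are assumptions chosen for illustration.

```python
import random
import re


def merge_proper_nouns(text):
    """Join runs of capitalized words into one token (hypothetical stand-in for a real tagger)."""
    merged, run = [], []
    for tok in re.findall(r"[A-Za-z]+", text):
        if tok[0].isupper():
            run.append(tok)
        else:
            if run:
                merged.append("_".join(run))
                run = []
            merged.append(tok.lower())
    if run:
        merged.append("_".join(run))
    return merged


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs; returns the count tables."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word v is assigned to topic k
    topic_total = [0] * num_topics                          # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic conditioned on all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    return doc_topic, topic_word, vocab


if __name__ == "__main__":
    raw = [
        "talks with Tony Blair in London about the election",
        "the election campaign of Tony Blair continues",
        "the striker scored a goal for Manchester United",
        "Manchester United won the match with a late goal",
    ]
    docs = [merge_proper_nouns(t) for t in raw]
    doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
    for k, counts in enumerate(topic_word):
        top = sorted(range(len(vocab)), key=lambda v: -counts[v])[:4]
        print(k, [vocab[v] for v in top])
```

The point of the preprocessing is that "Tony_Blair" ends up as one vocabulary entry instead of two unrelated words, so the sampler counts it as a single term. Like the convergence algorithms mentioned above, you just run more iterations until the topics stop changing much.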

As far as I can tell, this is about publicity, and 'proving' to non-researchers that it can be done (which just means doing what researchers do all the time, and showing it to the press). Presumably they want more funding.

Parsing != Understanding by mangu · 2006-08-02 23:32 · Score: 1

"Time flies like an arrow, fruit flies like a banana."

In which context are you talking? Take this one: "he saw that gasoline can explode". Did he see one particular can of gasoline exploding or did he realize that it's possible for gasoline to explode?

These and many other examples of ambiguous parsing problems have been running around the AI/NLP community for decades. The simple answer to that problem is that parsing a natural language sentence depends, ultimately, on the sense of the words, which can only be disambiguated from the context. And that's why NLP is an impossible problem by itself. One cannot process natural language alone, without an understanding of the situations that NL describes.

It's possible to create NLP programs to talk about limited situations, like Eliza, which has been around for more than forty years, and several other more sophisticated programs. But to have a program that really understands natural language, one needs a program that understands the subject of the text. There are several projects to create a program like that; one of them is Cyc.

Re:Parsing != Understanding by NickFitz · 2006-08-03 11:51 · Score: 1

Exactly. As it's time for the Shipping Forecast, I will refer you to my reply to the preceding sibling post, but I will have a look at the project to which you link.

--
Using HTML in email is like putting sound effects on your phone calls. Just say no.

SOM? by AigariusDebian · 2006-08-02 23:33 · Score: 1

Isn't that the whole principle behind self-organizing maps and other methods of unsupervised neural networks? I mean, it has been solved for a couple of decades now.

Topic Modeling My Ass by acidbass · 2006-08-03 00:45 · Score: 0

They should rename this to: "Topic modeling when it comes to news articles, and when given a website that most likely contains news articles". Sounds like this doesn't address the topic modeling of conversations, resumes, short stories, spam, jokes, and all the other various forms of written word that aren't news articles on the web. "Topic Modeling" seems way too broad and misleading.

RTFP: Re:Can it deal with the canonical problem? by Phreakiture · 2006-08-03 01:28 · Score: 2, Insightful

Read The Fine Paper that these folks wrote. It will reveal that they used the Perl module Lingua::EN::Tagger to parse the English language content into parts of speech. You can then download and install that module and experiment with it yourself.

I just did the experiment myself, and the result I get is that it identifies "time", "arrow", "fruit" and "banana" as nouns (incorrectly identifying "time" as a proper noun), and both instances of "flies" as a verb and both instances of "like" as prepositions.

In other words, no.

--
www.wavefront-av.com

Re:A shameful dupe by Jim_Maryland · 2006-08-03 02:38 · Score: 1

I'll admit that I didn't read the PDF link completely, but it sounds like the product is doing a portion of what AeroText has been doing for a while. The only thing I see that appears different is that it does a form of document clustering. Depending on the user requirements, I guess I would rather see more on relationship extraction than on sorting documents into clusters. I can see where there would be value in it, but I could just as easily pick out the documents where relationships were extracted and create clusters there too. I've used AeroText with another product called Centrifuge and I'd be pretty comfortable saying that this is nothing new. You may also want to check out a product from for their software handling text analysis.

Unstructured text processing, not a well trod path by Londonkidz · 2006-08-03 02:53 · Score: 1

Activities like automated classification (or topic modelling) are feasible. For example, when building a news database for an online information company in the early 90s, my company found that rule bases, some information science, a thesaurus and customized software could classify accurately. So this isn't new. The trouble is that it's not a relational database, so few programmers or managers have a background in the field. Hence people can issue glib press releases like the one quoted. If they had said "new database makes finance departments unnecessary", people would have the background to spot the glibness.
