Mining Unstructured Data
jscribner writes "Data these days tends to an unstructured form, be it text (like the web, email, or books), spoken word, or even in DB's with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."
A Machine will be considered truly intelligent when it can translate all emails on slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.
---- El diablo esta en mis pantalones! Mire, mire!
Oooookay.
Sir? Please step away from the bong.
I just spent an ejoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest moments in e-Business. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.
Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.
What the hell are you people smoking?
"Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive" -- hey, that's me!
Interesting, my mining of hot ideas on Slashdot has determind that a Beowolf Cluster of First Posts is the next big thing...
"Good things don't end with eum, they end with mania or teria." - H. Simpson
By definition this is an unsolvable problem, because what it requires is definition of undefinition (if such a term exists). While you can make assumptions on unstructured data and apply Natural Language rules across it you are still left with the possibility that you've interpretted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning, the rate of accuracy can be improved but absolutes are impossible to attain.
Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it but you are always making assumptions.
To put in it simple terms for George W. Bush
All Muslims are Terrorists
All Supporters of Militia (McVeigh) are terrorists
Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.
Welcome to 1984, and a Brave New World, the minority will cease to count.
An Eye for an Eye will make the whole world blind - Gandhi
This article will have great importance to our director of IT, since the way our company stores data seems to completely unstructured.
-- Stanislav Shalunov
They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.
I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.
[Skip next section to avoid my self-promotion]
I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.
"email identified by interpretation rather than keywords"
Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.
Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.
Suggestion: Give me a dollar so I can start a business.
A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.
Use Ctrl-C instead of ESC in Vim!
- show a text and find other texts about the same subject.
- hum a tune and tell find an mp3 of the same music.
- show a picture and find other pictures of the same girl.
- better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...
Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.
It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old fashioned statistics and optimization.
Amazing magic tricks
This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."
Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets" No ammount of data mining will appropriately pull that info out of a simple text file.
I relate all this in terms of human-interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data-mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.
Oh, and Apple called--they'd like their Knowledge Navigator back, please.
Potato chips are a by-yourself food.