Mining Unstructured Data
jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), spoken word, or even DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."
This is a fascinating article. I'm especially interested in it, because I tend to work with databases - most of which are created from completely unstructured data.
For instance, Company A is tracking all their data in a Microsoft Word document. Frequently I get asked to dynamically work with this data, and pull it directly out programmatically. I can attest to how difficult this can be sometimes, and I frequently find that upper management doesn't understand the challenges behind pulling unstructured data out.
I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand, and *poof* the data moves from his text file into a database.
By definition this is an unsolvable problem, because what it requires is a definition of undefinition (if such a term exists). While you can make assumptions about unstructured data and apply natural-language rules across it, you are still left with the possibility that you've interpreted it incorrectly. Creating definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved, but absolutes are impossible to attain.
Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it, but you are always making assumptions.
To put it in simple terms for George W. Bush:
All Muslims are Terrorists
All Supporters of Militia (McVeigh) are terrorists
Unstructured data is a great way to make money, and a great way to get 80% of the story; the trouble is that the other 20% gets destroyed in the process.
Welcome to 1984 and a Brave New World: the minority will cease to count.
An Eye for an Eye will make the whole world blind - Gandhi
We used Perl regular expressions and lex/yacc-like tools to tease structured data out of semi-structured web pages and other listings. It's doable if you limit your scope to one particular subject, such as job listings. The hardest part is creating contextual lexicons. Does MS mean that a master's degree is required? That the job is located in Mississippi? That experience with Microsoft products is required? That the HR contact is Ms. Smith? You have to figure it out based on context: is MS preceded by a city name, that type of thing.
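To make the "MS" problem concrete, here is a rough sketch of a context-based lookup. It is in Python rather than the Perl and lex/yacc tooling mentioned above, and the rule list, the labels, and the name classify_ms are all made up for illustration; a real contextual lexicon for job listings would need far more rules than this.

```python
import re

# Hypothetical context rules, checked in order; the first match wins.
# The patterns are illustrative only, not a complete lexicon.
RULES = [
    # "Ms. Smith": mixed-case "Ms" followed by a capitalized name
    ("honorific", re.compile(r"\bMs\.?\s+[A-Z][a-z]+")),
    # "Jackson, MS": a capitalized word and a comma before MS
    ("state", re.compile(r"\b[A-Z][a-z]+,\s*MS\b")),
    # "MS degree", "MS in Computer Science", "BS/MS"
    ("degree", re.compile(r"\bMS\b\s+(?:degree|in\b)|\b(?:BS|BA)\s*/\s*MS\b")),
    # "MS Office", "MS Windows", ...
    ("microsoft", re.compile(r"\bMS\s+(?:Office|Windows|SQL|Word|Excel)\b")),
]


def classify_ms(text):
    """Guess what 'MS' means in a snippet, or return 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"


if __name__ == "__main__":
    for snippet in (
        "Contact Ms. Smith in HR",
        "Location: Jackson, MS",
        "MS in Computer Science required",
        "Experience with MS Office",
        "MS preferred",
    ):
        print(snippet, "->", classify_ms(snippet))
```

Because the first matching rule wins, a snippet that contains both "Jackson, MS" and "MS degree" is labeled as a state and the degree is lost. That is exactly the kind of guess-and-refine tradeoff that makes building these lexicons the hard part.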
- show a text and find other texts about the same subject.
- hum a tune and find an mp3 of the same music.
- show a picture and find other pictures of the same girl.
- better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...
Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.
This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."
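As a rough illustration of the "that movie I downloaded a few days ago" kind of query, here is a small Python sketch that filters files purely on metadata (extension and modification time) without looking at content. The function name recent_files and its parameters are hypothetical, and modification time is only a stand-in: most filesystems do not record when a file was downloaded.

```python
import os
import time


def recent_files(root, suffixes, max_age_days, now=None):
    """Return sorted paths under root with a matching suffix and a
    modification time within the last max_age_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Metadata check 1: file extension (case-insensitive)
            if not name.lower().endswith(tuple(suffixes)):
                continue
            path = os.path.join(dirpath, name)
            # Metadata check 2: modification time, used here as an
            # assumed proxy for "when I downloaded it"
            if os.path.getmtime(path) >= cutoff:
                matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    # Example: movie-like files touched in the last 3 days
    for path in recent_files(".", [".avi", ".mov", ".mpg"], 3):
        print(path)
```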
Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.
I relate all this in terms of human interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.
Oh, and Apple called--they'd like their Knowledge Navigator back, please.
Potato chips are a by-yourself food.
It's really a case of using the right tool for the right job. After all, some data is not well expressed in a tree, while some is not well expressed in a relational database. Does this mean it's more right to use one or the other? Too often I see people using XML just because it's new, and not because it actually makes the data easier to work with.
As for the Object hierarchy in Java, it really doesn't limit what you can do with the objects and classes... you can still have a class with no data and only static methods, which is just like a function in C. The nice thing about the automatic Object superclass is that it makes generic, heterogeneous containers really easy to use.