Mining Unstructured Data
jscribner writes "Data these days tends to an unstructured form, be it text (like the web, email, or books), spoken word, or even in DB's with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."
A Machine will be considered truly intelligent when it can translate all emails on slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.
---- El diablo esta en mis pantalones! Mire, mire!
Oooookay.
Sir? Please step away from the bong.
I just spent an enjoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest moments in e-Business. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.
Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.
What the hell are you people smoking?
"Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive" -- hey, that's me!
Interesting, my mining of hot ideas on Slashdot has determined that a Beowulf Cluster of First Posts is the next big thing...
"Good things don't end with eum, they end with mania or teria." - H. Simpson
By definition this is an unsolvable problem, because what it requires is a definition of undefinition (if such a term exists). While you can make assumptions about unstructured data and apply Natural Language rules across it, you are still left with the possibility that you've interpreted it incorrectly. So creating definition in a loose format inherently requires you to assume its meaning. The rate of accuracy can be improved, but absolutes are impossible to attain.
Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it but you are always making assumptions.
To put it in simple terms for George W. Bush:
All Muslims are Terrorists
All Supporters of Militia (McVeigh) are terrorists
Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.
Welcome to 1984, and a Brave New World, the minority will cease to count.
An Eye for an Eye will make the whole world blind - Gandhi
From the article:
One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.
This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that would provide the facility to not only classify, but also search the many different types of XML signatures for each type of resource could prove to be a very valuable thing for businesses.
Imagine the amount of time that could be saved if you could simply search all of those images/diagrams that you have for different projects, and all of the audio clips from the conferences that you have attended, for that key idea that you're sure is in there, but just can't remember where!
-ryan
This sounds interesting - particularly how essentially this is something that makes an unstructured filing system suddenly become a structured filing system. What implications does this have for UK law?
UK data protection states that copies of email have to be kept for a 28 day minimum period. It advises that "email is a transitory medium" and our company person in charge of such policies has just written a policy that says I'm supposed to program our mail systems to auto-delete mail after a three month period. Staff are supposed to save their emails that they want to keep to their local hard drives, as they suddenly become "documents" rather than emails.
Why? Because in the UK any individual can legally ask for copies of any email that mentions them individually by name. Local hard drives can be searched, but only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system", to the point where I would argue that FAT, NTFS and HFS are structured because they utilise indexes. Add to this the new MS Object Oriented Filing System (OFS), which is basically going to be a simplified version of SQL Server as a filing system: is the ability to search data previously considered "unstructured" going to complicate UK law?
This article will have great importance to our director of IT, since the way our company stores data seems to be completely unstructured.
-- Stanislav Shalunov
They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.
I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.
[Skip next section to avoid my self-promotion]
I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.
"email identified by interpretation rather than keywords"
Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.
Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.
Suggestion: Give me a dollar so I can start a business.
These guys have some interesting tech revolving around semantic search... worth a boo anyway...
I am become Troll, destroyer of threads
We used Perl regular expressions and lex/yacc-like tools to tease structured data out of semi-structured web pages and other listings. It's doable if you limit your scope to one particular subject, such as job listings. The hardest part is creating contextual lexicons. Does MS mean that a master's degree is required? The job is located in Mississippi? Experience with Microsoft products is required? The HR contact is Ms. Smith? You have to figure it out based on context. Is MS preceded by a city name, that type of thing.
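A rough sketch of that kind of context check, in Python rather than the Perl and lex/yacc the poster used. The rule list, the labels, and the function name `classify_ms` are all made up for illustration, and the patterns are far too small to be a real lexicon:

```python
import re

# Hypothetical context rules, checked in order. Each pattern looks at the
# words around "MS" to guess which sense is meant. Illustration only.
RULES = [
    # "Ms. Smith": honorific followed by a capitalised surname
    ("honorific", re.compile(r"\bMs\.\s+[A-Z][a-z]+")),
    # "Jackson, MS": capitalised word and a comma before MS suggests a state
    ("state", re.compile(r"\b[A-Z][a-z]+,\s*MS\b")),
    # "MS in Computer Science", "MS degree", "BS/MS": degree vocabulary nearby
    ("degree", re.compile(
        r"\bMS\s+(?:degree|in\b|required|or\s+(?:BS|PhD))"
        r"|\b(?:BS|PhD)\s*(?:/|or)\s*MS\b")),
    # "MS Office", "MS Word": a product name right after MS
    ("microsoft", re.compile(
        r"\bMS\s+(?:Word|Office|Excel|Windows|SQL\s+Server|Access)\b")),
]


def classify_ms(text):
    """Return the label of the first rule whose pattern matches, else 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"


if __name__ == "__main__":
    for line in [
        "Contact Ms. Smith in HR",
        "Location: Jackson, MS",
        "MS in Computer Science required",
        "Experience with MS Office required",
    ]:
        print(classify_ms(line), "<-", line)
```

The ordering matters: the first matching rule wins, so anything the rules don't cover falls through to "unknown" rather than being guessed at.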
And pray your boss hasn't heard of Perl :)
A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.
Use Ctrl-C instead of ESC in Vim!
Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.
Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect it, so therefore attempting to design a structure, a priori, to hold it is always doomed to failure.
The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.
The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.
- show a text and find other texts about the same subject.
- hum a tune and find an mp3 of the same music.
- show a picture and find other pictures of the same girl.
- better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...
Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.
It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old fashioned statistics and optimization.
Amazing magic tricks
which, come to think of it, is what is happening in XML anyhow, you are adding tags in the file instead of having descriptive data outside in the database.
YMMV as far as which method will work better for you.
"It is a greater offense to steal men's labor, than their clothes"
What happens when you don't know what you're even looking for? Data mining is more about ways to automatically find interesting ways of indexing and displaying data than simply looking up known values in unstructured data.
There is a package that is good at displaying unstructured data and letting you see strange patterns, since it has tools to find patterns in the data. It's called Partek.
This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."
Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.
I relate all this in terms of human-interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data-mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.
Oh, and Apple called--they'd like their Knowledge Navigator back, please.
Potato chips are a by-yourself food.
This reminds me of work I did as an undergrad in my advanced database class. At the end of the term we were given group projects to research and present "future" database concepts. One of them was unstructured data. The conclusions drawn were that right now unstructured data has no real value. The value assigned to a particular element can only be assigned by the human who assesses those values. My group was assigned the task of making unstructured data available to standard databases.
Consider:
3822 North Fickle, Frequent Customer - solicit often.
The inherent meaning is obvious to a human. Most likely a street address followed by a remark of some type. For a computer to correctly (and in a real world industry, the demand is that it be 100% correct, always, or forget it) establish the meanings would require some serious AI.
Enter XML. As it exists now, XML is the bridge between unstructured data and data which can be formalized into a structured format. As other people have pointed out, XML does solve some of these issues. Semistructured data via XML is a fairly recent innovation and does a nice job adding meaning and definition to otherwise cryptic (to a computer) strings of ASCII characters.
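To make the point concrete, here is a minimal sketch of the address example once a human has tagged it. The tag names (`customer`, `address`, `number`, `street`, `remark`) are assumptions picked for illustration, not any standard schema, and Python's standard `xml.etree.ElementTree` stands in for whatever XML tooling you prefer:

```python
import xml.etree.ElementTree as ET

# The free-text line "3822 North Fickle, Frequent Customer - solicit often."
# after a person has marked it up. Tag names are hypothetical.
TAGGED = """
<customer>
  <address>
    <number>3822</number>
    <street>North Fickle</street>
  </address>
  <remark>Frequent Customer - solicit often.</remark>
</customer>
"""


def parse_customer(xml_text):
    """Pull the tagged fields into a plain dict a database row could be built from."""
    root = ET.fromstring(xml_text.strip())
    return {
        "number": root.findtext("address/number"),
        "street": root.findtext("address/street"),
        "remark": root.findtext("remark"),
    }


record = parse_customer(TAGGED)
print(record)
```

Once the tags are there the program no longer has to guess where the address ends and the remark begins. All of the guessing was done by whoever wrote the markup.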
Going back to the original problem however, XML must be inserted by a human. Until such time as a machine can *establish* intrinsic value on data, there needs to be an intermediate platform. This doesn't help in the example provided where a company keeps data in a Word document. That data is truly unstructured and random. Human interaction can easily destroy the meaning of a document to a computer, without affecting (and perhaps increasing) the comprehension by other people.
In the near future, it looks like the closest we will truly get is semi-structured, not un-structured data. He who ultimately solves this problem will also solve AI.
This was my final year project thesis. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC = pipes) to add XML tags to English, and then index them into a search engine which would use the lingual data stored in the XML tags to help the search.
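For anyone wondering what "tag, then index" might look like in miniature, here is a toy sketch in Python. It is not the thesis program: the lexicon, the tag names, and the helper functions `tag_sentence` and `build_index` are all invented for illustration, and a real tagger would be far more involved than a dictionary lookup.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy lexicon mapping words to a tag; unknown words get <w>.
LEXICON = {"dog": "noun", "cat": "noun", "chases": "verb", "sees": "verb"}


def tag_sentence(sentence):
    """Wrap each word of the sentence in an XML element named after its tag."""
    root = ET.Element("s")
    for word in re.findall(r"[A-Za-z']+", sentence):
        child = ET.SubElement(root, LEXICON.get(word.lower(), "w"))
        child.text = word
    return root


def build_index(docs):
    """Inverted index from (tag, lowercased word) to the set of document ids."""
    index = defaultdict(set)
    for doc_id, sentence in docs.items():
        for element in tag_sentence(sentence):
            index[(element.tag, element.text.lower())].add(doc_id)
    return index


docs = {
    1: "The dog chases the cat",
    2: "The cat sees the dog",
    3: "The bird sees the cat",
}
index = build_index(docs)

print(ET.tostring(tag_sentence("dog sees"), encoding="unicode"))
print(sorted(index[("verb", "sees")]))
```

Keying the index on the tag as well as the word is what lets a search ask for "sees" used as a verb rather than just the string "sees".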
NIST does a MASSIVE competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go
;-)
It took me 200 hours to fish out all these links (before the Google days), and I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old, so pardon me for the odd broken link. Armed with these you could probably turn jello into XML.
A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?
While I would say that the vast majority of posts on /. are mere discussion, etc - there is a small but useful subset buried deep within that arguably contains useful information, or at the very least would serve as a starting point for further research.
/. - true, there is a ton of SPAM and troll posts, etc to wade through, but that is what we are discussing here - how do you "mine" through the ore to get to that nugget of "gold"?
There are a TON of "Ask Slashdot" articles with very valuable information. I also see in this article many valuable posts (especially that one with the tons of links on machine learning and mining). I also remember some valid ideas bandied about back on the homebrew rollercoaster posting. I also remember seeing information on a posting about mozilla yesterday talking about how to get nice looking fonts in X. Finally, I remember quite some time back (possibly up to 2 years ago) an article on AI, in which two individuals, who seemed to know their shit at minimum, and at best were both neuroscientists - arguing about how neurons worked and how the brain "thinks" - most of that went WAAAY over my head, but it was valuable information (or at least it might be a stepping stone).
I see this all the time here on
Reason is the Path to God - Anon
It's not "training," really. In the demo I saw, they gave VideoLogger four or five video frames that had Saddam Hussein in them, and drew little boxes around his face to identify it, then assigned a keyword to it.
Then they ran some news footage through the system that had other pictures of Hussein in it. VideoLogger picked him out and assigned the keyword "Saddam Hussein" to the clip. It did this on face recognition, not speech or CC recognition, because the video clip was from the Russian TV news!
It was pretty cool, even though it was just a demo.