Mining Unstructured Data
jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), spoken word, or even in DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."
Very interesting read on this (and a little more) over here at UIA. It seems that it's not up to human intellect to structure data; rather, it is easier to recognize information through keywords (true for all acquisition of data, i.e. book reading, watching TV, etc.).
I submitted it to Slashdot a couple of days ago but guess they didn't like my story.
http://www.google.com/search?q=learn+perl+for+h
(remove the silly space in "humanities"
perl and a lot of thinking, that is.
Of course, this isn't limited to web-based e-mail; there are parsers for web-based forums and bulletin boards, yes, even for Slashdot. The unstructured data here can be converted and served via NNTP (NetNews) or some other method.
There's a huge amount of unstructured data and services available on the Web; making these available to computers is a huge step forward in information technology.
This is a fascinating article. I'm especially interested in it, because I tend to work with databases - most of which are created from completely unstructured data.
For instance, Company A is tracking all their data in a Microsoft Word document. Frequently I get asked to dynamically work with this data, and pull it directly out programmatically. I can attest to how difficult this can be sometimes, and I frequently find that upper management doesn't understand the challenges behind pulling unstructured data out.
I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand, and *poof* the data moves from his text file into a database.
A Machine will be considered truly intelligent when it can translate all emails on slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.
---- The devil is in my pants! Look, look!
Oooookay.
Sir? Please step away from the bong.
I just spent an enjoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest moments in e-Business. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.
Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.
What the hell are you people smoking?
"Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive" -- hey, that's me!
Interesting, my mining of hot ideas on Slashdot has determined that a Beowulf Cluster of First Posts is the next big thing...
"Good things don't end with eum, they end with mania or teria." - H. Simpson
By definition this is an unsolvable problem, because what it requires is definition of undefinition (if such a term exists). While you can make assumptions on unstructured data and apply Natural Language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved, but absolutes are impossible to attain.
Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it, but you are always making assumptions.
To put it in simple terms for George W. Bush
All Muslims are Terrorists
All Supporters of Militia (McVeigh) are terrorists
Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.
Welcome to 1984, and a Brave New World, the minority will cease to count.
An Eye for an Eye will make the whole world blind - Gandhi
From the article:
One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.
This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that would provide the facility to not only classify, but also search the many different types of XML signatures for each type of resource could prove to be a very valuable thing for businesses.
Imagine the amount of time that could be saved if you could simply search all of those images/diagrams that you have for different projects, and all of the audio clips from the conferences that you have attended, for that key idea that you're sure is in there, but just can't remember where!
-ryan
This sounds interesting - particularly how essentially this is something that makes an unstructured filing system suddenly become a structured filing system. What implications does this have for UK law?
UK data protection states that copies of email have to be kept for a 28 day minimum period. It advises that "email is a transitory medium" and our company person in charge of such policies has just written a policy that says I'm supposed to program our mail systems to auto-delete mail after a three month period. Staff are supposed to save their emails that they want to keep to their local hard drives, as they suddenly become "documents" rather than emails.
Why? Because in the UK any individual can legally ask for copies of any email that mentions them individually by name. Local hard drives can be searched, however this is only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system" to the point where I would argue that FAT, NTFS and HFS are structured due to the fact they utilise indexes. Add to this the new MS Object Oriented Filing System (OFS), which is basically going to be a simplified version of SQL Server as a filing system. Is the ability to search previously considered "unstructured" data going to complicate the UK law?
This article will have great importance to our director of IT, since the way our company stores data seems to be completely unstructured.
Why is it that the very thought of mining Slashdot makes me think of the goatse.cx guy?
Hell, I've really got to stop reading at -1 so much.
--saint
-- Stanislav Shalunov
They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.
I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.
[Skip next section to avoid my self-promotion]
I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.
"email identified by interpretation rather than keywords"
Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.
Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.
Suggestion: Give me a dollar so I can start a business.
Email identified by interpretation rather than keywords? Does that mean email addresses can be identified in the same way? Surely that'd mean spamblocks wouldn't work any more.
Roadkill is yummy.
These guys have some interesting tech revolving around semantic search... worth a boo anyway...
I am become Troll, destroyer of threads
We used Perl regular expressions and lex/yacc-like
tools to tease structured data out of
semi-structured web pages and other listings. It's
doable if you limit your scope to one particular
subject, such as job listings. The hardest part
is creating contextual lexicons. Does MS mean
that a master's degree is required? The job is
located in Mississippi? Experience with Microsoft
products is required? The HR contact is Ms.
Smith? You have to figure it out based on
context. Is MS preceded by a city name, that type
of thing.
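A rough sketch of the "figure it out based on context" idea, in Python rather than the Perl/lex tools mentioned above. The rule list, the labels, and the function name classify_ms are all made up for illustration; a real contextual lexicon would need far more patterns and would still guess wrong sometimes.

```python
import re

# Hypothetical context rules, checked in order. Patterns and labels are
# illustrative assumptions, not the lexicon the poster actually used.
RULES = [
    ("state", re.compile(r"\b[A-Z][a-z]+,\s*MS\b")),                 # "Jackson, MS"
    ("title", re.compile(r"\bMs\.\s+[A-Z][a-z]+")),                   # "Ms. Smith"
    ("degree", re.compile(r"\bMS\s+(?:degree|in)\b")),                # "MS in CS"
    ("vendor", re.compile(r"\bMS\s+(?:Office|Windows|SQL|Word|Excel)\b")),
]


def classify_ms(text):
    """Return the label of the first rule that matches, or 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"


if __name__ == "__main__":
    for line in ("Location: Jackson, MS",
                 "Contact Ms. Smith in HR",
                 "MS in Computer Science required",
                 "Experience with MS Office required",
                 "MS preferred"):
        print(line, "->", classify_ms(line))
```

The last sample line falls through to "unknown", which is the honest answer when the surrounding words give no hint.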
Insightful has a cool software package called inFact. http://www.insightful.com/solutionslibrary/TextImageMining/inFactInfoExtraction/Information_bizcase.asp
A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.
Use Ctrl-C instead of ESC in Vim!
Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.
Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect it, so therefore attempting to design a structure, a priori, to hold it is always doomed to failure.
The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.
The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.
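For what it's worth, the squelch-knob idea is easy to sketch: drop anything below a relevance threshold and cap what's left. This assumes each hit is a (document, score) pair with scores between 0 and 1; the function name squelch and the default values are invented for illustration and have nothing to do with how ISYS or Google actually rank.

```python
def squelch(results, threshold=0.5, limit=10):
    """Keep hits scoring at or above threshold, best first, at most `limit`.

    `results` is assumed to be an iterable of (document, score) pairs.
    """
    kept = [hit for hit in results if hit[1] >= threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:limit]


if __name__ == "__main__":
    hits = [("a.txt", 0.9), ("b.txt", 0.2), ("c.txt", 0.7)]
    print(squelch(hits, threshold=0.5))
```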
Anyone care to analyse the relationships between Debian package dependencies?
My current understanding is:
There are 8 types of dependencies: 5 that are enforced (pre-depends, depends, conflicts, replaces, provides) and 3 optional types (suggests, recommends, enhances).
Two packages can depend or pre-depend on each other.
A package can conflict, provide or replace itself.
The data structure must be:
- directional
- cyclical (self-loops)
- multigraph (parallel edges of different dependency types)
So it's a multidigraph (I think)
I'd like to analyse the entire dependency graph and do all sorts of checks to analyse each release as a whole rather than on a package by package basis.
I have a good book (Algo. in C Part 5 Graph Algorithms by Sedgewick). But still not sure how to analyse, or represent (in a more structured way) this beast.
Anyone have any hints?
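One possible hint, as a minimal sketch only: a directed multigraph can be kept as a dict of dicts of sets, so parallel edges of different types and self-loops both fall out naturally. The class name DepGraph, its methods, and the package names below are invented for illustration; nothing here reads real Debian control files.

```python
from collections import defaultdict


class DepGraph:
    """Directed multigraph: edges[src][dst] is the set of relation types."""

    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, src, relation, dst):
        self.edges[src][dst].add(relation)
        self.edges[dst]  # touch dst so it exists as a node even with no out-edges

    def self_loops(self):
        """(package, relation) pairs where a package points at itself."""
        return sorted((pkg, rel)
                      for pkg in self.edges
                      for rel in self.edges[pkg].get(pkg, ()))

    def mutual(self, relations=("depends", "pre-depends")):
        """Pairs of distinct packages related to each other in both directions."""
        wanted = set(relations)
        pairs = set()
        for a, targets in self.edges.items():
            for b, rels in targets.items():
                back = self.edges.get(b, {}).get(a, set())
                if a < b and rels & wanted and back & wanted:
                    pairs.add((a, b))
        return sorted(pairs)


if __name__ == "__main__":
    g = DepGraph()
    # Hypothetical packages, not taken from any real release.
    g.add("libfoo", "depends", "libbar")
    g.add("libbar", "depends", "libfoo")
    g.add("libfoo", "suggests", "libbar")
    g.add("mta", "provides", "mta")
    g.add("mta", "conflicts", "mta")
    print(g.self_loops())
    print(g.mutual())
```

Release-wide checks (unsatisfiable depends, conflict cliques, and so on) would then be ordinary graph traversals over edges, filtered by relation type.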
Isn't XML only part of the solution? I'm pretty sure RDF comes into play somewhere here.
That man tried to kill mah Daddy
- show a text and find other texts about the same subject.
- hum a tune and find an mp3 of the same music.
- show a picture and find other pictures of the same girl.
- better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...
Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.
It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old fashioned statistics and optimization.
Amazing magic tricks
which, come to think of it, is what is happening in XML anyhow, you are adding tags in the file instead of having descriptive data outside in the database.
YMMV as far as which method will work better for you.
"It is a greater offense to steal men's labor, than their clothes"
ATHBT
What happens when you don't know what you're even looking for? Data mining is more about ways to automatically find interesting ways of indexing and displaying data than simply looking up known values in unstructured data.
There is a package that is good at displaying unstructured data and letting you see strange patterns since it has tools to find patterns in the data. It's called Partek.
Unstructured, like [Everything2]?
It was a pretty lousy article. Did the editor even skim it before posting this?
This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."
Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.
I relate all this in terms of human-interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data-mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.
Oh, and Apple called--they'd like their Knowledge Navigator back, please.
Potato chips are a by-yourself food.
There is a field dealing with this. It's called Knowledge Discovery in Databases (KDD). It's been around for a few years now. Go here for a slightly more technical overview. The posted article is aimed more toward the business people rather than the technical people.
This reminds me of work I did as an undergrad in my advanced database class. At the end of the term we were given group projects to research and present "future" database concepts. One of them was unstructured data. The conclusions drawn were that right now unstructured data has no real value. The value assigned to a particular element can only be assigned by the human who assesses those values. My group was assigned the task of making unstructured data available to standard databases.
Consider:
3822 North Fickle, Frequent Customer - solicit often.
The inherent meaning is obvious to a human. Most likely a street address followed by a remark of some type. For a computer to correctly (and in a real world industry, the demand is that it be 100% correct, always, or forget it) establish the meanings would require some serious AI.
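To make the difficulty concrete, here is a minimal sketch of the naive approach: a single regular expression that assumes the layout "number street, remark". The pattern and the name parse_record are assumptions for illustration, and the second sample shows how quickly a fixed pattern breaks when a human writes the same facts in a different order.

```python
import re

# Assumed layout: "<house number> <street>, <free-text remark>".
RECORD = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,\s*(?P<remark>.+?)\s*$"
)


def parse_record(line):
    """Return a dict of fields, or None when the line doesn't fit the layout."""
    match = RECORD.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_record("3822 North Fickle, Frequent Customer - solicit often."))
    print(parse_record("Frequent customer, lives at 3822 North Fickle"))
```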
Enter XML. As it exists now, XML is the bridge between unstructured data and data which can be formalized into a structured format. As other people have pointed out, XML does solve some of these issues. Semistructured data via XML is a fairly recent innovation and does a nice job adding meaning and definition to otherwise cryptic (to a computer) strings of ASCII characters.
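As a small illustration of what the tags buy you, the same record becomes trivially searchable once a human (or a program) has marked it up. The element names customers, customer, address and remark are invented here, and the second record is made-up sample data; the parsing uses Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the tag names are an assumption, not a standard schema.
doc = ET.fromstring(
    "<customers>"
    "<customer><address>3822 North Fickle</address>"
    "<remark>Frequent Customer - solicit often.</remark></customer>"
    "<customer><address>17 South Elm</address>"
    "<remark>Do not call.</remark></customer>"
    "</customers>"
)


def addresses_with_remark(root, word):
    """Addresses of customers whose remark contains `word` (case-insensitive)."""
    return [c.findtext("address")
            for c in root.iter("customer")
            if word.lower() in (c.findtext("remark") or "").lower()]


if __name__ == "__main__":
    print(addresses_with_remark(doc, "frequent"))
```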
Going back to the original problem however, XML must be inserted by a human. Until such time as a machine can *establish* intrinsic value on data, there needs to be an intermediate platform. This doesn't help in the example provided where a company keeps data in a Word document. That data is truly unstructured and random. Human interaction can easily destroy the meaning of a document to a computer, without affecting (and perhaps increasing) the comprehension by other people.
In the near future, it looks like the closest we will truly get is semi-structured, not un-structured data. He who ultimately solves this problem will also solve AI.
FlipDog crawls the web and uses machine-learning technology to extract job listings from companies' web sites. You can then browse through them at their web site, filtering based on location, position category, etc.
The technology was developed by WhizBang Labs, and is quite cool. They basically take a small set of job listings that their crawler finds, have a human classify parts of the web page (job title, location, description, etc.) and then let their software program loose on it. It analyzes the human-filtered web pages, "learns" how to extract relevant data, and then uses that to classify all of the other crawled pages. Of course, this is over-simplified, but that's the basic idea.
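To give a feel for "learn from a few human-labeled examples, then classify the rest", here is a toy sketch. It is emphatically not WhizBang's method, which isn't described here; it swaps in a plain naive Bayes word model with add-one smoothing and uniform priors, and the class name FieldClassifier and the training snippets are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercased alphabetic words in `text`."""
    return re.findall(r"[a-z]+", text.lower())


class FieldClassifier:
    """Toy naive Bayes over words: add-one smoothing, uniform class priors."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, labeled):
        for text, label in labeled:
            for tok in tokens(text):
                self.counts[label][tok] += 1
                self.vocab.add(tok)

    def classify(self, text):
        def score(label):
            words = self.counts[label]
            total = sum(words.values()) + len(self.vocab)
            return sum(math.log((words[tok] + 1) / total) for tok in tokens(text))
        return max(self.counts, key=score)


if __name__ == "__main__":
    clf = FieldClassifier()
    # Hypothetical human-labeled snippets.
    clf.train([
        ("Senior Software Engineer", "title"),
        ("Database Administrator", "title"),
        ("Web Developer", "title"),
        ("Seattle, WA", "location"),
        ("Provo, Utah", "location"),
        ("Austin, TX", "location"),
    ])
    print(clf.classify("Software Developer"))
    print(clf.classify("Austin, Utah"))
```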
(I'm not affiliated with them, other than successfully finding a job there.)
--Bruce
There are 10 kinds of people in the world: those who understand binary, and those who don't.
This article refers to something called KDD. Knowledge Discovery in Databases [KDD] - like its counterpart Knowledge Discovery in Texts [KDT] - is a whole field of computer science. It's been around for more than 20 years now. The ACM even has a Special Interest Group: ACM-SIGKDD.
For an introduction you should read: _Introduction to Machine Learning_ (Kodratoff, Yves; Morgan Kaufmann Pub; 1988) or for a more recent and complete survey: _Advances in Knowledge Discovery and Data Mining_ (Fayyad, Usama; AAAI/MIT Press; 1996).
This was my final year project thesis. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC=pipes) to add XML tags to English, and then index them into a search engine which would use the lingual data stored in the XML tags to help the search.
NIST does a MASSIVE competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go
;-)
It took me 200 hours to fish out all these links (before the Google days); I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old, so pardon me for the odd broken link. Armed with these you could probably turn jello into XML.
A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?
"One financial institution is using IBM ViaVoice® voice recognition software to convert complaint calls--considered unstructured data--into text."
That's interesting, because the IntelliStation I bought a while back came with ViaVoice, and it was excellent at converting unstructured voice data to unstructured text.
I have worked with Discovery Link. It contains wrappers around heterogeneous database sources, like Oracle and flat text files, and tries to integrate everything into a single representation.
In life sciences data sources are huge and plentiful. This thing is a monster, it's slow and it needs lots of dedicated people integrating and maintaining it. I'm not even talking about the (IBM) hardware you need for this.
No, I'm a pragmatic guy. I will integrate on the fly whatever I need to know. The idea is nice and all, but it is unworkable at the moment.
IANAL, but imagine a beowulf cluster of in Soviet Russia all your belong are base to us welcoming the new SCO overlords.
While I would say that the vast majority of posts on /. are mere discussion, etc - there is a small but useful subset buried deep within that arguably contains useful information, or at the very least would serve as a starting point for further research.
/. - true, there is a ton of SPAM and troll posts, etc to wade through, but that is what we are discussing here - how do you "mine" through the ore to get to that nugget of "gold"?
There are a TON of "Ask Slashdot" articles with very valuable information. I also see in this article many valuable posts (especially that one with the tons of links on machine learning and mining). I also remember some valid ideas bandied about back on the homebrew rollercoaster posting. I also remember seeing information on a posting about mozilla yesterday talking about how to get nice looking fonts in X. Finally, I remember quite some time back (possibly up to 2 years ago) an article on AI, in which two individuals, who seemed to know their shit at minimum, and at best were both neuroscientists - arguing about how neurons worked and how the brain "thinks" - most of that went WAAAY over my head, but it was valuable information (or at least it might be a stepping stone).
I see this all the time here on
Reason is the Path to God - Anon
How can it know who is Saddam?