Slashdot Mirror


Mining Unstructured Data

jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), spoken word, or even DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."

5 of 105 comments

  1. answer is perl by Anonymous Coward · · Score: 1, Interesting


    http://www.google.com/search?q=learn+perl+for+humanities+student+data+mining

    perl and a lot of thinking, that is.

  2. I've been working on a project... by Anonymous Coward · · Score: 1, Interesting
    ...it basically converts human-readable (unstructured) data into computer-readable, structured data. Parsers are in the works for converting unstructured services into standard services; for example, the inboxes of Yahoo, Lycos, Mailcity, Excite, etc. are converted to an internal form, which is later served via a POP3 server.

    Of course, this isn't limited to web-based e-mail; there are also parsers for web-based forums and bulletin boards, yes, even for Slashdot. The unstructured data here can be converted and served via NNTP (NetNews) or some other method.

    There's a huge amount of unstructured data and services available on the Web; making these available to computers is a huge step forward in information technology.
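
    A minimal sketch of the kind of parser described above, assuming a hypothetical webmail inbox page that marks each message as a `<tr class="msg">` row with `from` and `subject` cells. The markup, the `Message` record, and the `parse_inbox` helper are all illustrative assumptions, not the poster's actual code; a real site would need its own rules, and serving the result over POP3 is left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Message:
    """Internal, structured form of one inbox entry (hypothetical)."""
    sender: str = ""
    subject: str = ""


class InboxParser(HTMLParser):
    """Collects one Message per <tr class="msg"> row (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._current = None  # Message being filled while inside a row
        self._field = None    # Message attribute currently being read

    def handle_starttag(self, tag, attrs):
        css = dict(attrs).get("class")
        if tag == "tr" and css == "msg":
            self._current = Message()
        elif tag == "td" and self._current is not None:
            if css == "from":
                self._field = "sender"
            elif css == "subject":
                self._field = "subject"

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._current is not None and self._field is not None:
            old = getattr(self._current, self._field)
            setattr(self._current, self._field, old + data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current is not None:
            self._current.sender = self._current.sender.strip()
            self._current.subject = self._current.subject.strip()
            self.messages.append(self._current)
            self._current = None


def parse_inbox(html):
    """Turn an inbox page into a list of Message records."""
    parser = InboxParser()
    parser.feed(html)
    parser.close()
    return parser.messages


SAMPLE = """
<table>
  <tr class="msg"><td class="from">alice@example.com</td><td class="subject">Lunch?</td></tr>
  <tr class="msg"><td class="from">bob@example.com</td><td class="subject">Re: parsers</td></tr>
</table>
"""

if __name__ == "__main__":
    for msg in parse_inbox(SAMPLE):
        print(msg.sender, "|", msg.subject)
```

    The design point is that every site gets its own small parser, but all of them emit the same internal record, so whatever serves the data only has to understand one form.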

  3. UK data protection law by skinfitz · · Score: 2, Interesting

    This sounds interesting, particularly because it essentially turns an unstructured filing system into a structured one. What implications does this have for UK law?
    UK data protection law states that copies of email have to be kept for a 28-day minimum period. It advises that "email is a transitory medium", and the person at our company in charge of such policies has just written one that says I'm supposed to program our mail systems to auto-delete mail after a three-month period. Staff are supposed to save the emails they want to keep to their local hard drives, as they then become "documents" rather than emails.
    Why? Because in the UK any individual can legally ask for copies of any email that mentions them individually by name. Local hard drives can be searched, but only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system", to the point where I would argue that FAT, NTFS and HFS are structured because they utilise indexes. Add to this the new MS Object Oriented Filing System (OFS), which is basically going to be a simplified version of SQL Server as a filing system: is the ability to search data previously considered "unstructured" going to complicate UK law?

  4. good title, but mismatched content by candot · · Score: 3, Interesting

    They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.

    I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.

    [Skip next section to avoid my self-promotion]

    I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.

  5. Semi structured data, likely the way. by DutchSter · · Score: 2, Interesting

    This reminds me of work I did as an undergrad in my advanced database class. At the end of the term we were given group projects to research and present "future" database concepts. One of them was unstructured data. The conclusion drawn was that right now unstructured data has no real value on its own: the value of a particular element can only be assigned by the human who assesses it. My group was assigned the task of making unstructured data available to standard databases.

    Consider:
    3822 North Fickle, Frequent Customer - solicit often.

    The inherent meaning is obvious to a human: most likely a street address followed by a remark of some type. For a computer to establish the meanings correctly (and in real-world industry, the demand is that it be 100% correct, always, or forget it) would require some serious AI.
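
    To make the difficulty concrete, here is a rough sketch of the naive approach, assuming a single hand-written rule: a leading number, then street words up to the first comma, then a remark. The pattern and the `guess_fields` helper are illustrative assumptions only. The rule handles the example line, but it labels any line of the same shape as an address, which is exactly why it falls short of "100% correct".

```python
import re

# Heuristic (assumed, not authoritative): house number, street name up to the
# first comma, and everything after the comma treated as a free-text remark.
RECORD = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,\s*(?P<remark>.+?)\s*$"
)


def guess_fields(line):
    """Return a dict of guessed fields, or None if the line does not fit."""
    match = RECORD.match(line)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    # Fits the rule and happens to be interpreted as a human would read it.
    print(guess_fields("3822 North Fickle, Frequent Customer - solicit often."))
    # Also fits the rule, but "units shipped" is clearly not a street.
    print(guess_fields("3822 units shipped, invoice pending"))
```

    The second call shows the failure mode: the pattern matches on shape alone, so the guess is only as good as the assumption that every such line is an address.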

    Enter XML. As it exists now, XML is the bridge between unstructured data and data which can be formalized into a structured format. As other people have pointed out, XML does solve some of these issues. Semi-structured data via XML is a fairly recent innovation and does a nice job of adding meaning and definition to otherwise cryptic (to a computer) strings of ASCII characters.
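
    As an illustration, the customer line from earlier could be marked up as below. This is only a sketch: the element names (`customer`, `address`, `number`, `street`, `remark`) and the `load_customer` helper are hypothetical, not part of any standard schema. Once the tags are there, Python's standard `xml.etree.ElementTree` module can read the fields without guessing.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the example record; a person chose these tags.
RECORD_XML = """
<customer>
  <address>
    <number>3822</number>
    <street>North Fickle</street>
  </address>
  <remark>Frequent Customer - solicit often.</remark>
</customer>
"""


def load_customer(xml_text):
    """Read the tagged fields into a plain dict ready for a database row."""
    root = ET.fromstring(xml_text)
    return {
        "number": int(root.findtext("address/number")),
        "street": root.findtext("address/street"),
        "remark": root.findtext("remark"),
    }


if __name__ == "__main__":
    print(load_customer(RECORD_XML))
```

    Note that the program does no interpretation here; all of the meaning was supplied by whoever wrote the tags.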

    Going back to the original problem, however, XML must be inserted by a human. Until such time as a machine can *establish* the intrinsic value of data, there needs to be an intermediate platform. This doesn't help in the example provided, where a company keeps data in a Word document. That data is truly unstructured and random. Human interaction can easily destroy the meaning of a document to a computer without affecting (and perhaps while increasing) its comprehension by other people.

    In the near future, it looks like the closest we will truly get is semi-structured, not unstructured, data. He who ultimately solves this problem will also solve AI.