Slashdot Mirror


Mining Unstructured Data

jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), spoken word, or even DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."

6 of 105 comments

  1. Good use of XML by soap.xml · · Score: 2, Informative

    From the article:

    One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.
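
    To make the quoted idea concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The element names (memo, project, topic) and the sample text are made up for illustration and do not come from the article; the point is only that once the salient parts are tagged, searching becomes a walk over the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document: the tags mark the salient parts of free text.
DOC = """<memo>
  <author>ryan</author>
  <project>Falcon</project>
  <body>Notes from the <topic>indexing</topic> talk, plus a diagram of the
  <topic>storage</topic> layout.</body>
</memo>"""

def find_tagged(xml_text, tag):
    """Return the text of every element named `tag`, anywhere in the tree."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree in document order, so nesting depth does not matter.
    return [el.text for el in root.iter(tag)]

print(find_tagged(DOC, "topic"))
```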

    This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that would provide the facility to not only classify, but also search the many different types of XML signatures for each type of resource could prove to be a very valuable thing for businesses.

    Imagine the amount of time that could be saved if you could simply search all of those images/diagrams that you have for different projects, and all of the audio clips from the conferences that you have attended, for that key idea that you're sure is in there but just can't remember where!

    -ryan
  2. Google Made to Order by shalunov · · Score: 3, Informative
    Some quotes from the press release:
    People actually vote their preferences by providing links to different documents. You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something.
    This Discoverylink(TM) search engine concept somehow sounds very familiar. Where could I have heard this innovative idea before? Or, as the press release asks, "Where did I read that?" Ah, yes!
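    The "links as votes" idea in the quote can be sketched in a few lines of Python. This is plain inbound-link counting over a made-up link graph, not the press release's actual algorithm and not PageRank; the page names and the LINKS table are hypothetical.

```python
from collections import defaultdict

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def inbound_counts(links):
    """Count how many distinct pages link to each page (one vote per linking page)."""
    counts = defaultdict(int)
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # a page cannot vote for itself
                counts[target] += 1
    return dict(counts)

def rank_by_votes(links):
    """Pages sorted by inbound-link count, most linked first; ties broken by name."""
    counts = inbound_counts(links)
    pages = set(links) | set(counts)
    return sorted(pages, key=lambda p: (-counts.get(p, 0), p))

print(rank_by_votes(LINKS))
```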
  3. Nat. Language Understanding != Speech Recognition by thelenm · · Score: 3, Informative

    A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.

    --
    Use Ctrl-C instead of ESC in Vim!
  4. Polymorphic Searching by waimate · · Score: 2, Informative
    Of all the information stored in computers, 80% of it is unstructured, and arguably it's the most valuable 80%, too.

    Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.

    Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect them, so attempting to design a structure, a priori, to hold it is always doomed to failure.

    The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.

    The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.
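
    A rough sketch of the squelch knob in Python, assuming the search backend already hands back (document, score) pairs with scores between 0 and 1. The function name, the threshold, and the sample hits are all hypothetical; no particular product works exactly this way.

```python
def squelch(results, threshold=0.5, limit=10):
    """Keep only results scoring at or above `threshold`, best first, at most `limit`."""
    kept = [(doc, score) for doc, score in results if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]

# Hypothetical (document, relevance score) pairs from some search backend.
hits = [("memo.txt", 0.91), ("spam.eml", 0.12), ("plan.xls", 0.67), ("misc.doc", 0.30)]
print(squelch(hits))
```

    Turning the threshold up trades recall for a shorter, cleaner list, which is the whole point of the knob.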

  5. creative uses by rnd() · · Score: 3, Informative
    There are some companies that are doing some creative things with this kind of technology.

    It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old-fashioned statistics and optimization.

    --

    Amazing magic tricks

  6. This was my final year project thesis by Beliskner · · Score: 2, Informative

    This was my final year project thesis. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC=pipes) to add XML tags to English, and then index them into a search engine which would use the lingual data stored in the XML tags to help the search.
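
    For anyone wondering what "tag it, then index it" looks like, here is a toy sketch in Python. It is not the thesis program: the five-word LEXICON, the element names s and w, and the pos attribute are assumptions made up for illustration, and a real tagger would be far richer. Each word is wrapped in an XML element carrying its part of speech, and an inverted index is then keyed on (word, pos) so a search can use the linguistic data.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny hypothetical lexicon standing in for a real part-of-speech tagger.
LEXICON = {"dog": "noun", "cat": "noun", "chases": "verb", "sees": "verb", "the": "det"}

def tag_sentence(sentence):
    """Wrap each word in a <w pos="..."> element under an <s> root."""
    root = ET.Element("s")
    for word in re.findall(r"[A-Za-z]+", sentence.lower()):
        el = ET.SubElement(root, "w", pos=LEXICON.get(word, "unknown"))
        el.text = word
    return root

def build_index(sentences):
    """Inverted index mapping (word, pos) to the ids of the sentences containing it."""
    index = defaultdict(set)
    for doc_id, sentence in enumerate(sentences):
        for el in tag_sentence(sentence).iter("w"):
            index[(el.text, el.get("pos"))].add(doc_id)
    return index

docs = ["The dog chases the cat", "The cat sees the dog", "The bird sees the cat"]
index = build_index(docs)
print(ET.tostring(tag_sentence(docs[0]), encoding="unicode"))
print(sorted(index[("sees", "verb")]))
```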

    NIST does a MASSIVE competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go

    It took me 200 hours to fish out all these links (before the Google days), and I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old, so pardon me for the odd broken link. Armed with these you could probably turn jello into XML ;-)

    My favourite bookmarx
    PROJect[21 links]
    Beginners' Guide[13 links]
    Berkeley Linguistics Dept. Course Summaries, general stuff
    Cryptic IR Vocabulary defined
    Explanations of weird words like hypernym
    How do we produce and understand speech
    How Inverted Files are Created - University of Berkeley
    NLP Univ. of Indiana, very good basics e.g. word sense d
    Simple language - useful....
    What is Natural Language Processing, links
    What is POS tagging........
    Word Sense Disambiguation defined
    Word Sense Disambiguation in detail, scroll down far
    Word Sense Disambiguator - LOLITA (tested at MUC-7 and SENSEVAL competition as best)
    XML for the absolute beginner

    HTML, XML stuff + parsers[19 links]
    Apache plug-in that uhhh does stuff with XML
    Convert COM to XML
    convert XML, HTML to Unix pipeable formats
    converters to and from HTML
    expat XML parser
    HTML Tidy - converts HTML 2 XML + source code!!
    Parse DB (RDBMS, whatever) to XML
    Perl-XML Module List
    PHP Manual XML parser functions - what the hell are they talking about, PHP Virtual M...
    Public SGML-XML Software
    Pyxie - XML Processor for Python, Perl, etc.
    SGML+XML tools.org
    The XML Resource Centre - massive number of links
    W4F wrapper - wrapper converts XML to HTML
    XFlat - convert flat file into XML
    XML Parsers and other XML stuff
    XML.com - Parsers, etc.
    XML-Data Catalog System - uhhhh looks close
    XTAL's general converter - convert anything 2 XML

    other Background[8 links]
    Is Linux ready for the Enterprise, scalable...
    Linux reliability
    Linux Versus Windows NT, Mark(sysinternals bloke)
    PC reliability (pcworld)
    SPEC - Standard Performance Evaluation Corp.
    Systems benchmarks
    TPC - Transaction Processing Performance Council
    Unix Beats Back NT In EDA Workstation Arena
    Proper TREC(-8) QA systems[2 links]

    pg. 387 LIMSI-CNRS pretty deep parsing[2 links]
    More links....
    NLP, IR links - lots to corpii, etc.

    pg. 575 U. of Ottawa and NRL (shit system, got 0%)[1 links]
    LAKE Lab
    pg. 607! University of Sheffield (crap system, but OPEN SOURCE!)[2 links]
    GATE - FREE IE app w/ source code
    LaSIE - ER, coreference, template (cv)

    pg. 617 Univ of Surrey (inconclusive matches)[2 links]
    System Quirk - Or is this their search system..... Hmmmmmm
    Univ of Surrey - pointers (hopefully this is their WILDER search system...)

    SMU - Pg. 65[1 links]
    Natural Language Processing Laboratory at SMU

    Textract[2 links]
    Cymfony - Technology
    Textract - State of the Art Information Extraction

    Xerox uhhhhh maybe[1 links]
    Xerox Palo Alto Research Center
    (OVERVIEW) 1999 TREC-8 Q&A Track Home Page
    NLP bloke, Univ Sussex


    Tcl-Tk[4 links]
    Tcl tutorial
    Tcl-Tk Contributed Programs Index
    Tcl-Tk Resources, sources
    TclXML - manipulating XML using Tcl-Tk
    Artificial Natural Language - Is this what I'm trying to parse into...
    Comparison of Indexers - Prise vs. Inquery vs. MG, etc.
    Eagles - Language Engineering Standards
    Language Technology Group - lots of modules!
    LDC - Linguistic Data Consortium, lots of corpora
    Lexical Resources
    Links 2 resources, indexers.....
    Lots of IR stuff, University of uhhh
    Managing Gigabytes Indexer
    Managing Gigabytes Manuals and stuff
    Htdig search system
    NLP & IR (NLPIR, NIST) Group
    OVERVIEW OF MUC-7-MET-2
    Perl XML Indexing - XML search engine type thing
    Phrasys Language Processing Software Components (money)
    QA HCI bullshit
    SIGIR - TREC-type thing, resources
    SMART indexer system documentation
    Text REtrieval Conference (TREC) Home Page
    The Natural Language Software Registry
    Thunderstone IE and IR products
    WordNet - FREE DOWNLOADABLE lexical English database

    Page created with URL+, nice utility for working with internet shortcuts
    --
    A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?