Slashdot Mirror


Mining Unstructured Data

jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), spoken word, or even DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."

37 of 105 comments

  1. They've discovered Google! by twitchkat · · Score: 2, Insightful
    They'd better get ready to pay some Google patent licensing fees:
    People also make their feelings known in less direct ways, says Jhingran. "People actually vote their preferences by providing links to different documents," he explains. "You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something." Businesses could use such analytical capability to determine the "buzz" about their products found in chat rooms and forums on the Internet.
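
    A minimal sketch of the links-as-votes idea quoted above, assuming a tiny made-up link graph. It only counts inbound links per page; it is not Google's ranking or IBM's implementation, and the page names and the count_votes helper are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: each pair is (page that links, page linked to).
links = [
    ("page_a", "page_b"),
    ("page_c", "page_b"),
    ("page_d", "page_b"),
    ("page_a", "page_c"),
]


def count_votes(link_pairs):
    """Count inbound links per page, ignoring duplicate links and self-links."""
    unique = {(src, dst) for src, dst in link_pairs if src != dst}
    return Counter(dst for _, dst in unique)


votes = count_votes(links)
# Pages ordered from most to fewest inbound links.
ranking = [page for page, _ in votes.most_common()]
```

    Real link-analysis systems also weigh who is doing the linking, so a raw count like this is only the starting point.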
  2. /. as a Turing Test by bravehamster · · Score: 5, Funny
    email identified by interpretation rather than keywords


    A Machine will be considered truly intelligent when it can translate all emails on slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.

    --
    ---- The devil is in my pants! Look, look!
  3. "Slashdot as a minable database of ideas..." by theonomist · · Score: 4, Funny

    Oooookay.

    Sir? Please step away from the bong.

    I just spent an enjoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest moments in e-Business. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.

    Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.

    What the hell are you people smoking?

    --
    "Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive" -- hey, that's me!
    1. Re:"Slashdot as a minable database of ideas..." by ahde · · Score: 2

      What that list doesn't tell you is that all those stupid stock market analysts were doubling their money every month or less, because the rest of us suckers bought their hype and false predictions. Paine Webber, Merrill Lynch, Credit Suisse, Oppenheimer & Co. consolidated the largest percentage of the world's money since the London Bay Company & East India Company. And they haven't lost a penny of it. (WTC offices were insured)

    2. Re:"Slashdot as a minable database of ideas..." by MikeBabcock · · Score: 2

      It depends on the article ... when was the last time you read every article in a given week and every message attached thereto?

      _Sometimes_ unique or semi-unique or thought-provoking ideas get stated. That's the nature of chat rooms and discussion boards. USENET has unique ideas as well and is much more spam-filled and useless looking on first glance.

      --
      - Michael T. Babcock (Yes, I blog)
  4. Slashdot by rbgaynor · · Score: 4, Funny

    Interesting, my mining of hot ideas on Slashdot has determined that a Beowulf Cluster of First Posts is the next big thing...

    --
    "Good things don't end with eum, they end with mania or teria." - H. Simpson
  5. Doomed Doomed we're all doomed by MosesJones · · Score: 3, Insightful

    By definition this is an unsolvable problem, because what it requires is definition of undefinition (if such a term exists). While you can make assumptions on unstructured data and apply Natural Language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved but absolutes are impossible to attain.

    Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it but you are always making assumptions.

    To put it in simple terms for George W. Bush

    All Muslims are Terrorists
    All Supporters of Militia (McVeigh) are terrorists

    Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.

    Welcome to 1984, and a Brave New World, the minority will cease to count.

    --
    An Eye for an Eye will make the whole world blind - Gandhi
    1. Re:Doomed Doomed we're all doomed by Alien54 · · Score: 2
      By definition this is an unsolvable problem, because what it requires is definition of undefinition (if such a term exists). While you can make assumptions on unstructured data and apply Natural Language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved but absolutes are impossible to attain.

      Probably it should be randomly structured data, but in any case, the problem still boils down to what you described: trying to decide what is relevant and how. Otherwise you just have a bunch of blobs.

      Or else you have a database with links to the random objects (word docs, etc.), but descriptions, etc in the database about the objects. Quick and dirty, but not the best solution.

      --
      "It is a greater offense to steal men's labor, than their clothes"
  6. Good use of XML by soap.xml · · Score: 2, Informative

    From the article:

    One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.

    This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that would provide the facility to not only classify, but also search the many different types of XML signatures for each type of resource could prove to be a very valuable thing for businesses.

    Imagine the amount of time that could be saved if you could simply search all of those images/diagrams that you have for different projects, and all of the audio clips from the conferences that you have attended, for that key idea that you're sure is in there, but just can't remember where!

    -ryan
    1. Re:Good use of XML by Pinball+Wizard · · Score: 2

      Interestingly enough, relational database technology itself was created to overcome the limitations of hierarchical databases (aka tree-based data structures). The problem back then was that not everything can be organized into a nice little hierarchical tree of data; however, if you could create relations between otherwise unrelated pieces of information you could tie together all sorts of disparate data.

      Seems to me like we're coming full circle with OOP and XML - trying to create huge monolithic structures that can handle everything we need to do. Look at Java - everything ultimately is inherited from the almighty Object. XML is no better in this regard, although you can have lots of different XML files describing different pieces of data.

      I wonder if the people pushing hierarchical (OOP and XML) data models over relational ones realize that the exact opposite was the case 30 years ago. Perhaps we should have stuck with hierarchical databases in the first place?

      --

      No, Thursday's out. How about never - is never good for you?

    2. Re:Good use of XML by Mr.+Shiny+And+New · · Score: 2, Insightful

      It's really a case of using the right tool for the right job. After all, some data is not well expressed in a tree, while some is not well expressed in a relational database. Does this mean it's more right to use one or the other? Too often I see people using XML just because it's new, and not because it actually makes the data easier to work with.

      As for the Object hierarchy in Java, it really doesn't limit what you can do with the objects and classes... you can still have a class with no data and only static methods, which is just like a function in C. The nice thing about the automatic Object superclass is that it makes generic, heterogeneous containers really easy to use.

    3. Re:Good use of XML by dubl-u · · Score: 2

      Interestingly enough, relational database technology itself was created to overcome the limitations of hierarchical databases (aka tree-based data structures). [...] Look at Java - everything ultimately is inherited from the almighty Object.

      Don't mistake a hierarchical type structure for a hierarchical data structure.

      In Java, one might model things so that Persons and Vehicles are both subclasses of Object, and that Cars and Trucks are subclasses of Vehicles. This is indeed strictly hierarchical.

      But a Person called Joe can be the owner for a Truck, ride in a Car, and be the spouse of another person Jane simultaneously. That's not a hierarchical relationship; it's a web of connections.

      You can still have hierarchical relationships with OO data; if Joe sells his truck, the Engine and the four Wheels would automatically go along with it. But that's just one possible relationship.
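
      A rough sketch of that distinction, written in Python rather than Java purely for brevity. The class and attribute names (owns, owner, passengers, spouse) are made up for illustration: the classes form a strict tree, while the objects reference one another freely.

```python
class Vehicle:
    """Base of a small type hierarchy; attribute names are hypothetical."""

    def __init__(self, name):
        self.name = name
        self.owner = None        # reference to a Person, not a parent node
        self.passengers = []     # any number of Person objects


class Car(Vehicle):
    pass


class Truck(Vehicle):
    pass


class Person:
    def __init__(self, name):
        self.name = name
        self.spouse = None       # reference to another Person
        self.owns = []           # any number of Vehicle objects


joe = Person("Joe")
jane = Person("Jane")
truck = Truck("truck")
car = Car("car")

# The type tree is strict (Truck -> Vehicle -> object), but the data is a web:
joe.owns.append(truck)
truck.owner = joe
car.passengers.append(joe)
joe.spouse = jane
jane.spouse = joe
```

      Joe is reachable from the truck, the car, and Jane at once, which no single tree of data could express without duplication.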

  7. UK data protection law by skinfitz · · Score: 2, Interesting

    This sounds interesting - particularly how essentially this is something that makes an unstructured filing system suddenly become a structured filing system. What implications does this have for UK law?
    UK data protection states that copies of email have to be kept for a 28-day minimum period. It advises that "email is a transitory medium" and our company person in charge of such policies has just written a policy that says I'm supposed to program our mail systems to auto-delete mail after a three-month period. Staff are supposed to save the emails that they want to keep to their local hard drives, as they suddenly become "documents" rather than emails.
    Why? Because in the UK any individual can legally ask for copies of any email that mentions them individually by name. Local hard drives can be searched; however, this is only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system" to the point where I would argue that FAT, NTFS and HFS are structured due to the fact they utilise indexes. Add to this the new MS Object Oriented Filing System (OFS) that is basically going to be a simplified version of SQL server as a filing system: is the ability to search previously considered "unstructured" data going to complicate the UK law?

  8. Forward this to the Director of IT, stat! by johncheng · · Score: 4, Funny

    This article will have great importance to our director of IT, since the way our company stores data seems to be completely unstructured.

  9. Google Made to Order by shalunov · · Score: 3, Informative
    Some quotes from the press release:
    People actually vote their preferences by providing links to different documents. You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something.
    This Discoverylink(TM) search engine concept somehow sounds very familiar. Where could I have heard this innovative idea before? Or, as the press release asks, "Where did I read that?" Ah, yes! [google.com]
    1. Re:Google Made to Order by John+Harrison · · Score: 2
      "Where did I read that?" Ah, yes! [google.com]

      Of course if you were in IBM Research, as the authors are, you might have been familiar with The Clever Project prior to Google. It is explained very nicely here.

      I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.

    2. Re:Google Made to Order by shalunov · · Score: 2
      I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.
      The concept of using links as votes to rate resources is simply not novel anymore. Everyone knows about it. I'm not claiming that Google invented it (not being a very non-obvious idea, this was probably independently developed at a number of places); but presenting stuff familiar to everybody as "Invented Here" news sounds like PR.

      But wait, it was a press release. Submitted by someone from IBM, too.

    3. Re:Google Made to Order by John+Harrison · · Score: 2
      But wait, it was a press release. Submitted by someone from IBM, too.

      Just to make the conspiracy complete, I am from IBM as well.

  10. good title, but mismatched content by candot · · Score: 3, Interesting

    They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.

    I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.

    [Skip next section to avoid my self-promotion]

    I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.

  11. This Is Like Mining Money by Anonymous Coward · · Score: 5, Funny

    "email identified by interpretation rather than keywords"

    Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.

    Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.

    Suggestion: Give me a dollar so I can start a business.

  12. Some interesting technology... by jonfromspace · · Score: 2

    These guys have some interesting tech revolving around semantic search... worth a boo anyway...

    --
    I am become Troll, destroyer of threads
  13. Worked at two startups that do this by voisine · · Score: 2, Insightful

    We used Perl regular expressions and lex/yacc-like tools to tease structured data out of semi-structured web pages and other listings. It's doable if you limit your scope to one particular subject, such as job listings. The hardest part is creating contextual lexicons. Does MS mean that a master's degree is required? The job is located in Mississippi? Experience with Microsoft products is required? The HR contact is Ms. Smith? You have to figure it out based on context: is MS preceded by a city name, that type of thing.
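
    A toy illustration of that kind of contextual rule, in Python rather than the Perl and lex/yacc tools described above. The patterns, labels, and the classify_ms helper are assumptions made up for this sketch, not the rules those startups used.

```python
import re

# Hypothetical context rules for the ambiguous token "MS" in a job listing.
TITLE = re.compile(r"\bMs\.\s+[A-Z][a-z]+")                # "Ms. Smith"
STATE_AFTER_CITY = re.compile(r"\b[A-Z][a-z]+,\s*MS\b")    # "Jackson, MS"
DEGREE = re.compile(r"\bMS\s+(?:degree|in)\b")             # "MS degree", "MS in CS"
PRODUCT = re.compile(r"\bMS\s+(?:Office|Word|Excel|Windows|SQL)\b")


def classify_ms(text):
    """Return a guess at what 'MS' means in text, based on its neighbours."""
    if TITLE.search(text):
        return "title"
    if STATE_AFTER_CITY.search(text):
        return "state"
    if DEGREE.search(text):
        return "degree"
    if PRODUCT.search(text):
        return "vendor"
    return "unknown"


examples = [
    "Location: Jackson, MS",
    "Contact Ms. Smith in HR",
    "MS degree required",
    "Experience with MS Office required",
]
labels = [classify_ms(line) for line in examples]
```

    The rules are checked in a fixed order and are easy to fool (a comma-separated skills list ending in "MS" would look like a city and state), which is why limiting the scope to one subject matters so much.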

  14. Re:Oh Man by brer_rabbit · · Score: 2
    I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand, and *poof* the data moves from his text file into a database.

    And pray your boss hasn't heard of Perl :)

  15. Nat. Language Understanding != Speech Recognition by thelenm · · Score: 3, Informative

    A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.

    --
    Use Ctrl-C instead of ESC in Vim!
  16. Polymorphic Searching by waimate · · Score: 2, Informative
    Of all the information stored in computers, 80% of it is unstructured, and arguably it's the most valuable 80%, too.

    Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.

    Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect it, so therefore attempting to design a structure, a priori, to hold it is always doomed to failure.

    The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.

    The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.

    1. Re:Polymorphic Searching by Com2Kid · · Score: 2

      Irony is of course that ALL data that we receive and send for human consumption is INDEED structured.

      It is just not structured according to how the COMPUTER sees it.

      Hell this posting of mine right here is structured, and beyond the obvious sentences/paragraphs explanation that is most often given.

      Almost all written work is designed as so to allow for the reader to follow along the author's thought process.

      Indeed writing could be looked at as some sort of bare level one shot emulation code for the human brain.

      Now for computers this makes NO sense at all.

      Uh duh, they don't think.

      Now with a lot of work native languages can indeed be PARTIALLY understood by computers, and there is an artificial language out there (I forget the name) that was designed from the ground up for both computer and human understanding on a quasi-equal level. But even so it cannot match the same. . . . underlying meanings between both parties.

      Humans are capable of understanding all of the complexities of modern day computers, it may require a lot of work and some darn good wizardry, but it IS possible.

      The issue is that the way that computers 'think' is nothing but a subset of our own thought methods that we have expanded upon and made more complex but ultimately added nothing new to.

      And yet it is by the very nature of being a subset that computer 'thinking' (ugh I hate using that term in this context) can only contain a partial set of the abilities of Human thinking.

      Ah, to take a related explanation from Dansdata

      "But a clever enough algorithmic composition system can get around this, by using a human to direct it through infinite musical space. With any luck, the human will have some idea of what sounds good; that's a really difficult thing to teach a computer."

      (speaking about the Korg Karma's composition functions)

      Humans have to GUIDE the computer.

      For instance the file finder feature on many OSs.

      If I tell my Windows box to search for mIRC* it will search my entire computer's hard drive including my Cygwin folder and my C:\corel folder.

      Which is obviously highly friggin stupid since mIRC is NOT going to be in either one of those. (well not today at least. :) )

      But the COMPUTER does not know that. Despite having a highly refined layout system for my files that has everything compacted into nice small little subsets of subsets as to what types of file it is, the damn computer has;

      No idea WTF mIRC is, what IRC is (outside of some sort of program that tells the computer to interpret network packet X with Y evaluation system and display Z depending on X's contents, and oh yeah, shove the word IRC on the window while you're at it. That is ALL computers know of IRC), what the hell a 'program file' is or why in the world (no concept of 'why' either) mIRC would be in C:\program files\

      Now if I use a bit of human judgement and direct the computer to search only C:\program files\ it can find the requested files just fine.

      But it is STUPID. Period.

      What is the BEST possible outcome we can hope for in this situation? Hmm?

      Hah. All files in some sort of a database system? Make it 'object based'? Or just add assloads of data to the 'file fork'.

      Bah it would STILL come down to the computer going over each friggin entry in a database until it gets a match with the search string. Hell even if some more efficient searching algorithm is used besides just going through every item in the database, the fact is that the computer

      (pay attention here folks)

      STILL HAS NO FRIGGIN IDEA AS TO WHAT IN THE HELL mIRC is.

      I can add descriptors to heck to all files associated with the program. And the computer will STILL NOT KNOW WHAT mIRC IS!

      Once again.

      THE COMPUTER HAS NO IDEA AS TO WHAT THE HELL ANYTHING IS.

      For instance.

      I know offhand that my copy of virtual dub is in F:\video editing tools\virtual dub\ (actually the version number follows it, but close enough. :) )

      Now the computer has no idea as to what 'video editing tools' is (I am using is here folks, plural? Huh, whats that? what is 'what'. The computer does not have an understanding of ANY of these topics.)

      In fact, one thing that SO many people seem to forget, is that COMPUTERS UNDERSTAND NOTHING.

      Nothing AT ALL.

      PERIOD.

      So please.

      Please.

      PLEASE

      Understand that the computer will NEVER be able to truly organize or structure your data, because the computer does not even know what the hell a structure is. Sure you can tell it to shove such and such bits into such and such places, but it knows not what those bits are or what those bits mean or what those places mean or what the hell a place is or ANYTHING ELSE AT ALL.

      I can make my computer feel happy.

      I have it show "I am happy" on the screen.

      That is as close as you are ever going to get the current breeds of computers to being able to understand or think about anything at all.

      Because everything eventually comes down to that same basic fact.

      The computer does what you tell it to and nothing else.

  17. XML won't make it by mangu · · Score: 4, Insightful
    To encode information in XML is as much work as doing it in SQL or any other language. What is needed is artificial intelligence, to take any data source, be it a picture, text, music, or whatever, and classify it. Some examples of what I have wished for:

    - show a text and find other texts about the same subject.

    - hum a tune and find an MP3 of the same music.

    - show a picture and find other pictures of the same girl.

    - better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...


    Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.

    1. Re:XML won't make it by foobar104 · · Score: 2

      Bell Labs? Are you high? The system you're talking about is commercially available: it's called Virage VideoLogger. (I'd provide a link, but the Virage web site sucks so bad.... Just go to www.virage.com [the www is mandatory].)

      VideoLogger has neato features like speech-to-text, speaker identification, face recognition, and keyframe extraction. All of those things happen in real time, if the PC is fast enough for it.

      Combined with a half-decent RDBMS back-end, you can do stuff like search on "Saddam Hussein" and get back a reference to a clip that includes a picture of him, but not his actual name anywhere in the voiceover or the CC data. It's pretty cool.

      It's also, like, $60,000 a copy, or something.

      No, I don't work for Virage, and I've never had a business relationship with them. I've seen their stuff demoed, though.

  18. creative uses by rnd() · · Score: 3, Informative
    There are some companies that are doing some creative things with this kind of technology.

    It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old fashioned statistics and optimization.

    --

    Amazing magic tricks

    1. Re:creative uses by gwernol · · Score: 2

      It makes you wonder how much of this is based on theoretical linguistics [stanford.edu] and formal semantics [mit.edu], and how much is based on good old fashioned statistics [nec.com] and optimization.

      I can't speak to the work discussed in the original post, but I do know that in the real world a formal linguistics/semantics approach is impractical. These systems require complete or near-complete knowledge structures to work at all. They are brittle, meaning as the world changes they fail to adapt to the changing lexicon. Formal systems are often computationally expensive, and scale poorly to large data sets. The practical problems of constructing and maintaining the formal knowledge structures quickly overwhelms the advantages they have over looser approaches.

      So in most cases it is a hybrid of machine learning and statistical techniques that are used in these systems.

      --
      Sailing over the event horizon
  19. database of descriptions by Alien54 · · Score: 2
    Or else you have a database with links to the random objects (word docs, etc.), but descriptions, etc in the database about the objects. Quick and dirty, but not the best solution.

    which, come to think of it, is what is happening in XML anyhow, you are adding tags in the file instead of having descriptive data outside in the database.

    YMMV as far as which method will work better for you.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  20. They are talking about searching by thogard · · Score: 2

    What happens when you don't know what you're even looking for? Data mining is more about ways to automatically find interesting ways of indexing and displaying data than simply looking up known values in unstructured data.

    There is a package that is good at displaying unstructured data and letting you see strange patterns, since it has tools to find patterns in the data. It's called Partek.

  21. This can be handled by rho · · Score: 3, Insightful

    This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."

    Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.

    I relate all this in terms of human-interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data-mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.

    Oh, and Apple called--they'd like their Knowledge Navigator back, please.

    --
    Potato chips are a by-yourself food.
  22. Semi structured data, likely the way. by DutchSter · · Score: 2, Interesting

    This reminds me of work I did as an undergrad in my advanced database class. At the end of the term we were given group projects to research and present "future" database concepts. One of them was unstructured data. The conclusions drawn were that right now unstructured data has no real value. The value assigned to a particular element can only be assigned by the human who assesses those values. My group was assigned the task of making unstructured data available to standard databases.

    Consider:
    3822 North Fickle, Frequent Customer - solicit often.

    The inherent meaning is obvious to a human. Most likely a street address followed by a remark of some type. For a computer to correctly (and in a real world industry, the demand is that it be 100% correct, always, or forget it) establish the meanings would require some serious AI.

    Enter XML. As it exists now, XML is the bridge between unstructured data and data which can be formalized into a structured format. As other people have pointed out, XML does solve some of these issues. Semistructured data via XML is a fairly recent innovation and does a nice job adding meaning and definition to otherwise cryptic (to a computer) strings of ASCII characters.
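
    A small sketch of what that tagging might look like for the example string above, using Python's standard xml.etree.ElementTree. The element names (customer, address, remark) are invented for illustration, and the split point is hard-coded to stand in for the human judgment the post describes.

```python
import xml.etree.ElementTree as ET

raw = "3822 North Fickle, Frequent Customer - solicit often."

# A person, not the program, decided that the first comma separates the
# street address from the free-text remark.
address, remark = raw.split(", ", 1)

# Hypothetical tag names; any agreed-upon vocabulary would do.
record = ET.Element("customer")
ET.SubElement(record, "address").text = address
ET.SubElement(record, "remark").text = remark

# Serialize, then parse again the way a downstream program would.
xml_text = ET.tostring(record, encoding="unicode")
parsed = ET.fromstring(xml_text)
street = parsed.find("address").text
note = parsed.find("remark").text
```

    Once the tags exist, a program can ask for the address by name instead of guessing at the meaning of the string.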

    Going back to the original problem however, XML must be inserted by a human. Until such time as a machine can *establish* intrinsic value on data, there needs to be an intermediate platform. This doesn't help in the example provided where a company keeps data in a Word document. That data is truly unstructured and random. Human interaction can easily destroy the meaning of a document to a computer, without affecting (and perhaps increasing) the comprehension by other people.

    In the near future, it looks like the closest we will truly get is semi-structured, not un-structured data. He who ultimately solves this problem will also solve AI.

  23. This was my final year project thesis by Beliskner · · Score: 2, Informative

    This was my final year project thesis. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC=pipes) to add XML tags to English, and then index them into a search engine which would use the lingual data stored in the XML tags to help the search.
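
    As a loose illustration only (not the thesis program, whose details aren't given here), this Python sketch indexes words together with a part-of-speech tag carried in XML. The w element, the pos attribute, and the two sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical pre-tagged documents: each word carries a part-of-speech tag.
docs = {
    "doc1": "<s><w pos='noun'>duck</w> <w pos='verb'>swims</w></s>",
    "doc2": "<s><w pos='verb'>duck</w> <w pos='adv'>quickly</w></s>",
}


def build_index(tagged_docs):
    """Build an inverted index keyed by (lowercased word, pos tag)."""
    index = defaultdict(set)
    for doc_id, xml_text in tagged_docs.items():
        for word in ET.fromstring(xml_text).iter("w"):
            index[(word.text.lower(), word.get("pos"))].add(doc_id)
    return index


index = build_index(docs)
noun_hits = index[("duck", "noun")]
verb_hits = index[("duck", "verb")]
```

    A plain keyword index would return both documents for "duck"; keying on the tag as well lets the search separate the bird from the verb.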

    NIST does a MASSIVE competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go.

    It took me 200 hours to fish out all these links (before the Google days), and I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old, so pardon me for the odd broken link; armed with these you could probably turn jello into XML ;-)

    My favourite bookmarx
    PROJect[21 links]
    Beginners' Guide[13 links]
    Berkeley Linguistics Dept. Course Summaries, general stuff
    Cryptic IR Vocabulary defined
    Explanations of weird words like hypernym
    How do we produce and understand speech
    How Inverted Files are Created - University of Berkeley
    NLP Univ. of Indiana, very good basics e.g. word sense d
    Simple language - useful....
    What is Natural Language Processing, links
    What is POS tagging........
    Word Sense Disambiguation defined
    Word Sense Disambiguation in detail, scroll down far
    Word Sense Disambiguator - LOLITA (tested at MUC-7 and SENSEVAL competition as best)
    XML for the absolute beginner

    HTML, XML stuff + parsers[19 links]
    Apache plug-in that uhhh does stuff with XML
    Convert COM to XML
    convert XML, HTML to Unix pipeable formats
    converters to and from HTML
    expat XML parser
    HTML Tidy - converts HTML 2 XML + source code!!
    Parse DB (RDBMS, whatever) to XML
    Perl-XML Module List
    PHP Manual XML parser functions - what the hell are they talking about, PHP Virtual M...
    Public SGML-XML Software
    Pyxie - XML Processor for Python, Perl, etc.
    SGML+XML tools.org
    The XML Resource Centre - massive number of links
    W4F wrapper - wrapper converts XML to HTML
    XFlat - convert flat file into XML
    XML Parsers and other XML stuff
    XML.com - Parsers, etc.
    XML-Data Catalog System - uhhhh looks close
    XTAL's general converter - convert anything 2 XML

    other Background[8 links]
    Is Linux ready for the Enterprise, scalable...
    Linux reliability
    Linux Versus Windows NT, Mark(sysinternals bloke)
    PC reliability (pcworld)
    SPEC - Standard Performance Evaluation Corp.
    Systems benchmarks
    TPC - Transaction Processing Performance Council
    Unix Beats Back NT In EDA Workstation Arena
    Proper TREC(-8) QA systems[2 links]

    pg. 387 LIMSI-CNRS pretty deep parsing[2 links]
    More links....
    NLP, IR links - lots to corpii, etc.

    pg. 575 U. of Ottawa and NRL (shit system, got 0%)[1 links]
    LAKE Lab
    pg. 607! University of Sheffield (crap system, but OPEN SOURCE!)[2 links]
    GATE - FREE IE app w/ source code
    LaSIE - ER, coreference, template (cv)

    pg. 617 Univ of Surrey (inconclusive matches)[2 links]
    System Quirk - Or is this their search system..... Hmmmmmm
    Univ of Surrey - pointers (hopefully this is their WILDER search system...)

    SMU - Pg. 65[1 links]
    Natural Language Processing Laboratory at SMU

    Textract[2 links]
    Cymfony - Technology
    Textract - State of the Art Information Extraction

    Xerox uhhhhh maybe[1 links]
    Xerox Palo Alto Research Center
    (OVERVIEW) 1999 TREC-8 Q&A Track Home Page
    NLP bloke, Univ Sussex


    Tcl-Tk[4 links]
    Tcl tutorial
    Tcl-Tk Contributed Programs Index
    Tcl-Tk Resources, sources
    TclXML - manipulating XML using Tcl-Tk
    Artificial Natural Language - Is this what I'm trying to parse into...
    Comparison of Indexers - Prise vs. Inquery vs. MG, etc.
    Eagles - Language Engineering Standards
    Language Technology Group - lots of modules!
    LDC - Linguistic Data Consortium, lots of corpora
    Lexical Resources
    Links 2 resources, indexers.....
    Lots of IR stuff, University of uhhh
    Managing Gigabytes Indexer
    Managing Gigabytes Manuals and stuff
    Htdig search system
    NLP & IR (NLPIR, NIST) Group
    OVERVIEW OF MUC-7-MET-2
    Perl XML Indexing - XML search engine type thing
    Phrasys Language Processing Software Components (money)
    QA HCI bullshit
    SIGIR - TREC-type thing, resources
    SMART indexer system documentation
    Text REtrieval Conference (TREC) Home Page
    The Natural Language Software Registry
    Thunderstone IE and IR products
    WordNet - FREE DOWNLOADABLE lexical English database

    Page created with URL+, nice utility for working with internet shortcuts
    --
    A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?
  24. I would have to HIGHLY disagree... by cr0sh · · Score: 2

    While I would say that the vast majority of posts on /. are mere discussion, etc - there is a small but useful subset buried deep within that arguably contains useful information, or at the very least would serve as a starting point for further research.

    There are a TON of "Ask Slashdot" articles with very valuable information. I also see in this article many valuable posts (especially that one with the tons of links on machine learning and mining). I also remember some valid ideas bandied about back on the homebrew rollercoaster posting. I also remember seeing information on a posting about mozilla yesterday talking about how to get nice looking fonts in X. Finally, I remember quite some time back (possibly up to 2 years ago) an article on AI, in which two individuals, who at minimum seemed to know their shit, and at best were both neuroscientists, were arguing about how neurons worked and how the brain "thinks". Most of that went WAAAY over my head, but it was valuable information (or at least it might be a stepping stone).

    I see this all the time here on /. - true, there is a ton of SPAM and troll posts, etc to wade through, but that is what we are discussing here - how do you "mine" through the ore to get to that nugget of "gold"?

    --
    Reason is the Path to God - Anon
  25. Re:Wow! But don't you have to train this system? by foobar104 · · Score: 2

    It's not "training," really. In the demo I saw, they gave VideoLogger four or five video frames that had Saddam Hussein in them, and drew little boxes around his face to identify it, then assigned a keyword to it.

    Then they ran some news footage through the system that had other pictures of Hussein in it. VideoLogger picked him out and assigned the keyword "Saddam Hussein" to the clip. It did this on face recognition, not speech or CC recognition, because the video clip was from the Russian TV news!

    It was pretty cool, even though it was just a demo.