Slashdot Mirror


Extracting Meaning From Millions of Pages

freakshowsam writes "Technology Review has an article on a software engine, developed by researchers at the University of Washington, that pulls together facts by combing through more than 500 million Web pages. TextRunner extracts information from billions of lines of text by analyzing basic relationships between words. 'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project." Try the query "Who has Microsoft acquired?"

26 of 138 comments

  1. Try the query.... by Finallyjoined!!! · · Score: 3, Funny

    "Who has dumped Vista?"

    --
    If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
    1. Re:Try the query.... by maxume · · Score: 3, Funny

      I tried to read your comment, but I did not attempt to understand it.

      --
      Nerd rage is the funniest rage.
  2. Not entirely helpful by CRCulver · · Score: 5, Interesting

    I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends, it just repeats what other people have said, even if they are conspiracy theorists. The query "Who killed JFK?" suggests the CIA did it.

    1. Re:Not entirely helpful by John Hasler · · Score: 2, Insightful

      The major problem is that it assumes the presence of meaning in Web pages in the first place.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    2. Re:Not entirely helpful by morgan_greywolf · · Score: 2, Interesting

      Actually, just like any other search, it just shows ALL of the likely results and you are still responsible for determining for yourself which of the statements is true. It says "CIA killed JFK" but the first result it returns is "Lee Harvey Oswald killed JFK". It also seems to pare down the results somewhat, because I know I've seen conspiracies also suggesting that the KGB killed JFK, or that the Mafia killed JFK. I'm guessing that more people think the CIA killed JFK than the KGB or the Mafia.

    3. Re:Not entirely helpful by owlnation · · Score: 4, Funny

      I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends, it just repeats what other people have said, even if they are conspiracy theorists. The query "Who killed JFK?" suggests the CIA did it.

      So much like Wikipedia then?

    4. Re:Not entirely helpful by jerep · · Score: 2, Insightful

      it just repeats what other people have said

      I don't see anything new here, most people have done this since the beginning of time.

    5. Re:Not entirely helpful by thedonger · · Score: 2, Funny

      it just repeats what other people have said

      I don't see anything new here, most people have done this since the beginning of time.

      Yeah, TextRunner just repeats what other people have said, like most people since the beginning of time.

      --
      Help fight poverty: Punch a poor person.
    6. Re:Not entirely helpful by somersault · · Score: 2, Interesting

      I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends

      Most humans can't either, how do you expect a search engine to?

      There will be a lot of false positives and negatives that will be hard to identify as such unless it works directly with something like snopes.com, which kind of defeats the purpose because it means someone has had to research every question anyway.

      If a project like this simply scoured the whole 'net, you wouldn't really be able to verify anything beyond people's opinions or beliefs, which may or may not be 'true'.

      I think something like this would work really well for factual results if it was only allowed to draw conclusions from verified sources, say something like Wikipedia articles that have been verified by experts in the appropriate field (I've not been following all this type of thing recently but perhaps that is what Wolfram Alpha does already). It could perhaps be useful to have it search the general internet for supplementary results for some questions though, especially those of a philosophical nature where it may be impossible to establish definite answers ("is there a god" and the like).

      --
      which is totally what she said
    7. Re:Not entirely helpful by msbmsb · · Score: 2, Informative

      Semantic processing systems like this (it's not something new) aren't usually able to determine correctness. The truth of a statement is assumed, and the best these NLP engines can do at the moment is identify conflicts and maybe use some reputation metrics to assign a veracity rating to a particular statement, or notify the user that there are differing conclusions. These systems are really just, as the summary states, "information extraction" systems. Just as a regular search engine returns results from its data set, so do these semantic extraction engines, except the data is processed in a semantically organized way so that you can query with semantic/natural-language constraints instead of just keywords and boolean constraints.

      There are some that incorporate some intention or opinion polarity detection, but even those are not capable of sorting "truth" from "conspiracy".

      Additionally, semantic extraction output, like named entities and semantic relations, is useful for many other applications.
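
      To make the "identify conflicts" step concrete, here is a minimal sketch in Python. It assumes the extractor has already produced (subject, relation, object) triples; the function names and the sample data are hypothetical and are not TextRunner's actual code or output. It only counts how often each competing answer was asserted and flags questions with more than one answer.

```python
from collections import Counter, defaultdict


def aggregate(triples):
    """Group (subject, relation, object) triples by (relation, object)
    and count how many times each subject was asserted."""
    grouped = defaultdict(Counter)
    for subj, rel, obj in triples:
        grouped[(rel, obj)][subj] += 1
    return grouped


def answer(grouped, rel, obj):
    """Return the competing subjects ranked by how often they were asserted,
    plus a flag that is True when sources disagree (more than one subject)."""
    ranked = grouped.get((rel, obj), Counter()).most_common()
    return ranked, len(ranked) > 1


# Hypothetical sample assertions, made up for illustration only.
triples = (
    [("Egyptians", "built", "the pyramids")] * 3
    + [("aliens", "built", "the pyramids")]
    + [("smoking", "causes", "cancer")] * 2
)

grouped = aggregate(triples)
ranked, conflict = answer(grouped, "built", "the pyramids")
print(ranked, "conflicting sources" if conflict else "no conflict")
```

      Note that the count measures repetition, not correctness: a frequently repeated claim ranks first whether or not it is true, which is exactly the limitation described above.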

  3. Nascent AI? by Drakkenmensch · · Score: 4, Funny

    I've always viewed intelligence as the ability to take unrelated facts and create new and original ideas from their synthesis. This project may very well lead to new ideas to create the first true AI.

    I'll start stockpiling food and armor piercing rounds for the moment Skynet goes live.

  4. 500 million web pages can't be wrong by Dunbal · · Score: 4, Funny

    Yet strangely, I get a result of:

    TextRunner took 9 seconds.
    Retrieved 0 results for what is the airspeed velocity of an unladen swallow?.

    Meh, call me when this stuff can answer the really USEFUL questions in life.

    --
    Seven puppies were harmed during the making of this post.
    1. Re:500 million web pages can't be wrong by JDHannan · · Score: 3, Funny

      And even worse:

      Retrieved 0 results for what is the answer to life, the universe and everything?.

    2. Re:500 million web pages can't be wrong by sukotto · · Score: 4, Funny

      Obviously it's not indexing http://www.style.org/unladenswallow/

      estimate that the average cruising airspeed velocity of an unladen European Swallow is roughly 11 meters per second, or 24 miles an hour.

      --
      Come play free flash games on Kongregate!
  5. Zero results by John Hasler · · Score: 2, Interesting

    I tried half a dozen queries of the sort I often use Google for (example: "What is the velocity of sound in hydraulic fluid?"). No answers.

    --
    Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
  6. Concise by moogsynth · · Score: 2, Interesting

    Try "Who paid SCO?" Concise, to the point. Nice.

  7. Re:So someone donated a copy of my copyrighted pag by Anonymous Coward · · Score: 2, Interesting

    Allowing a search engine to visit a site and allowing somebody to pass your web page content around are two completely different things.

  8. what causes cancer? by umundane · · Score: 5, Funny

    I learned that

    > smoking (387) causes cancer.

    I was also surprised to learn that

    > girls and women (11) cause most cases of cervical cancer

    This is a great resource if you need to cite a reference for a Wikipedia article.

  9. TextRunner confirms it: by guruevi · · Score: 4, Funny

    Who is at Area 51
    aliens (3), Carter (2), Colonel Sanders (2), Hi Group (2) is at Area 51

    Who bombed WTC
    Al Qaeda (5), Bush (5), Clinton (2), 4 more... bombed the WTC

    Who built the pyramids (example on site):
    Egyptians (298), aliens (73), Pharaohs (40), 77 more... built the pyramids

    What contains antioxidants (example on site):
    Coffee (17), Recent scientific research (15), food (6), 5 more... contain significant amounts of antioxidants

    -- man, I gotta get me some more recent scientific research.

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
  10. Slashdot is not ... by Xyberu · · Score: 2

    Slashdot isn't
            a professional news site
            a normal news site
            a social news site
            a News Site
            a valid source
            a reputable source
            the right source
            a healthy online community
            a goddamn online community
            a Terrorist Organization

  11. Re:Wikipedia tried and failed by Colonel Korn · · Score: 3, Insightful

    That is how Wikipedia was meant to be. A group of statements about subjects, all of which can be referenced to some original source. So that people can look up something quickly and then look at the sources for more definite information....

    Seeing how many people cite Wikipedia directly, use it as the main source for their research and the number of newspapers that have been reported to directly quote inaccurate facts from Wikipedia... I don't think it is working properly. It requires a lot of optimism to believe "People will use that as an initial source and then verify the information"

    That's not Wikipedia's failure. Without it, those same people would just be referencing nothing, or a web site with zero public review and commenting.

    --
    "I zero-index my hamsters" - Willtor (147206)
  12. Correction.... by wowbagger · · Score: 4, Insightful

    "...that pulls together facts by combing through more than 500 million Web pages."

    Correction:

    "...that pulls together assertions by combing through more than 500 million Web pages."

    Whether those assertions are correct or even reasonable is a completely different issue.

    It might be interesting to then take those assertions and have some means to validate or invalidate them, but currently that's going to require meat, not metal.

    Now, if you could come up with some form of AI^Walgorithm to do that automatically, then you would have something.

  13. Re:Exactly by bxbaser · · Score: 2, Funny

    "The query "Who killed JFK?" suggests the CIA did it"

    Hmmm....And now it's not responding because it's "slashdotted"

  14. Why WTC name is spelled in American by tepples · · Score: 2, Informative

    Damn my correct spelling of English words!

    Because the World Trade Center was located on American soil, its name is spelled in American dialect.

  15. Retrieved 1 result for does god exist by ebertx · · Score: 2, Funny

    Retrieved 1 result for does god exist. God DOES exist last night (2).

    Well, that answers that question.

  16. Re:Wow, impressive, but prior art... by rm999 · · Score: 2, Interesting

    I think you're missing the point. This is an AI project - it's research. Presumably, the questions you are typing in haven't been processed by a complicated nest of if-thens written by someone who knows English; instead, statistical models of language and meaning were extracted from the internet. Some people claim this is the equivalent of "teaching" a computer.

    The first example, which is what most search engines do, leads to impressive search results but is limited by the logic people can code up. This AI, on the other hand, may be a primitive example of the way Google will work 15 years from now.