Extracting Meaning From Millions of Pages
freakshowsam writes "Technology Review has an article on a software engine, developed by researchers at the University of Washington, that pulls together facts by combing through more than 500 million Web pages. TextRunner extracts information from billions of lines of text by analyzing basic relationships between words. 'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project." Try the query "Who has Microsoft acquired?"
"Who has dumped Vista?"
If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends; it just repeats what other people have said, even if they are conspiracy theorists. The query "Who killed JFK?" suggests the CIA did it.
What the heck.
I'll start stockpiling food and armor piercing rounds for the moment Skynet goes live.
Yet strangely, I get a result of:
TextRunner took 9 seconds.
Retrieved 0 results for what is the airspeed velocity of an unladen swallow?.
Meh, call me when this stuff can answer the really USEFUL questions in life.
Seven puppies were harmed during the making of this post.
"Retrieved 0 results for Is Linux ready for the desktop?."
I tried half a dozen queries of the sort I often use Google for (example: "What is the velocity of sound in hydraulic fluid?"). No answers.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
Try "Who paid SCO?" Concise, to the point. Nice.
Are we moving towards a web in which Google centralises everything on their own pages? These new engines present content without the need to visit pages it originates from. Is Google basically mooching off other people's websites with hardly anything - if anything at all - in return?
It could be dangerous if the only visitor a web site can expect is the Google bot.
The same copyrighted pages that you allowed Google to crawl, since you obviously didn't protect them with a robots.txt?
What? You've found a search engine that honors robots.txt?
But AOL is nothing like Shakespeare.
I want to delete my account but Slashdot doesn't allow it.
Allowing a search engine to visit a site and allowing somebody to pass your web page content around are two completely different things.
Why deal with uncertainties about who-killed-who in the past, when you can have a lot more fun with what could be in the future. ... seems an inside job by Hillary is most probable just below a vicious murder by Ted Nugent. Scary!
"Who killed obama?"
I learned that
> smoking (387) causes cancer.
I was also surprised to learn that
> girls and women (11) cause most cases of cervical cancer
This is a great resource if you need to cite a reference for a Wikipedia article.
Who is at Area 51
aliens (3), Carter (2), Colonel Sanders (2), Hi Group (2) is at Area 51
Who bombed WTC
Al Qaeda (5), Bush (5), Clinton (2), 4 more... bombed the WTC
Who built the pyramids (example on site):
Egyptians (298), aliens (73), Pharaohs (40), 77 more... built the pyramids
What contains antioxidants (example on site):
Coffee (17), Recent scientific research (15), food (6), 5 more... contain significant amounts of antioxidants
-- man, I gotta get me some more recent scientific research.
That is how Wikipedia was meant to be. A group of statements about subjects, all of which can be referenced to some original source. So that people can look up something quickly and then look at the sources for more definite information....
Seeing how many people cite Wikipedia directly, use it as the main source for their research, and the number of newspapers that have been reported to directly quote inaccurate facts from Wikipedia... I don't think it is working properly. It requires a lot of optimism to believe "People will use that as an initial source and then verify the information."
Slashdot isn't
a professional news site
a normal news site
a social news site
a News Site
a valid source
a reputable source
the right source
a healthy online community
a goddamn online community
a Terrorist Organization
...from me being completely silent, mouth shut and all, like my wife does! And she never had a single reboot in 43 years! Then again ... maybe that's precisely the problem?
Intellectual Property: an immaterial non-entity, most fiercely contended by those with no proper intellect to speak of.
I tried asking the real name of Doctor Who, and the site basically crapped out LOL, totally useless.
I would go with...
But meters per second and miles per hour? WHY?!
I typed in "how does a computer become self aware?" and it just said something about being busy because it's currently controlling California!
"Who is your daddy?" got 0 results.
"We shall grapple with the ineffable, and see if we may not eff it after all." - Douglas Adams
No answers.
That's what they said about SkyNet.
Is this closed source? Any idea what language this is implemented in?
Apparently Mount Marcy, Mount Elbrus, Mount Kilimanjaro and Mount Etna are all the highest mountain. Then again, I was also informed that "high mountains are the hum of human cities torture", so I think I'll just steer clear of mountains altogether.
Why is my TV suddenly not working anymore?
Try
"What is Slasdot?"
Answer
Digg is Slashdot
\u262D = \u5350
Retrieved 0 results for what is the answer to life, the universe and everything.
FAIL!
Turns out to be way cooler than Wolfram Alpha. Now just think if it has the whole web. Wait, scratch that, I bet wikipedia's already in there. Also, skynet.
"...that pulls together facts by combing through more than 500 million Web pages."
Correction:
"...that pulls together assertions by combing through more than 500 million Web pages."
Whether those assertions are correct or even reasonable is a completely different issue.
It might be interesting to then take those assertions and have some means to validate or invalidate them, but currently that's going to require meat, not metal.
Now, if you could come up with some form of AI^Walgorithm to do that automatically, then you would have something.
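To make the facts-versus-assertions point concrete, here is a minimal sketch of how result lines like "Egyptians (298), aliens (73), Pharaohs (40), 77 more..." could be produced. It is only a guess at the display logic, not TextRunner's actual code: the function name `summarize` is made up, the first three counts are the ones quoted in this thread, and the "slaves" entry is invented so the remainder is non-empty. Note that nothing in it checks whether an assertion is true; the numbers only say how often something was asserted.

```python
from collections import Counter


def summarize(counts: Counter, predicate: str, shown: int = 3) -> str:
    """Render the most frequently asserted subjects with their counts.

    The count is the number of times an assertion was seen, NOT a measure
    of whether it is correct.
    """
    top = counts.most_common(shown)
    parts = [f"{subject} ({n})" for subject, n in top]
    remaining = len(counts) - len(top)
    if remaining > 0:
        parts.append(f"{remaining} more...")
    return ", ".join(parts) + " " + predicate


# First three counts are as quoted in the thread; "slaves" is a made-up entry.
pyramid_builders = Counter(
    {"Egyptians": 298, "aliens": 73, "Pharaohs": 40, "slaves": 12}
)
line = summarize(pyramid_builders, "built the pyramids")
print(line)
```

Ranking by frequency pushes popular claims to the top, which is why "aliens" can land in second place; validating the claims would need a separate step that this sketch does not attempt.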
www.eFax.com are spammers
"The query "Who killed JFK?" suggests the CIA did it"
Hmmm... And now it's not responding because it's "slashdotted"
You did.
So I take it this thing also hates grammar?
sic transit gloria mundi
love (53), song (19), Life (16), 81 more... is the meaning of life
1) Of the 81 more, 42 doesn't show up anywhere
2) The stupid JavaScript hiding makes copy and paste a pain
same fuckers(2) that Framed Roger Rabbit
I knew it...
Retrieved 8 results for What causes global warming?
Human(10) vommitting.BUTT PLUGS (2).
Sorry everyone... I'll take it out.
TextRunner gets rid of that manual labor. A user can enter, for example, "kills bacteria," and the engine will come up with pages that offer the insights that "chlorine kills bacteria" or "ultraviolet light kills bacteria" or "heat kills bacteria"--results called "triples"--and provide ways to preview the text and then visit the Web page that it comes from.
Wow, incredible. Because doing a search of "kills bacteria" with the quotes on Google won't get you those kinds of results. Oh wait, yeah it will. In fact, it too will return "chlorine kills bacteria" and "ultraviolet light kills bacteria" and "heat kills bacteria". And Google also provides a way to preview the text and then visit the web page that it comes from.
Yeah, I know, I know, they just put a bad example in the article, but it's a ridiculously bad example.
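For what it's worth, the difference from a quoted-phrase search is easier to see in code. Below is a rough sketch of a lookup over (subject, relation, object) triples, assuming the extraction step has already happened. The `Triple` type, the `subjects_for` helper, and the sample data are all hypothetical illustrations, not TextRunner's real data model or output.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def subjects_for(triples: Iterable[Triple], relation: str, obj: str) -> Counter:
    """Count how often each subject appears with the given relation and object."""
    relation, obj = relation.lower(), obj.lower()
    return Counter(
        t.subject.lower()
        for t in triples
        if t.relation.lower() == relation and t.obj.lower() == obj
    )


# Hypothetical extractions, one per source sentence; not real TextRunner output.
extracted = [
    Triple("chlorine", "kills", "bacteria"),
    Triple("heat", "kills", "bacteria"),
    Triple("Chlorine", "kills", "bacteria"),
    Triple("ultraviolet light", "kills", "bacteria"),
    Triple("smoking", "causes", "cancer"),
]

matches = subjects_for(extracted, "kills", "bacteria")
for subject, count in matches.most_common():
    print(f"{subject} ({count}) kills bacteria")
```

A phrase search hands back pages that contain the string; a lookup like this groups the answers by subject and counts them, so the user sees "what kills bacteria" rather than a list of documents.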
This has to be played with to be appreciated. On request, it delivered a set of interesting papers about US-EPA misrepresentation of science. And it returned a null result for "Has any climate model been validated?"
This is going to be fun
I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
I asked "Where in the world is Carmen San Diego?". The page threw up a Java error.
I guess nobody really knows.
... you extract millions from the meaning of pages! ;)
Sorry, couldn't resist.
Any sufficiently advanced intelligence is indistinguishable from stupidity.
What makes grass grow?
Answer - (1 thing)
blood...
TextRunner took 2 seconds.
Retrieved 0 results for who performs warrantless wiretapping.
0 results.
Damn my correct spelling of English words!
Because the World Trade Center was located on American soil, its name is spelled in American dialect.
produces 0 results :P
"The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid
Well, that answers that question.
'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google,
I really wondered what he was getting at with this. It seems almost nonsensical, like something someone in marketing would come up with.
Now that the site is slashdotted I know that he means if only a few people use it, it's very scalable, but if a bunch of people are directed to use it (say, through Slashdot) then it doesn't scale very well.
If I had points, parent would be modded funny. This is an interesting resource... but it doesn't answer the real question: Coke, or Pepsi?
No answer provided. Enjoy.
This ain't no upwardly mobile freeway This is the road to hell
"Who invented the internet?"
Gore (396), Americans (24), US (10) 47 more... invented the Internet
"Bush (34) is the best president of the US"
Entering the query "Who is George Bush?" returned the following tidbits among other things:
General Draper was George Bush's guru
Hurricane Katrina is George Bush's Monica Lewinsky
Tony Blair is George Bush's poodle
democratic Iraq is George Bush's formidable legacy
Iraq is George Bush's waterloo
Hillary is the democratic version of good old George W. Bush
blue socks are Critics of George W. Bush
Bruce Bartlett is George W. Bush Bankrupted America
biggest terrorist is George W. Bush
ITYM "parse", but spelling Nazism aside, they are extracting the ideas from the pages (or at least trying to), not the expression of ideas, so copyright doesn't come into play (IANAL, etc.). This is just an attempt at automating the collation of existing research, and indeed similar ideas have been attempted in the past with smaller data sources, particularly in combination with other work in machine learning.
I got zero results. In 500 million pages, this should have been answered 500 million and one times.