Common Crawl Foundation Providing Data For Search Researchers

Posted by Unknown on Monday November 14, 2011 @01:15PM from the doesn't-archive-dot-org-do-that dept.

mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3 and invites you to make use of it for free. All you have to do is set up your own Amazon EC2 Hadoop cluster and pay for the time you use it; accessing the data is free. The idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google now you can't use the excuse that you can't afford it." Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.

Saves you on bandwidth by CmdrPony · 2011-11-14 13:30 · Score: 3, Informative

But it's still a long way to go. They seem to have an archive of what they have crawled, and that's it. Processing all those pages on EC2 yourself is still going to be extremely costly and time-consuming.

Re:It should be obvious by icebraining · 2011-11-14 14:51 · Score: 3, Informative

Except the editor is wrong, since distributed search engines do exist.

--
Dilbert RSS feed

Re:Wait, what? by Amouth · 2011-11-14 16:58 · Score: 3, Informative

For a modern "website" 42 MB isn't large, but for any single "webpage" it is quite large and not common, even with tons of images.

--
'...if only "Jumping to a Conclusion" was an event in the Olympics.'