Common Crawl Foundation Providing Data For Search Researchers

Posted by Unknown on Monday November 14, 2011 @01:15PM from the doesn't-archive-dot-org-do-that dept.

mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3 and invites you to make use of it for free. All you have to do is setup your own Amazon EC2 Hadoop cluster and pay for the time you use it — accessing the data is free. This idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google now you can't use the excuse that you can't afford it." Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.

2 of 61 comments (clear)

Min score:

Reason:

Sort:

Re:Saves you on bandwidth by Gumber · 2011-11-14 14:11 · Score: 5, Insightful

Bitch moan, bitch moan. If I had a need for such a dataset, I think I'd be damn grateful that I didn't have to collect it myself. As for the cost of processing the pages, the article suggests that running a hadoop job on the whole dataset on EC2 might be in the neighborhood of $100. That's not that costly.
Wait, what? by zill · 2011-11-14 14:49 · Score: 5, Interesting

From the article:

It currently consists of an index of 5 billion web pages, their page rank, their link graphs and other metadata, all hosted on Amazon EC2.

The crawl is collated using a MapReduce process, compressed into 100Mbyte ARC files which are then uploaded to S3 storage buckets for you to access. Currently there are between 40,000 and 50,000 filled buckets waiting for you to search.
Each S3 storage bucket is 5TB.

5TB * 40,000 / 5 billion = 42MB/web page

Either they made a typo, my math is wrong, or they started crawling the HD porn sites first. I really hope it's not the latter because 200 petabytes of porn will be the death of so many geeks that the year of Linux on the desktop might never come.