Common Crawl Foundation Providing Data For Search Researchers
mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3 and invites you to make use of it for free. All you have to do is set up your own Amazon EC2 Hadoop cluster and pay for the time you use it; accessing the data is free. The idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google now you can't use the excuse that you can't afford it."
Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.
But there's still a long way to go. They seem to have an archive of what they have crawled, and that's it. Processing all those pages on EC2 is still going to be extremely costly and time-consuming.
Interesting. However, wouldn't one need to index the data in whatever format one needs in order to actually search it and get useful results? You'd need to pay a fortune in compute time just to analyze that much data. It says they've indexed it, but I don't see how that helps researchers who will want to run their own indexing and analysis against that dataset. Sure, it means you don't have to spider and download all that data yourself, but that's only a very small part of the problem.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Must be a conspiracy set up by Amazon to get people to pay for vast amounts of compute time. Why not allow people to purchase copies of the data on hard disk or tape? 5 billion pages at 100 KB each (a high estimate, perhaps) is 500 TB. Compressed with a good algorithm, you could probably get that down to 10 TB. That's not "that much" if this is the kind of research you are interested in.
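For anyone who wants to check the estimate above, here is a rough sketch in Python. The page count and the 100 KB per page come from the comment; the 50:1 compression ratio is only what the jump from 500 TB to 10 TB implies, not a measured figure, and the names are made up for illustration.

```python
# Back-of-the-envelope size of the crawl, using the figures from the comment.
PAGES = 5_000_000_000      # 5 billion pages (from the article)
BYTES_PER_PAGE = 100_000   # 100 KB per page (the commenter's high estimate)
COMPRESSION_RATIO = 50     # assumption: the ratio implied by 500 TB -> 10 TB

TB = 10**12                # decimal terabyte


def crawl_size_tb(pages, bytes_per_page):
    """Total uncompressed size in decimal terabytes."""
    return pages * bytes_per_page / TB


raw_tb = crawl_size_tb(PAGES, BYTES_PER_PAGE)
compressed_tb = raw_tb / COMPRESSION_RATIO

print(f"raw: {raw_tb:.0f} TB, compressed: {compressed_tb:.0f} TB")
```

With these inputs the raw figure comes out at 500 TB and the compressed one at 10 TB. Swap in a lower ratio to see how quickly the "ship it on disks" case changes.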
Because being 'distributed' is not a magic wand. (Nor is 'crowdsourcing', nor 'open source', nor half a dozen other terms often used as buzzwords in defiance of their actual (technical) meanings.) You still need substantial bandwidth and processing power to handle the index. Being distributed just makes the problem worse, because now you also need bandwidth and processing power to coordinate the nodes.
A conspiracy? You're going to have to pay someone for the compute time. It's not like a lot of people have big clusters lying around, so a lot of people are going to opt to pay Amazon anyway.
As for selling access to the data on physical media, it doesn't look like there is anything to stop you from taking advantage of Amazon's Export Service to get the data set on physical media.
Google's way of coming up with pageranks is fundamentally flawed. It's a popularity test, not an information content test. It leads to link farming. Even worse, it leads everyone, even otherwise well-meaning people, not to cite their sources so they won't lose pagerank by having more outgoing links than incoming ones. That is bad, bad, bad, bad, and bad. Citing sources is a foundation of any real information system, so Google's method will ultimately end in a web full of unsubstantiated blather going in circles. It's happening already, but we've barely begun to sink into the morass.
An essential improvement is coming up with a way to identify and rank by actual information content. No, I have no idea how to do that. I'm just a biologist, struggling with plain old "I." AI is beyond me.
It currently consists of an index of 5 billion web pages, their page rank, their link graphs and other metadata, all hosted on Amazon EC2.
The crawl is collated using a MapReduce process, compressed into 100Mbyte ARC files which are then uploaded to S3 storage buckets for you to access. Currently there are between 40,000 and 50,000 filled buckets waiting for you to search.
Each S3 storage bucket is 5TB.
5 TB * 40,000 / 5 billion ≈ 42 MB per web page
Either they made a typo, my math is wrong, or they started crawling the HD porn sites first. I really hope it's not the latter because 200 petabytes of porn will be the death of so many geeks that the year of Linux on the desktop might never come.
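The arithmetic above can be rechecked with a short Python sketch. The bucket count, the 5 TB per bucket and the page count are the figures quoted in this thread; the function names are hypothetical. The last function is only one possible reading of the "typo" theory, namely that the 40,000 items are really the 100 MB ARC files mentioned in the article. That is an assumption, not something the article confirms.

```python
BUCKETS = 40_000           # low end of the "40,000 to 50,000" quoted above
TB_PER_BUCKET = 5          # the commenter's figure for one bucket
PAGES = 5_000_000_000      # 5 billion pages


def per_page_mb(buckets, tb_per_bucket, pages, binary=True):
    """Average megabytes per page if every bucket were completely full."""
    unit = 1024 if binary else 1000
    total_mb = buckets * tb_per_bucket * unit * unit  # TB -> GB -> MB
    return total_mb / pages


def per_page_bytes_if_arc_files(files, pages, file_mb=100):
    """Assumed alternative reading: each counted item is one 100 MB ARC file."""
    return files * file_mb * 1_000_000 / pages


print(f"binary units:  {per_page_mb(BUCKETS, TB_PER_BUCKET, PAGES):.1f} MB/page")
print(f"decimal units: {per_page_mb(BUCKETS, TB_PER_BUCKET, PAGES, binary=False):.1f} MB/page")
print(f"total:         {BUCKETS * TB_PER_BUCKET / 1000:.0f} PB")
print(f"ARC reading:   {per_page_bytes_if_arc_files(BUCKETS, PAGES):.0f} bytes/page")
```

The 42 MB figure falls out of binary units (about 41.9); decimal units give 40 MB, and either way the total is about 200 PB. Under the ARC-file reading the average drops to 800 bytes per page, which looks too small in the other direction, so the quoted numbers seem hard to reconcile whichever way you read them.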
I mean, hosting the stuff on Amazon's servers is one thing; it's going to have to be hosted somewhere. What I feel uncomfortable about is that anyone who wants to do any research on the data ends up having to pay Amazon.
Hmm ....
So you expect the researchers to FedEx you 100,000 2 TB hard drives upon request? We're talking about 200 petabytes of data here. It's going to take forever to transfer no matter how wide your intertubes are. A shipping container of hard drives is literally the only way to move this much data in a timely manner.
Since there's no easy way to move the data, it only makes sense to run your code on the cluster where the data already resides.
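To put rough numbers on that, here is a small Python sketch. The 200 PB and the 2 TB drives are the figures from this thread; the link speeds are assumptions picked purely for illustration, and the calculation ignores protocol overhead and assumes the link stays saturated the whole time.

```python
PB = 10**15
TB = 10**12
SECONDS_PER_YEAR = 365 * 24 * 3600

DATASET_BYTES = 200 * PB   # the figure derived earlier in the thread
DRIVE_BYTES = 2 * TB       # one 2 TB hard drive


def drives_needed(dataset_bytes, drive_bytes):
    """Number of drives needed to hold the dataset, rounded up."""
    return -(-dataset_bytes // drive_bytes)  # integer ceiling division


def transfer_years(dataset_bytes, link_bits_per_second):
    """Years needed to move the dataset over a fully saturated link."""
    seconds = dataset_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_YEAR


print(f"drives needed: {drives_needed(DATASET_BYTES, DRIVE_BYTES):,}")
for gbps in (1, 10):  # assumed link speeds
    years = transfer_years(DATASET_BYTES, gbps * 10**9)
    print(f"{gbps} Gbit/s link: {years:.1f} years")
```

That gives 100,000 drives, roughly 50 years on a 1 Gbit/s link and roughly 5 years at 10 Gbit/s, which is the whole argument for running the code next to the data.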
Seriously, the EC2 cluster is already there, and setting it up will cost you a lot less than building your own from the ground up. Time costs money too on this planet. Also, most importantly, your 80 dollar box is not going to be able to store metadata on 5 billion web pages and process it at any reasonable I/O speed.
Go build your own processing cluster and see how long it takes you to do that for less than what EC2 would charge. Once you're finished, you could make a business out of it and compete with Amazon. The last, obligatory step: Profit!
I was promised a flying car. Where is my flying car?