Slashdot Mirror


Common Crawl Foundation Providing Data For Search Researchers

mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3 and invites you to make use of it for free. All you have to do is set up your own Amazon EC2 Hadoop cluster and pay for the time you use it; accessing the data is free. The idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google now you can't use the excuse that you can't afford it." Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.

10 of 61 comments

  1. Saves you on bandwidth by CmdrPony · Score: 3, Informative

    But it's still a long way to go. They seem to have an archive of what they have crawled. That's it. Processing all those pages on EC2 yourself is still going to be extremely costly and time-consuming.

    1. Re:Saves you on bandwidth by Gumber · Score: 5, Insightful

      Bitch moan, bitch moan. If I had a need for such a dataset, I think I'd be damn grateful that I didn't have to collect it myself. As for the cost of processing the pages, the article suggests that running a Hadoop job on the whole dataset on EC2 might be in the neighborhood of $100. That's not that costly.

    2. Re:Saves you on bandwidth by CmdrPony · Score: 3, Interesting

      To be honest, if I wanted to work on such data and didn't have lots of money, I would actually prefer collecting it myself. Sure, with EC2 I can easily add more processing power and process it quickly, but I can get a dedicated 100 Mbit server with unlimited bandwidth for around 60-70 dollars a month. It also has more space and processing power than EC2 at that price, and I can process the pages as I download them. That way I would build my database structure as I go, and I'm guaranteed a fixed monthly cost.

      Sure, if you're a researcher and want to get quick results, then you can run a job for $100-200 against this dataset. One job. And it had better not be anything complex, or you're paying more. In the end, if you're short on money, it would probably be better to do the crawling part yourself too. That isn't costly, it's just time-consuming.

  2. Interesting, however by CastrTroy · Score: 3, Interesting

    Interesting. However, wouldn't one need to index the data in whatever format they need in order to actually search and get useful results from it? You'd need to pay a fortune in compute time just to analyze that much data. It says they've indexed it, but I don't see how that helps researchers who will want to run their own indexing and analysis against that dataset. Sure, it means you don't have to download and spider all that data, but that's only a very small part of the problem.

    --

    Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.

    1. Re:Interesting, however by Gumber · Score: 4, Insightful

      It may or may not be a small part of the problem, but it isn't a small problem to crawl that many web pages. This likely lets people save a lot of time and effort which they can then devote to their unique research.

      Maybe it will cost a fortune to analyze that much data, but there isn't really any way of getting around the cost if you need that much data. Besides, for what it's worth, the linked article suggests that a Hadoop run against the data costs about $100. I'm sure the real cost depends on the extent and efficiency of your analysis, but that is hardly "a fortune."

  3. It should be obvious by DerekLyons · Score: 4, Interesting

    One thing I've always wondered is why no distributed crawlers or search engines have ever come about.

    Because being 'distributed' is not a magic wand. (Nor is 'crowdsourcing', nor 'open source', nor half a dozen other terms often used as buzzwords in defiance of their actual (technical) meanings.) You still need substantial bandwidth and processing power to handle the index; being distributed just makes the problems worse, as now you also need bandwidth and processing power to coordinate the nodes.

    1. Re:It should be obvious by icebraining · Score: 3, Informative

      Except the editor is wrong, since distributed search engines do exist.

  4. Fix GOOG's braindead pageranking system by quixote9 · Score: 3, Interesting

    Google's way of coming up with pageranks is fundamentally flawed. It's a popularity test, not an information content test. It leads to link farming. Even worse, it leads everyone, even otherwise well-meaning people, not to cite their sources so they won't lose pagerank by having more outgoing links than incoming ones. That is bad, bad, bad, bad, and bad. Citing sources is a foundation of any real information system, so Google's method will ultimately end in a web full of unsubstantiated blather going in circles. It's happening already, but we've barely begun to sink into the morass.

    An essential improvement is coming up with a way to identify and rank by actual information content. No, I have no idea how to do that. I'm just a biologist, struggling with plain old "I." AI is beyond me.
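For readers who want to see the "popularity test" point concretely, here is a minimal sketch of the textbook PageRank power iteration. It is a simplified illustration, not Google's production ranking; the function name `pagerank`, the damping value 0.85, the iteration count, and the four-page toy graph are all assumptions made for the example. Note that the score is computed from the link structure alone; page content never enters the calculation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch only).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outgoing = links.get(p, [])
            if outgoing:
                # A page splits its current score evenly among its targets.
                share = damping * rank[p] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[p] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the most linked-to page ends up on top, and the two pages nobody links to tie for last, whatever they happen to contain.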

  5. Wait, what? by zill · Score: 5, Interesting

    From the article:

    It currently consists of an index of 5 billion web pages, their page rank, their link graphs and other metadata, all hosted on Amazon EC2.

    The crawl is collated using a MapReduce process, compressed into 100Mbyte ARC files which are then uploaded to S3 storage buckets for you to access. Currently there are between 40,000 and 50,000 filled buckets waiting for you to search.

    Each S3 storage bucket is 5TB.

    5TB * 40,000 / 5 billion = 42MB/web page

    Either they made a typo, my math is wrong, or they started crawling the HD porn sites first. I really hope it's not the last of those, because 200 petabytes of porn will be the death of so many geeks that the year of Linux on the desktop might never come.
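The arithmetic in the comment above can be checked with a short script. This is only a sketch using the figures quoted in the thread (5 TB per bucket, 40,000 buckets, 5 billion pages). The 42 MB result appears to come from treating TB and MB as binary units; that is an assumption about how the number was computed, and decimal units give 40 MB instead. The variable and function names are made up for the example.

```python
# Figures quoted in the thread (not independently verified).
BUCKET_SIZE_TB = 5
BUCKETS = 40_000
PAGES = 5_000_000_000


def bytes_per_page(bytes_per_tb):
    """Average bytes per page if every bucket were completely full."""
    return BUCKET_SIZE_TB * bytes_per_tb * BUCKETS / PAGES


# Decimal units: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
decimal_mb_per_page = bytes_per_page(10**12) / 10**6

# Binary units (assumed to be what the comment used):
# 1 TiB = 2**40 bytes, 1 MiB = 2**20 bytes.
binary_mib_per_page = bytes_per_page(2**40) / 2**20

# Total volume implied by the quoted figures, in decimal petabytes.
total_pb = BUCKET_SIZE_TB * BUCKETS / 1000

print(f"decimal: {decimal_mb_per_page:.1f} MB per page")
print(f"binary:  {binary_mib_per_page:.1f} MiB per page")
print(f"total:   {total_pb:.0f} PB")
```

Either way the result is tens of megabytes per page and about 200 PB in total, so the multiplication itself holds up and the odd result has to come from the quoted inputs.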

    1. Re:Wait, what? by Amouth · Score: 3, Informative

      For a modern "website" 42 MB isn't large, but for any single "webpage" it is quite large and uncommon, even with tons of images.

      --
      '...if only "Jumping to a Conclusion" was an event in the Olympics.'