Huge Site Ranking Dataset Donated to the Common Crawl Foundation
Greg Lindahl writes "blekko is donating search engine ranking data for 140 million domains and 22 billion URLs to the Common Crawl Foundation. Common Crawl is a non-profit dedicated to making the greatest (yet messiest) dataset of our time, the web, available to everyone, including tinkerers, hackers, activists, and new companies. blekko's ranking data will initially be used to improve the quality of Common Crawl's 8 billion webpage public crawl of the web, and eventually will be directly available to the public."
The idea is to give everyone access to crawl data. If you work at a large search company, you already have access to crawl data. You can also set up crawlers to get the data yourself, but that is expensive, and having countless crawlers doing duplicative work is not ideal. Our idea is that there should be one common repository for crawl data that anyone can use. Researchers are using it for NLP, IR, sentiment analysis, and many other things, such as measuring the adoption of metadata formats (http://www.webdatacommons.org/). Educators are using it as a real-world dataset to teach big data techniques in the classroom. Developers and entrepreneurs are using it for startups. Sorry I don't have a car analogy :) Feel free to email me if you have any other questions: lisa at commoncrawl dot org