Huge Site Ranking Dataset Donated to the Common Crawl Foundation
Greg Lindahl writes "blekko is donating search engine ranking data for 140 million domains and 22 billion URLs to the Common Crawl Foundation. Common Crawl is a non-profit dedicated to making the greatest (yet messiest) dataset of our time, the web, available to everyone, including tinkerers, hackers, activists, and new companies. blekko's ranking data will initially be used to improve the quality of Common Crawl's 8 billion webpage public crawl of the web, and eventually will be directly available to the public."
Can I get a copy of the 'Leather Bound 1st Edition'? I'd prefer free shipping.
I didn't realize the web wasn't available to everyone, including tinkerers, hackers, activists, and new companies. Thank $(DEITY) the Common Crawlers are here to make sure that my port 80 hasn't yet been pried from my cold, dead fingers.
John
I've met some of the guys from blekko, and they do solid work, for the web as well as for their site. QED.
How does this project compare with the Internet Archive?
... it's better than being judged.
How does this project compare with the Internet Archive?
commoncrawl.org will be available on archive.org a lot longer than it will be available on commoncrawl.org
Blekko has pretty horrible search results compared to Google, Bing or even Duck Duck Go. So while the raw data may be useful, the actual ranking data should not be trusted as being anywhere close to good.
Hi, I work at Common Crawl. Internet Archive is awesome and does really important work. The main difference between us and Internet Archive is that you can analyze our data. Internet Archive is a vault, and its crawl is not available on a platform where you can run jobs against it. Because we put our data on Amazon and other compute platforms, anyone can access it and run jobs against it. If you wanted to do that with Internet Archive's crawl, you would have to ask permission, get permission, and download it to your personal data center in order to analyze it. I don't know too many people with a personal data center :)
Lisa
CommonCrawl is awesome! We have already used it for multiple projects at my work. Very cool!
Here.
Because we put it on Amazon and other compute platforms, anyone can access our data and run jobs against it. I don't know too many people with a personal data center :)
Lisa
Thanks. Hopefully you do know a few people with access to a data centre...like AWS.
Hi, since you work there, do you have any idea how much data we're actually talking about here?
What implications (if any) does this have for Yacy?
http://yacy.net/en/ (the distributed search engine)