Slashdot Mirror


User: commoncrawl

commoncrawl's activity in the archive.

Stories: 0
Comments: 1

Comments · 1

  1. Re:Wait, what? on Common Crawl Foundation Providing Data For Search Researchers · Score: 0

    "Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited." As we mentioned in a separate comment below, there are actually 323694 files in our current bucket, which will grow to 455827 once we consolidate data from an older bucket. Each archive file is 100 MB (compressed), and the average doc size is around 10 KB (compressed). As we move towards a more focused and sustained crawl in 2012, the file counts in the bucket will continue to grow significantly. We also hope to augment the raw crawl with additional metadata that should make it possible to avoid a complete bulk scan if so desired.
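
    For a rough sense of scale, the figures quoted in the comment can be turned into a back-of-envelope estimate. This is only a sketch under stated assumptions, not numbers reported by the poster: it assumes every archive file is exactly 100 MB, that "10K" means 10 KB per document, and that units are decimal (1 MB = 10^6 bytes). The function and constant names are made up for illustration.

```python
# Back-of-envelope estimate from the figures quoted in the comment.
# Assumptions (not stated in the source): every archive file is a full
# 100 MB, "10K" means 10 KB, and units are decimal (1 MB = 10**6 bytes).

ARCHIVE_BYTES = 100 * 10**6   # compressed size of one archive file
DOC_BYTES = 10 * 10**3        # average compressed size of one document


def estimate(file_count: int) -> tuple[float, int]:
    """Return (total compressed size in TB, approximate document count)."""
    total_bytes = file_count * ARCHIVE_BYTES
    total_tb = total_bytes / 10**12
    approx_docs = total_bytes // DOC_BYTES
    return total_tb, approx_docs


if __name__ == "__main__":
    for label, files in (("current bucket", 323_694),
                         ("after consolidation", 455_827)):
        tb, docs = estimate(files)
        print(f"{label}: ~{tb:.1f} TB compressed, ~{docs:,} documents")
```

    Under those assumptions the current bucket works out to roughly 32 TB compressed and about 3.2 billion documents, rising to roughly 46 TB and about 4.6 billion documents after consolidation. Real totals would be lower if some archive files are smaller than 100 MB.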