Genetic Database Hits One Billion Entries
ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."
Wow, that's almost 12U of rack space. Oh my *yawn*
Now the fact that that's all genetic data, that's amazing, considering a human genome is only ~1GB, so that's about 22,000 humans' worth.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.
I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!
Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.
Don't disappoint your bird dog. Go to the range.
I mean, most of that data is just redundant pairs of A-T, C-G, etc...
I reckon you could zip it up and it'll fit on a couple of floppy disks.
All this hype about how vastly much paper you get if you print it all out misses the wonder of the thing.
The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.
The human genome is 3 billion base pairs. Each base pair is one of only four possibilities, so two bits each: 6 billion bits, or 750 megabytes. That's roughly one CD-ROM. There is a lot of redundancy in it too: many of those base pairs are never 'expressed' as proteins, and many are repeated dozens of times. So with compression, or even just deleting the junk, you'd get it down to maybe 100 to 200 megs, tops.
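For anyone who wants to check the two-bits-per-base arithmetic, here is a minimal sketch in Python. It assumes the sequence contains only A, C, G and T (real sequence files also carry ambiguity codes and quality data, which this ignores), and the names pack, unpack and packed_size_bytes are made up for illustration.

```python
# Pack a DNA string at 2 bits per base, 4 bases per byte.
# Assumption: only the letters A, C, G, T appear in the input.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Encode a sequence into bytes, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte decodes the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Decode bytes back into a sequence of the given length."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(num_bases: int) -> int:
    """Bytes needed for num_bases at 2 bits each, rounded up."""
    return (num_bases * 2 + 7) // 8


if __name__ == "__main__":
    seq = "GATTACA"
    packed = pack(seq)
    assert unpack(packed, len(seq)) == seq
    print(len(packed), "bytes for", len(seq), "bases")
    print(packed_size_bytes(3_000_000_000) / 1e6, "MB for 3 billion bases")
```

The length has to be stored separately because the last byte may be padded; that is a design shortcut of this sketch, not a claim about how any real archive stores its data.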
I find it utterly amazing that all that complexity is so amazingly compactly encoded.
Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.
Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.
www.sjbaker.org
If I'm doing the math right, that would put the storage needed at about 25PB eight years out from now (ten doublings at ten months each is 1024 times the current needs). Which is only 50,000 500GB drives. While that's certainly quite a lot, if the average hard drive holds even 10GB, times millions of computers just in the US, I think we're set. Seagate probably sells that much storage every week.
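A quick way to redo that projection, as a sketch only: the 22 TB starting size and ten-month doubling time come from the announcement, the 500GB drive size comes from the estimate above, and decimal units (1 TB = 1000 GB) are an assumption. The function names are made up for illustration.

```python
import math


def projected_size_tb(current_tb: float, months: float,
                      doubling_months: float = 10.0) -> float:
    """Size after `months`, assuming steady doubling every `doubling_months`."""
    return current_tb * 2 ** (months / doubling_months)


def drives_needed(size_tb: float, drive_gb: float = 500.0) -> int:
    """Whole drives needed to hold size_tb, using 1 TB = 1000 GB."""
    return math.ceil(size_tb * 1000 / drive_gb)


if __name__ == "__main__":
    # Ten doublings at ten months each is 100 months, a bit over eight years.
    future_tb = projected_size_tb(22, months=100)
    print(f"{future_tb:,.0f} TB, i.e. about {future_tb / 1000:.1f} PB")
    print(f"{drives_needed(future_tb):,} drives of 500GB")
```

Under those assumptions the result lands in the low tens of petabytes, which is a few tens of thousands of 500GB drives. Whether the doubling rate actually holds for eight years is anyone's guess.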
I'm pretty sure we'll run out of species to map the genetic info of before we run out of space to store that info.
Still, quite the accomplishment.
How are sites slashdotted when nobody reads TFAs?
What's even more lame is that 99% of the Slashdot comments on this article so far are stuck on units of measure. Clearly it's a lot. Instead of debating the length of a piece of string, how about some discussion on how to distribute and analyze so much data? At this point I'd almost welcome some grousing about patents or dumb Google DNA-related theories. We're barely scratching the surface on understanding genetic data. Even finding approximate substring matches within samples is fairly difficult. Here we have the world's biggest crossword puzzle, which encodes the secrets of life itself, and most of you guys are stuck on the point size of the font.
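To make the approximate substring matching point concrete, here is a minimal sketch of the textbook dynamic-programming approach (edit distance where the match may start anywhere in the text, often credited to Sellers). The function name best_approx_match is made up for illustration, and this is not a claim about what any real sequence-search tool does.

```python
from typing import Tuple


def best_approx_match(pattern: str, text: str) -> Tuple[int, int]:
    """Return (edit_distance, end_index) for the substring of `text`
    closest to `pattern`. end_index is exclusive; ties keep the earliest end."""
    m = len(pattern)
    # Column for the empty text prefix: pattern[:i] costs i deletions.
    prev = list(range(m + 1))
    best = (prev[m], 0)
    for j, ch in enumerate(text, start=1):
        # cur[0] stays 0 because a match is allowed to start at any position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitute
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # pattern character missing
        if cur[m] < best[0]:
            best = (cur[m], j)
        prev = cur
    return best


if __name__ == "__main__":
    print(best_approx_match("GATTACA", "CCGATTACACC"))
    print(best_approx_match("GATTACA", "CCGATCACACC"))
```

This takes time proportional to the pattern length times the text length, which is fine for a toy example and hopeless against 22 terabytes. That gap is a big part of why the problem is hard at this scale.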