Slashdot Mirror


Genetic Database Hits One Billion Entries

ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."

6 of 189 comments

  1. 22TB is nothing. by Duncan3 · · Score: 3, Insightful

    Wow, that's almost 12U of rack space. Oh my *yawn*

    Now the fact that that's all genetic data, that's amazing considering a human is only ~1GB so 22,000 humans worth.

    --
    - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
  2. So tired. So very, very tired. Of that. by ScentCone · · Score: 5, Insightful

    If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.

    I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!

    Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.

    --
    Don't disappoint your bird dog. Go to the range.
  3. How big compressed? by Mr_Tulip · · Score: 3, Insightful

    I mean, most of that data is just redundant pairs of A-G C-T T-G etc...

    I reckon you could zip it up and it'll fit on a couple of floppy disks.
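
How far a general-purpose compressor gets depends on how repetitive the sequence really is. The sketch below is only an illustration: it assumes one ASCII character per base (the Archive's actual storage format is not described in this thread), and `random_bases` and `compression_ratio` are hypothetical helper names. It compares zlib on a purely repetitive string with zlib on uniformly random bases, which cannot be squeezed below about 2 bits (a quarter of a byte) per base.

```python
import random
import zlib


def random_bases(n, seed=0):
    """Return n pseudo-random bases drawn uniformly from A, C, G, T."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


def compression_ratio(seq):
    """Compressed size divided by raw size, at one ASCII byte per base."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    # Highly repetitive input shrinks to almost nothing.
    print("repeats:", compression_ratio("ACGT" * 25_000))
    # Random input stays above the 2-bits-per-base floor (ratio 0.25).
    print("random: ", compression_ratio(random_bases(100_000)))
```

Real genomes sit somewhere between these two extremes, so "a couple of floppy disks" holds only for the repetitive case.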

  4. The amazing thing is how SMALL it is. by sbaker · · Score: 5, Insightful

    All this hype about how vastly much paper you get if you print it all out misses the wonder of the thing.

    The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.

    The human genome is 3 billion base pairs...each base pair is one of only four possibilities - so two bits each. 750 Megabytes...that's one CD-ROM. There is a lot of redundancy in it too - many of those base pairs are never 'expressed' as proteins, many are replicated redundantly dozens of times. So with compression, or even just deleting the junk - you'd get it down to maybe 100 to 200 megs - tops.
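
The two-bits-per-base arithmetic above can be checked with a small sketch. The mapping of A, C, G, T to the codes 0 to 3 and the names `pack`, `unpack` and `packed_size_bytes` are assumptions made for illustration, not a standard file format.

```python
# Assumed 2-bit code for each base (any fixed assignment would do).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of bases into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n):
    """Recover the first n bases from data produced by pack()."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


def packed_size_bytes(n_bases):
    """Bytes needed for n_bases at two bits each, rounded up."""
    return (n_bases * 2 + 7) // 8


if __name__ == "__main__":
    seq = "GATTACA"
    packed = pack(seq)
    print(len(packed), "bytes for", len(seq), "bases")
    print(unpack(packed, len(seq)))
    # 3 billion base pairs at 2 bits each, in megabytes.
    print(packed_size_bytes(3_000_000_000) / 1e6, "MB")
```

With 3 billion base pairs this gives 750,000,000 bytes, matching the 750 Megabyte figure before any compression.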

    I find it utterly amazing that all that complexity is so amazingly compactly encoded.

    Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.

    Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.

    --
    www.sjbaker.org
  5. Re:For God's sake, don't print it! by Firehed · · Score: 3, Insightful
    Right. Because hard drives aren't ever going to be made from today forward, and certainly won't get bigger in capacity.

    If I'm doing the math right, that would put the storage needed at about 25PB eight years from now (about ten doublings, which is 1024 times the current needs). Which is only 50,000 500GB drives. While that's certainly quite a lot, if the average hard drive holds even 10GB, times millions of computers just in the US, I think we're set. Seagate probably sells that much storage every week.
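
The projection can be reproduced from the summary's own figures (22 Terabytes, doubling every ten months). This is a rough sketch: it assumes the doubling rate holds steady and uses decimal units (1 TB = 1000 GB), and `projected_size_tb` and `drives_needed` are hypothetical helper names.

```python
import math


def projected_size_tb(current_tb, doubling_months, months_ahead):
    """Size after months_ahead months of steady exponential growth."""
    return current_tb * 2 ** (months_ahead / doubling_months)


def drives_needed(size_tb, drive_gb=500):
    """Number of drives of drive_gb capacity needed to hold size_tb."""
    return math.ceil(size_tb * 1000 / drive_gb)


if __name__ == "__main__":
    # Ten doublings at ten months each is 100 months, a bit over eight years.
    size = projected_size_tb(22, 10, 100)
    print(size, "TB")
    print(drives_needed(size), "drives of 500GB")
```

Ten doublings take 22 TB to 22,528 TB, roughly 22.5 petabytes, or about 45,000 drives of 500GB each.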

    I'm pretty sure we'll run out of species to map the genetic info of before we run out of space to store that info.

    Still, quite the accomplishment.

    --
    How are sites slashdotted when nobody reads TFAs?
  6. Re:Dubious claims by timeOday · · Score: 4, Insightful
    Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
    Let me interpret for you: it's a lot.

    What's even more lame is that 99% of the Slashdot comments on this article so far are stuck on units of measure. Clearly it's a lot. Instead of debating the length of a piece of string, how about some discussion on how to distribute and analyze so much data? At this point I'd almost welcome some grousing about patents or dumb Google DNA-related theories. We're barely scratching the surface on understanding genetic data. Even finding approximate substring matches within samples is fairly difficult. Here we have the world's biggest crossword puzzle, which encodes the secrets of life itself, and most of you guys are stuck on the point size of the font.
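
To give a feel for why approximate substring matching is hard at this scale, here is a minimal sketch of the textbook dynamic-programming approach, where a match may start anywhere in the text and errors are counted as edit distance. `approx_find` is a hypothetical name, and this brute-force scan is not how any particular sequence-search tool works.

```python
def approx_find(pattern, text, max_errors):
    """Return (end, distance) pairs: some substring of text ending at
    index end (exclusive) is within max_errors edits of pattern."""
    m = len(pattern)
    # col[i] = fewest edits to turn pattern[:i] into a substring of text
    # ending at the current position. col[0] stays 0 throughout, because
    # a match is allowed to start anywhere in the text.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # pattern character missing
            prev_diag = old
        if col[m] <= max_errors:
            hits.append((j + 1, col[m]))
    return hits


if __name__ == "__main__":
    print(approx_find("GATTACA", "CCGATTACACC", 0))
    print(approx_find("GATTACA", "CCGATCACACC", 1))
```

The running time is proportional to len(pattern) times len(text), which is fine for short strings but hopeless as a direct scan over 22 Terabytes. That cost is the reason searching at this scale needs indexing or heuristics.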