Genetic Database Hits One Billion Entries


Posted by ScuttleMonkey on Tuesday January 17, 2006 @02:39PM from the start-the-recount dept.

ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."


For God's sake, don't print it! by BadAnalogyGuy · 2006-01-17 14:43 · Score: 5, Funny

Some dumbass is always printing 300 pages of documents and hogging the printer. Forchrissakes, just figure out what pages you need and print those! Asshole.

The amount of data here is really enormous. To put it in perspective, if you lined up 7143 blondes, the number of strands of hair present would approximately equal the number of entries in this database.
1. Re:For God's sake, don't print it! by Anonymous Coward · 2006-01-17 14:53 · Score: 4, Funny
  
  I love those things: "To put this in perspective, here's another image or figure that won't fit in the human mind either." They always clear those huge numbers right up for me.
  
  At least your name is "BadAnalogyGuy", which gives you a better excuse than the story submitter.
2. Re:For God's sake, don't print it! by margaret · 2006-01-17 15:00 · Score: 5, Funny
  
  Some dumbass is always printing 300 pages of documents and hogging the printer. Forchrissakes, just figure out what pages you need and print those! Asshole.
  
  Like when I was in grad school, I remember our IT guy was hopping mad because he had to come in on a Sunday to reboot the server because some dumbass decided to print the entire mouse chromosome 22 sequence. Something about a spool file and crashing his server...
i love meaningless data by JeanBaptiste · 2006-01-17 14:44 · Score: 5, Funny

"To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. "

I have twice that much data on my 128k thumbdrive, if printed out in 72 point font size.

Anyone care to translate this into volkswagens, or libraries of congress?
1. Re:i love meaningless data by Frogbert · 2006-01-17 15:08 · Score: 5, Funny
  
  No, but to put it in some perspective: it would take over 6 minutes for a Japanese schoolgirl to type it all out on her phone.
2. Re:i love meaningless data by Brent+Spiner · 2006-01-17 15:16 · Score: 5, Funny
  
  If you choose a fixed-width font such as 12 point Courier, about 75 letters fit on a single line with half inch margins. This means that each letter is about 2.54 millimeters in length. The earth is 24900 miles in circumference, which means that it would take 15776640000 letters to stretch around the earth.
  
  If we take a 1967 Volkswagen to be a measurement of length then it is 1606.01 times larger than a single letter so it would take 9823500.48 Volkswagi to tailgate around the earth. Multiply that by 250 and you get ~ 2.455875x10^9 Volkswagens.
  
  Since it is quite easy to convert Volkswagens to Library of Congresses I won't go into further detail.
  
  --
  Reality test... am I dreaming?
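
The arithmetic in the Volkswagen reply above can be checked with a short script. This is only a sketch: it assumes US Letter paper (8.5 inches wide, which is what makes 75 letters fit between half inch margins at Courier's 10 letters per inch), and it takes the comment's figure of 1606.01 letters per Volkswagen at face value. All variable names are made up for illustration.

```python
# Re-run the back-of-envelope figures from the comment above.
LETTERS_PER_INCH = 10              # 12 point Courier: 75 letters across 7.5 printable inches
EARTH_CIRCUMFERENCE_MILES = 24_900
INCHES_PER_MILE = 5280 * 12
VW_LENGTH_IN_LETTERS = 1606.01     # the comment's figure for a 1967 Volkswagen (taken as given)
TIMES_AROUND_THE_WORLD = 250       # from the Sanger announcement

letter_width_mm = 25.4 / LETTERS_PER_INCH
letters_around_earth = EARTH_CIRCUMFERENCE_MILES * INCHES_PER_MILE * LETTERS_PER_INCH
vws_around_earth = letters_around_earth / VW_LENGTH_IN_LETTERS
vws_for_archive = vws_around_earth * TIMES_AROUND_THE_WORLD

print(f"letter width:         {letter_width_mm:.2f} mm")
print(f"letters around earth: {letters_around_earth}")
print(f"VWs around earth:     {vws_around_earth:.2f}")
print(f"VWs for the archive:  {vws_for_archive:.6e}")
```

Under those assumptions the script lands on the same letter count and Volkswagen totals as the comment.
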
If printed out... by MarkusQ · 2006-01-17 14:48 · Score: 5, Funny

if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest
Did anybody else think "Wow, I've got a great idea for a mural for the space elevator!"
Anybody?
Uh, well, it's late...
--MarkusQ
Torrent? by mendaliv · 2006-01-17 14:48 · Score: 5, Funny

Would somebody please torrent it?
So tired. So very, very tired. Of that. by ScentCone · 2006-01-17 14:54 · Score: 5, Insightful

If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.

I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!

Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.

--
Don't disappoint your bird dog. Go to the range.
1. Re:So tired. So very, very tired. Of that. by jmv · 2006-01-17 15:38 · Score: 4, Funny
  
  More nerdly examples, please.
  
  - It would require 100,000 liters of ink to write down all the 1's and 0's
  - It would take 400 years to transmit it over a 14.4 kbps modem
  * Requiring about 10 Giga Joules
  - If each bit was encoded on a single hydrogen atom, the whole db would weigh about 0.1 mg
  - If ones are transmitted as a single (infrared) photon, it would take 0.01 Joules to transmit the whole db
  * You could transmit it 100 times with the energy of a mouse trap
  - It would require about one year for a million monkeys to type it in (without having to guess)
  
  --
  Opus: the Swiss army knife of audio codec
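
For anyone who wants to check the modem and monkey figures in the list above, here is a rough sketch. It assumes 22 Terabytes means 22 x 10^12 bytes, that the modem sustains its full 14.4 kbps with no protocol overhead, and that one keystroke equals one byte; none of those assumptions come from the comment itself, and the function name is hypothetical.

```python
ARCHIVE_BYTES = 22 * 10**12          # 22 TB, decimal terabytes (assumption)
MODEM_BITS_PER_SECOND = 14_400       # 14.4 kbps, no overhead (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600
MONKEYS = 1_000_000


def transfer_years(size_bytes, bits_per_second):
    """Years needed to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / bits_per_second / SECONDS_PER_YEAR


modem_years = transfer_years(ARCHIVE_BYTES, MODEM_BITS_PER_SECOND)

# Keystrokes per second each monkey needs to finish in one year,
# assuming one keystroke per byte and nobody stops for bananas.
keystrokes_per_monkey_per_second = ARCHIVE_BYTES / MONKEYS / SECONDS_PER_YEAR

print(f"modem transfer: about {modem_years:.0f} years")
print(f"monkey typing rate: {keystrokes_per_monkey_per_second:.2f} keys/second")
```

With these inputs the modem estimate comes out a little under the 400 years quoted, and the monkeys need to average somewhat less than one keystroke per second.
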
I will be more impressed... by Stachybotris · 2006-01-17 15:00 · Score: 5, Informative

When we figure out what all of that does. For every organism as complex as or more complex than your average bacterium, there's a large amount of what amounts to filler DNA. Viruses don't have this problem, as few of them are large enough to even get by without overlapping reading frames. If you shrink this dataset down to only sequences that encode functional proteins (read: genes), there's still an insane amount of information. If you then remove the introns, the dataset gets even smaller. But of course, we don't really know if the introns and intergenic regions of DNA (the so-called 'junk DNA') have functions (or how many they have), although some do act as regulators of transcription.

Given that a change of just 1 base in 500 of the 16S rRNA gene is sufficient to differentiate between two different species of bacteria, I have to wonder how many of these entries are quasi-redundant. When you consider how many species of bacteria are known to man, that means that there are literally thousands of potential entries for each gene. Unless, of course, they're storing only consensus sequences, which still vary widely between genera.

Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'. Knowing the sequence for the Ubiquitin gene is all well and good, but it's of little practical importance. Being able to construct designer proteins to treat illnesses based on that information, however, is a truly worthy goal. Unfortunately, that's also where the 'patent it' part comes into play...
Re:How do they map their function? by AlanKilian · 2006-01-17 15:04 · Score: 5, Informative

From: http://www.learner.org/channel/courses/biology/textbook/genom/genom_7.html

A biological approach to determining the function of a gene is to create a mutation and then observe the effect of the mutation on the organism. This is called a knockout study. While it is not ethical to create knockout mutants in humans, many such mutants are already known, especially those that cause disease. One advantage of having a genome sequence is that it greatly facilitates the identification of genes in which mutations lead to a particular disease.

The mouse, where one can make and characterize knockout mutants, is an excellent model system for studying genetic diseases of humans; its genome is remarkably similar to a human's. Nearly all human genes have homologs in mice, and large regions of the chromosomes are very well conserved between the two species. In fact, human chromosomes can be (figuratively) cut into about 150 pieces, mixed and matched, and then reassembled into the 21 chromosomes of a mouse. Thus, it is possible to create mutants in mice to determine the probable function of the same genes in humans. Genetic stocks of mutant mice have been developed and maintained since the 1940s.

One goal of the mouse genome project is to make and characterize mutations in order to determine the function of every mouse gene. After a particular gene mutation has been linked to a particular disorder, the normal function of the gene may be determined. An example of this approach is the mutated gene that resulted in cleft palates in mice. The researchers found that the gene's normal function is to close the embryo's palate. An understanding of the genetics behind cleft palate in mice may one day be used to help prevent this common birth defect in humans.
Re:How do they map their function? by Stachybotris · 2006-01-17 15:11 · Score: 5, Informative

In most cases they work backwards. You start with a known protein, determine its amino acid sequence, and then convert that into the most likely DNA sequence (accounting for codon bias). Primers/probes are then generated for the 3' and 5' ends of the probable DNA sequence. If you're working with a small genome like that of a bacterium, you can perform a restriction digest to get random hunks of chromosome. These are then amplified via PCR using your designer primers. The final product is then sequenced.

In other cases you can create a gene knockout by splicing a random gene into your gene of interest. This causes your target gene to encode a non-functional protein. Then you watch and see what happens to the test subject. In some cases the creature dies because the gene turned out to be extremely important. In others it results in minor to significant impairment. But because of the complexity of most organisms, single-gene knockouts usually don't have too much effect - the creature has multiple pathways that can accomplish the same goal. This is especially true for critical functions like those in the immune system.
Re:22TB is nothing. by TheSpoom · 2006-01-17 15:36 · Score: 4, Funny

I'm pretty sure storing humans on your hard drive is illegal.

--
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
So what? by Anon.Pedant · 2006-01-17 15:54 · Score: 4, Funny

I'm not impressed. I already have genetic material all over my computer.

(Oops, did I just admit something bad?)
The amazing thing is how SMALL it is. by sbaker · 2006-01-17 16:08 · Score: 5, Insightful

All this hype about how vastly much paper you get if you print it all out misses the wonder of the thing.

The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.

The human genome is 3 billion base pairs...each base pair is one of only four possibilities - so two bits each. 750 Megabytes...that's one CD-ROM. There is a lot of redundancy in it too - many of those base pairs are never 'expressed' as proteins, many are replicated redundantly dozens of times. So with compression, or even just deleting the junk - you'd get it down to maybe 100 to 200 megs - tops.
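
A minimal sketch of that two-bits-per-base arithmetic, for illustration only. The A/C/G/T to bit-pair mapping below is an arbitrary choice, not a standard, and the function names are made up; real formats also have to handle ambiguous bases such as N, which this ignores.

```python
# Arbitrary illustrative mapping: any assignment of the four bases
# to the four 2-bit values would work equally well.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(sequence: str) -> bytes:
    """Pack an uppercase A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        value = 0
        for base in chunk:
            value = (value << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so every byte uses the same layout.
        value <<= 2 * (4 - len(chunk))
        packed.append(value)
    return bytes(packed)


def genome_megabytes(base_pairs: int) -> float:
    """Size in decimal megabytes at two bits per base pair."""
    return base_pairs * 2 / 8 / 1_000_000


print(len(pack_bases("ACGTACGTAC")), "bytes for 10 bases")
print(genome_megabytes(3_000_000_000), "MB for 3 billion base pairs")
```

Note that the caller has to remember the sequence length separately, since a padded final byte looks the same as trailing A's under this mapping.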

I find it utterly amazing that all that complexity is so amazingly compactly encoded.

Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.

Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.

--
www.sjbaker.org
1. Re:The amazing thing is how SMALL it is. by The+Step+Child · 2006-01-17 17:06 · Score: 4, Interesting
  
  Just as amazing is that there are only about 25,000 protein coding genes in the entire human genome (though obviously there are more proteins possible through splicing and post-translational modification, but I digress). Also amazing is the precision with which the chromosomes wind up all that DNA. Imagine taking a piece of yarn miles and miles long and compacting it into something that could fit into a paper bag - now imagine someone asking you to take out a VERY specific piece of that yarn and exposing it from your roll, disturbing the rest of the yarn as little as possible, then putting it back exactly as it was before when they're finished with it...that's basically what each chromosome has to do when genes are expressed. And it's all mediated by proteins coded in that very DNA.
I've read the whole thing.. by tinrobot · 2006-01-17 16:27 · Score: 4, Funny

I won't give away the ending, but my favorite part is:

ctattggacttggaatcggatattggacacttggaatcggata
Re:Dubious claims by timeOday · 2006-01-17 16:43 · Score: 4, Insightful

Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
Let me interpret for you: it's a lot.
What's incredibly more lame is that 99% of the Slashdot comments on this article so far are stuck on units of measure. Clearly it's a lot. Instead of debating the length of a piece of string, how about some discussion on how to distribute and analyze so much data? At this point I'd almost welcome some grousing about patents or dumb Google DNA-related theories. We're barely scratching the surface on understanding genetic data. Even finding approximate substring matches within samples is fairly difficult. Here we have the world's biggest crossword puzzle which encodes the secrets of life itself and most of you guys are stuck on the point size of the font.
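
To give a feel for the approximate substring matching mentioned above, here is a deliberately naive sketch that allows substitutions only (a fixed number of mismatches, no insertions or deletions). It is not how real sequence search tools work, which rely on indexing and alignment heuristics to cope with data at this scale; the function name and parameters are hypothetical.

```python
def find_approximate(text, pattern, max_mismatches):
    """Return start offsets where pattern matches text with at most
    max_mismatches substituted characters (no insertions or deletions)."""
    hits = []
    width = len(pattern)
    for start in range(len(text) - width + 1):
        mismatches = 0
        for a, b in zip(text[start:start + width], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this offset
        else:
            # The inner loop finished without breaking: this offset is a hit.
            hits.append(start)
    return hits


sample = "ctattggacttggaatcggatattggacacttggaatcggata"
print(find_approximate(sample, "ttggaatcgg", 0))
print(find_approximate(sample, "ttggaatcgg", 1))
```

This runs in time proportional to the text length times the pattern length, which is fine for a 43-letter joke and hopeless for 22 Terabytes, which is rather the point of the comment.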