Genetic Database Hits One Billion Entries
ChocSnorfler writes to tell us that the Sanger Institute is reporting that their Genetic Record Database has hit one billion entries, making it the world's largest. From the announcement: "The Trace Archive is a store of all the sequence data produced and published by the world scientific community, including the Sanger Institute's own prodigious output as a world-leading genomics institution. To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. The Archive is 22 Terabytes in size and doubling every ten months."
genetic information of organisms - mice, fish, flies, bacteria and, of course, humans... All the data are freely available to the world scientific community (http://trace.ensembl.org/) Sweet, now I can finally build myself that fleet of flying super monkeys I've always wanted!
Some dumbass is always printing 300 pages of documents and hogging the printer. Forchrissakes, just figure out what pages you need and print those! Asshole.
The amount of data here is really enormous. To put it in perspective, if you lined up 7143 blondes, the number of strands of hair present would approximately equal the number of entries in this database.
I could make this sentence wrap around the world a zillion times if I used 10^100 point text.
"To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest. "
I have twice that much data on my 128k thumbdrive, if printed out in 72 point font size.
Anyone care to translate this into volkswagens, or libraries of congress?
Wow, that's almost 12U of rack space. Oh my *yawn*
Now the fact that that's all genetic data, that's amazing considering a human is only ~1GB so 22,000 humans worth.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.
Such claims should be taken with a grain of salt until they reveal what fonts and point sizes they use.
"...every ten MINUTES." Imagine we'd look like the Ferengi with loads of teeth and slick heads.
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
if it were printed out as a single line of text, it would stretch around the world more than 250 times. Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest
Did anybody else think "Wow, I've got a great idea for a mural for the space elevator!"
Anybody?
Uh, well, it's late...
--MarkusQ
Would somebody please torrent it?
This enormous archive will devour us all.. ARGHH!
-AlexC
This is a real question...
How do the scientists do that?
Do they wiggle this gene and see what happens?
How do they go for the "scientific method" of experimentation?
...is in LoC's (Libraries of Congress).
He referenced A4 paper, so he's obviously not in the U.S. They use the metric system overseas.
If we stacked up all of the useless length metaphors/comparisons from end to end, they'd still add up to a non-useful mental image of a billion genetic records.
I mean, "printed out as a single line of text, it would stretch around the world more than 250 times" means what, in terms of helping us picture this? I take it that we're not supposed to be able to imagine a billion records, but we can all clearly picture some text wrapped around the planet 250 times? Ah, that's much more helpful!
Now, I just got done re-indexing 10 million records in a database, and I can sort of picture 100 times that much work. This is slashdot! More nerdly examples, please.
Don't disappoint your bird dog. Go to the range.
When we figure out what all of that does. For every organism as or more complex than your average bacterium, there's a large amount of what amounts to filler DNA. Viruses don't have this problem, as few of them are large enough to even get by without overlapping reading frames. If you shrink this dataset down to only sequences that encode functional proteins (read: genes), there's still an insane amount of information. If you then remove the introns, the dataset gets even smaller. But of course, we don't really know if the introns and intergenic regions of DNA (the so-called 'junk DNA') have functions (or how many they have), although some do act as regulators of transcription.
Given that a change of just 1 base in 500 of the 16S rRNA gene is sufficient to differentiate between two different species of bacteria, I have to wonder how many of these entries are quasi-redundant. When you consider how many species of bacteria are known to man, that means that there are literally thousands of potential entries for each gene. Unless, of course, they're storing only consensus sequences, which still vary widely between genera.
Sadly, the trend here seems to be more of 'sequence it, upload it, and patent it' instead of 'sequence it, upload it, figure out what it does/makes, do something useful with it'. Knowing the sequence for the Ubiquitin gene is all well and good, but it's of little practical importance. Being able to construct designer proteins to treat illnesses based on that information, however, is a truly worthy goal. Unfortunately, that's also where the 'patent it' part comes into play...
All well and good, but how many Libraries of Congress does 2.5 Mt Everest / A4 pages equal?
My calculator has no Mt Everest button.
(2,3-Benzopyrrole)
use my sequence generator:
ruby -e 'while 1; print "c a t g".split[(rand 4)]; end'
Just hit control-c when the sequence is long enough to suit you
Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.
You can't do that with ordinary A4 paper. You need to reinforce it on the sides at least so it won't tumble over. Plus, I doubt the paper would sit still with the high winds once it gets above a few thousand feet. Sheesh.
Pfft. I would be more impressed if it was all running on MSDE.
I don't keep a lid on my coffee so when I walk around I look busy -me
Something tells me a 22TB MS Access table just wouldn't cut it. :-P
This sig rocks the casbah.
Are they using the latest MSSQL 2005 beta 3?
Now I know who waits in line at 5am at CircuitCity to get the $40 after rebate HDDs! You should be ashamed CmdrTaco, come back when your measly 1/4 to 1/2 TB doubles every ten months.
1 billion entries = ~22 Terabytes
1 billion x 1,000 Bytes = ~0.9 Terabytes
Which means, on average, your genetic code can be stored in 22KB.
Just an interesting thought.
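For anyone who wants to redo the division above, here is a rough sketch. It assumes "22 Terabytes" means decimal terabytes, and it only gives the average size of one archive entry (one trace record), not the size of anyone's whole genome. The function name is made up for illustration.

```python
# Back-of-the-envelope check of the per-entry figure quoted above.
# Assumption: "22 Terabytes" is read as decimal terabytes (10**12 bytes).

TERABYTE = 10**12  # decimal; swap in 2**40 for binary tebibytes


def average_entry_bytes(total_bytes: int, entries: int) -> float:
    """Mean size of one archive entry, in bytes."""
    return total_bytes / entries


avg = average_entry_bytes(22 * TERABYTE, 10**9)
print(f"average entry: {avg / 1000:.0f} KB")
```

With the binary reading (2**40 bytes per terabyte) the average comes out a little over 24,000 bytes instead, so "about 22KB" holds either way as a ballpark.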
google.slashdot
I mean, most of that data is just redundant pairs of A-G C-T T-G etc...
I reckon you could zip it up and it'll fit on a couple of floppy disks.
dude, imagine how many chicks you are going to get.
I kid you not.
Please establish a hypertext link to this message. Spread the word!
As reported on /. the standard units of measurement are:
Football Fields in Length
Mt Everest in Height (even tho the avg person has no idea how tall it really is).
Olympic Sized Swimming Pools in Volume (which again the avg person has no idea)
Number of Chins in a Chinese phonebook (when talking about someone's momma).
\/\/3 pwn j00! \/\/3 g07 y0ur DN@ 0n 0ur d@7@b@$3 b17ch!
Anyone thought of the privacy issues of storing human DNA in a public database?
I am not a number, I refuse to be processed and let some strangers catalog my DNA into a public database.
Remember, Slashdot does not have a -1 disagree moderation, and no, troll, flamebait, and overrated are not substitutes.
How much data is this in comparison to the amount that google stores? Seems like google would be storing a lot more.
"Anyone care to translate this into volkswagens, or libraries of congress?"
How about "number of slashdot dupes"?
These people are obviously not aware that the standard unit of measurements for the press is Rhode Island and Texas. Without phrasing it in these units, I have no idea how much data that really is.
The world moves for love. It kneels before it in awe.
I had my gf read the summary of this article, and she promptly said, "Now that's bigger than Jesus!" :)
I don't respond to AC's.
idiot...
What a schtoopidttt analogy.
I'm not impressed. I already have genetic material all over my computer.
(Oops, did I just admit something bad?)
Printing it out on pages of A4 would produce a stack of paper two-and-a-half times as high as Mount Everest.
Tapping it out on morse code would take 10000 drummers 5 years!
Expressing it in smoke signals would burn 100 amazon rain forests!
Putting it in fortune cookies would require flour and sugar with the same approximate mass as the moon!
And sending it in semaphore would require every man, woman and child on the planet to signal nonstop with every flag ever made until the year 2010!
That's a lot of data.
And by standard, I mean: whatever MS Office defaults to
Diana Hacker's "A Writer's Reference" says the same thing.
/I'm not a grammar Nazi, I was forced to purchase it many years ago and have kept it handy ever since.
[Fuck Beta]
o0t!
"I made this half-pony half-monkey monster to please you,
But I get the feeling that you don't like it.
What's with all the screaming?
You like monkeys, you like ponies,
Maybe you don't like monsters so much?
Maybe I used too many monkeys;
Isn't it enough to know that I ruined a pony making a gift for you?"
-Skullcrusher Mountain by Jonathan Coulton
All this hype about how vastly much paper you get if you print it all out misses the wonder of the thing.
The wonder isn't how BIG the human genome is - the amazing thing is how *TINY* it is.
The human genome is 3 billion base pairs...each base pair is one of only four possibilities - so two bits each. 750 Megabytes...that's one CD-ROM. There is a lot of redundancy in it too - many of those base pairs are never 'expressed' as proteins, many are replicated redundantly dozens of times. So with compression, or even just deleting the junk - you'd get it down to maybe 100 to 200 megs - tops.
I find it utterly amazing that all that complexity is so amazingly compactly encoded.
Yeah - that's a lot of bits of paper - or 600 floppy disks or some other bullshit - but by the standards of modern media, it's MICROSCOPIC.
Announcements like this would do better to explain how LITTLE data this really is - that's the wonder of the thing.
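The two-bits-per-base arithmetic above can be sketched in a few lines. This is only an illustration: it assumes a clean sequence of A, C, G and T with no ambiguity codes such as N, and the names pack, unpack and packed_size_bytes are invented here, not taken from any real tool.

```python
# Minimal 2-bit packing sketch: four bases per byte.
# Assumption: input contains only A, C, G, T (no N or other ambiguity codes).

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 2 bits per base."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed for n_bases at 2 bits each, rounded up."""
    return (n_bases + 3) // 4


packed = pack("ctattggacttgg")
print(len(packed), unpack(packed, 13))
print(packed_size_bytes(3_000_000_000))  # 3 billion base pairs
```

The length has to be stored separately, because the last byte may be only partly filled.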
www.sjbaker.org
...the entire database would fit on just one sheet of A(-24) paper. (Yes, I actually did the math.)
There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
Why isn't this thread submitted by Beatles, is he slacking off?
I won't give away the ending, but my favorite part is:
ctattggacttggaatcggatattggacacttggaatcggata
Go FoxPro!
The Archive is 22 Terabytes in size and doubling every ten months.
Doubling every 10 months? I think hard drives are doing that as well, or damn close to it. A few years ago, 22 terabytes sounded like a lot, but these days, not so much. I've got half a terabyte in my server and another half in the other two computers in my home and if I didn't regularly burn stuff to DVD, I would have run out of space a long time ago. Terabytes just aren't what they used to be. Well, they are and they aren't.
The Archive is 22 Terabytes in size and doubling every ten months.
Wow, in a coupla years they'll need Google to help them store data
I love humanity, it is people I hate
All your base (pairs) belong to us.
Got a torrent?
I want to print it out to read off screen...
ha, i read the title as "Generic database hits 1 billion entries" and i was wondering for awhile what all this talk about genetics was... oops! lol
To grasp how much data is in the Archive, if it were printed out as a single line of text, it would stretch around the world more than 250 times.
Not at a 100 million DPI it won't.
The Archive is 22 Terabytes in size and doubling every ten months.
I doubt that. Surely that means by the end of the day it will be:
22 * 2^144 Terabytes = 5*10^44 Terabytes
in size.....I don't even know what you call that!
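A quick sketch of the joke arithmetic above, assuming the deliberate misreading of "ten months" as "ten minutes", which gives 144 doublings in a day. The helper name size_after is made up for illustration.

```python
# Growth under repeated doubling, counting only completed doubling periods.
# Assumption: the joke's premise of one doubling every ten MINUTES.


def size_after(start_tb: int, doubling_minutes: int, elapsed_minutes: int) -> int:
    """Size in terabytes after the given time."""
    doublings = elapsed_minutes // doubling_minutes
    return start_tb * 2**doublings


one_day = 24 * 60  # minutes
size = size_after(22, 10, one_day)
print(f"{size:.1e} terabytes")
```

The result is a shade under 5*10^44 terabytes, so the figure above is a fair rounding.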
Now, if you want to do something really cool with that database, you'd blast it against itself using no repeat masking. Or just blast it against the repeats database :-)
---- Take the Space Quiz!
Here are the 'official' calculations....
The 1 billion traces equates to 800 billion letters of genetic information.
70*50 is a solid page in Times New Roman 12-point font == 3,500 characters
100 sheets is 1 cm high = 350,000 letters
800,000,000,000/350,000 = 2,285,714.29 cm
So the stack of paper would be 22,857 m high
About 22.9 kilometers.
Mount Everest is 8.848 km high.
So the stack of paper would be 2 1/2 times the height of Everest.
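The same calculation as a small script, in case anyone wants to try other fonts or paper weights. It assumes single-sided printing and uses the figures from this comment as defaults; the function name is hypothetical.

```python
# Paper-stack estimate using the figures quoted in the comment above.
# Assumptions: single-sided pages, 70 x 50 characters per page,
# and 100 sheets per centimetre of stack height.

EVEREST_KM = 8.848


def stack_height_km(letters: int, chars_per_page: int = 70 * 50,
                    sheets_per_cm: int = 100) -> float:
    """Height of the printed stack in kilometres."""
    pages = letters / chars_per_page
    height_cm = pages / sheets_per_cm
    return height_cm / 100_000  # 100,000 cm per km


height = stack_height_km(800_000_000_000)
print(f"{height:.1f} km, {height / EVEREST_KM:.2f} Everests")
```

Doubling the characters per page or printing on both sides halves the stack, which is why the font and point size matter so much to the comparison.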
If those are the full sequences, and biotechnology evolves enough that we can build the full sequence out of digital data...
Woa.. Just imagine the possibilities.
We won't have to feel guilty for extinct species anymore!
PS.: Anyone wanna join my safari party next weekend?
The word "The" printed out as a single line could stretch around the world two hundred and fifty-one times, given a sufficiently large font.
While that is crazy, it begs the question, are they thinking in points? 10? 11? 12? 72? Why didn't that say 500 times? 1000 times? a million times?
Is there an rfc for this specification of measurement? Can I order things in 'printed word lengths around the world'?
Can I measure my penis with this?
Does google calculator support this?
I shot the sheriff but I sold the deputy some SCO licenses.
please type the word in this image: sheriff
random letters - if you are visually impaired, please email us at pater@slashdot.org
Hello, visually impaired. I hope you are reading this, either in a large font or some braille device.
Did you email fatboy slim, I mean cowboy neal about this CAPTCHA? What did the tellytubby do about it?
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
I'm pretty sure that gene information is copyrighted, and the whole project should be canned before some association takes up suing kids for looking up information for science class.
Rather than go through the entire process you outline, one could avoid a great deal of the wet work by sequencing the protein and then jumping into computer space, searching the genome database for hits.
This assumes your organism of study has been sequenced, but that isn't uncommon for a number of reasons.
Damn you.
The signal data is composed of peaks and troughs across 4 channels, corresponding to the 4 base types. A peak in a channel corresponds to a base of that type passing in front of the detector. In your typical sampling configuration, a peak is made up of about 12 data pts.
Now, since each sampled point in the signal is stored as a 4 byte int and the base for that peak is stored as a 1 byte char, you've got 12 points x 4 channels x 4 bytes = 192 bytes of signal per base, basically a 192:1 ratio of technically superfluous signal data to actual DNA sequence.
Since there are yet other pieces of information in the file, this ratio is actually larger.
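The 192:1 figure is just a product of the numbers given above, which a short sketch makes explicit. The defaults are the values from this comment; the function name is made up, and it ignores headers and other per-file metadata.

```python
# Signal-to-sequence size ratio for one called base.
# Assumptions: ~12 sample points per peak, 4 channels (one per base type),
# 4-byte integer samples, and 1 byte for the called base itself.


def signal_to_base_ratio(points_per_peak: int = 12, channels: int = 4,
                         bytes_per_sample: int = 4,
                         bytes_per_base: int = 1) -> float:
    """Bytes of raw signal stored per byte of called sequence."""
    signal_bytes = points_per_peak * channels * bytes_per_sample
    return signal_bytes / bytes_per_base


print(f"{signal_to_base_ratio():.0f}:1")
```

Storing samples as 2-byte integers instead would halve the ratio, so the exact figure depends on the file format.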
Of course, there is a good reason for keeping trace data rather than just the DNA sequences: you have more information with which to validate the integrity of what you've done. There have been cases where scientific databases have had their data integrity damaged over time by low-quality submissions (i.e. mistakes).
In this case, they're retaining the wrong file type, as it doesn't store the original unfiltered data signal, only a heavily filtered and manipulated one. Most modern basecallers start from the original unfiltered data to gain more advantage through better processing; you cannot do this with the file type they are retaining.
you now get...
TCGGAGACCAAGGCAAGGAAGCA...
Mostly human, better watch this one, he might do something soon...
www.sanger.ac.uk/ - 1.2Tb - Feb 17, 1965 - Cached - Similar pages - Remove result
AGGCATCGATCAGTCAAGTCAACA...
Bad speller. Looks kind of odd in daylight...
www.sanger.ac.uk/ - 1.2Tb - Jan 10, 1970 - Cached - Similar pages - Remove result
CCGGTGACCAAGGTAAGGATGCA...
Beneath the Digg comments IQ threshold, mostly harmless
www.sanger.ac.uk/ - 14k - Sept 11, 1982 - Cached - Similar pages - Remove result
Try your search again on Google Book Search
Gooooooooooooooogle >
Result Page: 1 2 3 4 5 6 7 8 9 10 Next
(you are meant to smile btw)
I can't confirm this, maybe someone can, though. I had an Oracle training course last year and the instructor told us she had someone from Sanger working on the human genome stuff, and their database was something daft like 2 columns wide. It was used in an example to explain the intricacies of hot backups and such.
Interesting if it's true!
it kinda makes me sad that this is considered a lot (with something useful in science) when AOL has petabytes of AIM logs sitting at their server farms. sad indeed.
The data doubles every 10 months; computing power doubles every 18 months. We're going to hit a problem sooner or later...
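To put a rough number on "sooner or later", here is a sketch that compares the two doubling rates. It assumes both trends are clean exponentials that hold indefinitely, which is a big assumption, and the function names are invented for illustration.

```python
# How fast data outgrows compute if both keep doubling at fixed intervals.
# Assumptions: data doubles every 10 months, compute every 18 months.


def growth(months: float, doubling_months: float) -> float:
    """Growth factor after `months` with the given doubling period."""
    return 2 ** (months / doubling_months)


def gap(months: float, data_doubling: float = 10,
        compute_doubling: float = 18) -> float:
    """Ratio of data growth to compute growth after `months`."""
    return growth(months, data_doubling) / growth(months, compute_doubling)


def gap_doubling_months(data_doubling: float = 10,
                        compute_doubling: float = 18) -> float:
    """Months for the data/compute gap itself to double."""
    return 1 / (1 / data_doubling - 1 / compute_doubling)


print(gap(90))                # gap after 7.5 years
print(gap_doubling_months())  # months per doubling of the gap
```

Under these assumptions the gap doubles roughly every 22.5 months, so after 90 months there is about 16 times more data per unit of computing power.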
...can you imagine how much it would cost to have it bound?
Really, though, they should come up with a better comparison. "If burned to CD, it would take half as many CDs as AOL sends out in a year".
I'm pretty undecided as to what I should think of this project.
A sort of "Opensource genetics" organisation seems like a good idea at first. The fact that information likely to help researchers is made public is quite a good thing in my view, be it data about genes, the 1958 census of the Uzbek population, or about how many people in Uzbekistan wear jeans.
At least, this is far less freaky than a biotech company getting an "exclusive contract" from the Icelandic parliament to get access to the centralized database of all the Icelandic peoples' genealogical, genetic, and personal medical information. (See details here: http://www.actionbioscience.org/genomic/hlodan.ht
Yet, the information published by the Sanger Institute seems to be used mainly by private firms (Quote: "Dotcoms are responsible for about 80% of download each week"). I just wonder whether the Institute assesses these firms' goals before letting them download the data. I wouldn't be too glad to learn that they gave it to companies using genetic engineering for purposes other than medical.
see this free access paper.
Finally, my search for the chosen one might yield something tangible! The Rambaldi device will be MINE! MU-HAHAHAHA!
how many Texases could one wallpaper with the sheets from this hypothetical printout? Anyone? It's a challenge.
ouch... that hurt... that was the funniest thing I've read in a long time... a fleet of flying super monkeys... rotflmao!!!!
SELECT * FROM GENOME;
Most of the sequence in the trace archive is from large genome sequencing projects where we intentionally oversample genomes to 6-10x or more. Also, each trace, while averaging 864 characters, contains only about 600 bp of real data. What this means is that the trace archive currently represents about 60 billion basepairs of unique sequence. In human genome numbers, that's 20 genomes' worth of data. The approximate output in traces right now is at most about 30 million per month, so the 1 billion traces represent 33 months at the current output of the world's sequencing centers. As the trace archive was started after the human genome project, most of the traces related to the human genome aren't in the repository. It is not difficult to produce massive amounts of sequencing data; the trick is in turning it into something that one can use to answer scientific questions.
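These estimates can be reproduced with a short sketch. It assumes the upper end of the oversampling range (10x) and a human genome of roughly 3 billion base pairs; the function names are hypothetical.

```python
# Rough reproduction of the estimates in the comment above.
# Assumptions: 600 usable bp per trace, 10x oversampling,
# 30 million traces per month, ~3 billion bp per human genome.

HUMAN_GENOME_BP = 3_000_000_000


def unique_basepairs(traces: int, usable_bp_per_trace: int = 600,
                     coverage: int = 10) -> int:
    """Approximate unique sequence once oversampling is divided out."""
    return traces * usable_bp_per_trace // coverage


def months_of_output(traces: int, traces_per_month: int = 30_000_000) -> float:
    """Months of worldwide sequencing output the archive represents."""
    return traces / traces_per_month


traces = 1_000_000_000
unique = unique_basepairs(traces)
print(unique)                              # unique base pairs
print(unique / HUMAN_GENOME_BP)            # human-genome equivalents
print(round(months_of_output(traces), 1))  # months at current output
```

At 6x coverage instead of 10x, the same traces would correspond to about 100 billion unique base pairs, so the 60 billion figure is the conservative end of the range.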