Genome Researchers Have Too Much Data

Posted by Soulskill on Friday December 2, 2011 @07:28AM from the we-should-try-storing-it-in-dna dept.

An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"

7 of 239 comments

Re:Time for the scientists to get to work by BagOBones · 2011-12-02 07:38 · Score: 5, Insightful

Research team finds important role for junk DNA
http://www.princeton.edu/main/news/archive/S24/28/32C04/
Except in the field of DNA, they still don't know what is and is not important.

--
EA David Gardner -"... but the consumers have proven that actually what they want is fun."
Re:Wrong problem by bugs2squash · 2011-12-02 07:40 · Score: 5, Funny

If only they had some kind of small living cell it could be stored in...

--
Nullius in verba
as a genome researcher by ecorona · 2011-12-02 07:44 · Score: 5, Informative

As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most out of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8-core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.
Re:Wrong problem by TooMuchToDo · 2011-12-02 07:51 · Score: 5, Informative

Genomes have *a lot* of redundant data across multiple genomes. It's not hard to do de-duplication and compression when you're storing multiple genomes in the same storage system.
Wikipedia seems to agree with me:
http://en.wikipedia.org/wiki/Human_genome#Information_content

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
Disclaimer: I have worked on genome data storage and analysis projects.
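To make the arithmetic in that quote concrete, here is a minimal Python sketch. It assumes a sequence containing only A, C, G and T (no N or ambiguity codes) and treats variation as same-length substitutions only; real genomes also have insertions and deletions. The names pack, unpack, diff, apply_diff and packed_megabytes are made up for illustration and do not come from any genomics library.

```python
# Illustrative sketch only: hypothetical helper names, not a real genomics API.
# Assumes the sequence contains only the four bases A, C, G, T.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a base string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n):
    """Recover the first n bases from bytes produced by pack()."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


def diff(reference, sample):
    """List (position, base) pairs where sample differs from reference.

    Assumption: both strings have the same length (substitutions only).
    """
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference, variants):
    """Rebuild the sample from the reference plus its list of differences."""
    seq = list(reference)
    for i, base in variants:
        seq[i] = base
    return "".join(seq)


def packed_megabytes(base_pairs):
    """Size in megabytes (10**6 bytes) at 2 bits per base pair."""
    return base_pairs * 2 / 8 / 1e6


if __name__ == "__main__":
    print(packed_megabytes(2_900_000_000))   # size of 2.9 billion base pairs
    print(diff("GATTACA", "GATCACA"))        # only the differing position is kept
```

The first function pair covers the "2 bits per base pair" part; diff and apply_diff are a toy version of storing one genome as its differences from a shared reference, which is where the much smaller per-genome figure would come from.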
Re:Wrong problem by GAATTC · 2011-12-02 08:00 · Score: 5, Informative

Nope - the bottleneck is largely analysis. While the volume of the data is sometimes annoying in terms of not being able to attach whole data files to emails (19GB for a single 100bp flow cell lane from a HiSeq2000), it is not an intellectually hard problem to solve and it really doesn't contribute significantly to the cost of doing these experiments (compared to people's salaries). The intellectually hard problem has nothing to do with data storage. As the article states, "The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data." We just finished up generating and annotating a de novo transcriptome (sequences of all of the expressed genes in an organism without a reference genome). Sequencing took 5 days and cost ~$1600. Analysis is going on 4 months and has taken at least one man-year at this point, and there is still plenty of analysis to go.
Re:Wrong problem by StikyPad · 2011-12-02 08:05 · Score: 5, Funny

Warning: Monkeying with lossy compression for human genomic data may lead to monkeys.

--
https://www.eff.org/https-everywhere
Re:Last post by NFN_NLN · 2011-12-02 08:18 · Score: 5, Funny

"'There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
Perhaps they can come up with a new type of storage mechanism modeled after nature. They could store this data in tight helical structures and instead of base 2 use base 4.