Genome Researchers Have Too Much Data

Posted by Soulskill on Friday December 2, 2011 @07:28AM from the we-should-try-storing-it-in-dna dept.

An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"

Wrong problem by sunderland56 · 2011-12-02 07:30 · Score: 4, Interesting

They don't have too much data, they have insufficient affordable storage.
1. Re:Wrong problem by bugs2squash · 2011-12-02 07:40 · Score: 5, Funny
  
  If only they had some kind of small living cell it could be stored in...
  
  --
  Nullius in verba
2. Re:Wrong problem by jacoby · 2011-12-02 07:43 · Score: 4, Insightful
  
  Yes and no. It isn't just storage. What we have comes off the sequencers as TIFFs first, and after the first analysis we toss the TIFFs to free up some big space. But that's just the first analysis, and we go to machines with kilo-cores and TBs of memory in multiple modes, and many of our tools are not yet written to be threaded.
3. Re:Wrong problem by TooMuchToDo · 2011-12-02 07:51 · Score: 5, Informative
  
  Genomes have *a lot* of redundant data across multiple genomes. It's not hard to do de-duplication and compression when you're storing multiple genomes in the same storage system.
  Wikipedia seems to agree with me:
  http://en.wikipedia.org/wiki/Human_genome#Information_content
  
  The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
  Disclaimer: I have worked on genome data storage and analysis projects.
4. Re:Wrong problem by GAATTC · 2011-12-02 08:00 · Score: 5, Informative
  
  Nope - the bottleneck is largely analysis. While the volume of the data is sometimes annoying in terms of not being able to attach whole data files to emails (19GB for a single 100bp flow cell lane from a HiSeq2000) it is not an intellectually hard problem to solve and it really doesn't contribute significantly to the cost of doing these experiments (compared to people's salaries). The intellectually hard problem has nothing to do with data storage. As the article states "The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.". We just finished up generating and annotating a de novo transcriptome (sequences of all of the expressed genes in an organism without a reference genome). Sequencing took 5 days and cost ~$1600. Analysis is going on 4 months and has taken at least one man year at this point and there is still plenty of analysis to go.
5. Re:Wrong problem by StikyPad · 2011-12-02 08:05 · Score: 5, Funny
  
  Warning: Monkeying with lossy compression for human genomic data may lead to monkeys.
  
  --
  https://www.eff.org/https-everywhere
6. Re:Wrong problem by Anonymous Coward · 2011-12-02 08:11 · Score: 3, Informative
  
  It's not lossy compression.
  You store the first human's genome exactly. Then you store the second as a bitmask of the first -- 1 if it matches, 0 if it doesn't. You'll have 99% 1's and 1% 0's. You then compress this.
  Of course it's more complicated than this due to alignment issues, etc, but this need not be lossy compression
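The two ideas in the replies above (2 bits per base, then storing a second genome as a match mask against the first) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: sequences contain only A, C, G and T, both genomes are already aligned and of equal length, and zlib stands in for whatever compressor a real system would use. All function names are hypothetical. A mask alone is not enough to rebuild the second genome, so the sketch also stores the bases at the mismatching positions.

```python
import zlib

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def packed_size_bytes(num_bases: int) -> int:
    """Bytes needed to store num_bases at 2 bits per base, rounded up."""
    return (num_bases * 2 + 7) // 8


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray(packed_size_bytes(len(seq)))
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(data: bytes, num_bases: int) -> str:
    """Inverse of pack_2bit; num_bases is needed because of padding."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 3]
        for i in range(num_bases)
    )


def encode_against_reference(ref: str, seq: str) -> bytes:
    """Store seq as a match mask (1 = same as ref) plus the differing bases.

    Assumption: ref and seq are pre-aligned and equally long, which real
    genomes are not (insertions and deletions are ignored here).
    """
    if len(ref) != len(seq):
        raise ValueError("sketch only handles equal-length sequences")
    mask = bytearray((len(seq) + 7) // 8)
    diffs = []
    for i, (r, s) in enumerate(zip(ref, seq)):
        if r == s:
            mask[i // 8] |= 1 << (i % 8)
        else:
            diffs.append(s)
    payload = bytes(mask) + "".join(diffs).encode("ascii")
    return zlib.compress(payload, 9)


def decode_against_reference(ref: str, blob: bytes) -> str:
    """Rebuild the sequence exactly from the reference and the blob."""
    payload = zlib.decompress(blob)
    mask_len = (len(ref) + 7) // 8
    mask = payload[:mask_len]
    diffs = iter(payload[mask_len:].decode("ascii"))
    return "".join(
        r if (mask[i // 8] >> (i % 8)) & 1 else next(diffs)
        for i, r in enumerate(ref)
    )
```

With these definitions, `packed_size_bytes(2_900_000_000)` gives 725,000,000 bytes, which matches the 725 MB figure quoted from Wikipedia. The round trip through `encode_against_reference` and `decode_against_reference` is exact, so the scheme is lossless; how small the result gets on real genomes depends on alignment and on the compressor, which this sketch does not try to model.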
Nope by masternerdguy · 2011-12-02 07:32 · Score: 3, Insightful

No such thing as too much data on a scientific topic.

--
To offset political mods, replace Flamebait with Insightful.
Bad... by Ixne · 2011-12-02 07:33 · Score: 3, Insightful

Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.
1. Re:Bad... by Samantha+Wright · 2011-12-02 08:06 · Score: 3, Informative
  
  Although that isn't quite what we're talking about here, reductionism in biology has been an ongoing problem for decades. Traditional biochemists often reduce the system they're examining to simple gene-pair interactions, or perhaps a few components at once, and focus only on the disorders that can be succinctly described by them. That's why very small-scale issues like haemophilia and sickle-cell anaemia were sorted out so early on. As diseases with larger and more complex origins become more important, research and money are being directed toward them. Cancer has been by far the most powerful driving force in the quest to understand biology from a broader viewpoint, primarily because it's integrally linked to a very important, complicated process (cell replication) that involves hundreds if not thousands of genes, miRNAs, and proteins.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Time for the scientists to get to work by Hentes · 2011-12-02 07:35 · Score: 4, Insightful

Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and can distinguish between important stuff and noise.
1. Re:Time for the scientists to get to work by BagOBones · 2011-12-02 07:38 · Score: 5, Insightful
  
  Research team finds important role for junk DNA
  http://www.princeton.edu/main/news/archive/S24/28/32C04/
  Except in the field of DNA they still don't know what is and is not important.
  
  --
  EA David Gardner -"... but the consumers have proven that actually what they want is fun."
2. Re:Time for the scientists to get to work by sirlark · 2011-12-02 08:18 · Score: 4, Insightful
  
  A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.
  Flippant response: A good scientist doesn't delete his raw data...
  More sober response: Except to do an experiment said scientist might need a sequence. And that sequence needs to be stored somewhere, often in a publicly accessible database as per funding stipulations. And that sequence has literally gigabytes more information than he needs for his experiment, because he's only looking at part of the sequence. Consider also that sequencing a small genome may take a few days in the lab, but annotating can take weeks or even months of human time. And the sequence is just the tip of the iceberg, it doesn't tell us anything because we need to know how the genome is expressed, and how the expressed genes are regulated, and how they are modified after transcription, and how they are modified after translation, and how the proteins that translation forms interact with other proteins and sometimes with the DNA itself. Life is messy, and singling out stuff for targeted experimentation in the biosciences is a lot more difficult than in physics, and even chemistry.
  
  Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.
  Sequencing may be getting cheaper, but it's not so cheap that scientists facing funding cuts can afford to throw away data simply to recreate it. Also, DNA isn't the only thing that's sequenced or used. Proteins are notoriously hard to purify and sequence, and RNA can also be difficult to get in sufficient quantities. The only reason DNA is plentiful is because it's so easy to copy using PCR, but those copies are not necessarily perfect.
They should learn by hbar+squared · 2011-12-02 07:43 · Score: 4, Insightful

...from CERN. Sure, the Grid was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.
Is it .. by ackthpt · 2011-12-02 07:43 · Score: 3, Interesting

Is it outpacing their ability to file patents on genome sequences?

--

A feeling of having made the same mistake before: Deja Foobar
as a genome researcher by ecorona · 2011-12-02 07:44 · Score: 5, Informative

As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most out of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8 core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.
1. Re:as a genome researcher by Overzeetop · 2011-12-02 08:01 · Score: 4, Insightful
  
  It will come, but it doesn't make the wait less frustrating. I'm an aerospace engineer, and I remember building and preparing structural finite element models by hand on virtual "cards" (I'm not old enough to have used actual cards), and trying to plan my day around getting 2-3 alternate models complete so that I could run the simulations overnight. In the span of 5 years, I was building the models graphically on a PC, and runs were taking less than 30 minutes. Now, I can do models of foolish complexity and I fret when a run takes more than a minute, wondering if the computer has hung on a matrix inversion that isn't converging.
  You should, in some ways, feel lucky you weren't trying to do this twenty years ago. I understand your frustration, though.
  Just think - in twenty years, you'll be able to tell stories about hand coding optimizations and efficiencies to accommodate the computing power, as you describe to your intern why she's getting absolute garbage results from what looks like a very complete model of her project.
  
  --
  Is it just my observation, or are there way too many stupid people in the world?
Where does it all come from? by WaffleMonster · 2011-12-02 07:51 · Score: 3, Funny

I was under the impression the complete DNA sequence for a human can be stored on an ordinary CD.
Given the amount of data mentioned in TFA, it begs the question: what the hell are they sequencing? The genome of everyone on the planet?
Re:Work! by Anonymous Coward · 2011-12-02 07:58 · Score: 3, Funny

I see an opportunity for work, and jobs.
Wozniak. He is called Wozniak. But opportunity will have to wait, because Jobs is dead. Sorry to break it to you like this.
Come on, every story has an Apple angle, if you look at it the right way.. in fact, I bet those researchers could store all that data on an iPod if they wanted! You can plug it right in and sync with iTunes!
Drops in NGS Costs Outpacing Storage Costs by Anonymous Coward · 2011-12-02 08:02 · Score: 4, Informative

The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).
I'll reference an earlier /. post about this:
http://science.slashdot.org/story/11/03/06/1533249/graphs-show-costs-of-dna-sequencing-falling-fast
There are some solutions to the storage problems such as Goby (http://campagnelab.org/software/goby/) but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.
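To put the 10TB transfer in perspective, here is a rough back-of-the-envelope sketch. The link speeds are hypothetical examples, not figures from the post, and the calculation assumes the link is fully and continuously saturated, which real transfers rarely manage.

```python
def transfer_time_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move size_bytes over a fully saturated link."""
    return size_bytes * 8 / link_bits_per_second / 3600


TEN_TB = 10 * 10**12  # decimal terabytes, as storage vendors count them

# Hypothetical link speeds, chosen only for illustration.
for label, bps in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    hours = transfer_time_hours(TEN_TB, bps)
    print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions a 100 Mbit/s link needs more than nine days for a single 10TB data set, and even a dedicated 1 Gbit/s link needs most of a day.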
Re:ASCII storage? by Samantha+Wright · 2011-12-02 08:11 · Score: 3, Informative

ASCII storage of nucleotide and protein information is actually very standard. The most widespread format is called FASTA, named after the fast alignment program that introduced it. When you sequence a whole genome on a second-generation sequencing platform (like Illumina or SOLiD), there's a step in the process where you end up with a huge (10-100 GB) text file containing little puzzle pieces of DNA that must then be assembled by a specialized program. These files usually don't hang around very long, but the point of keeping them in this inefficient storage format is, simply, performance: CPUs are oriented toward byte-based computing at a minimum, and so frequent compression/decompression becomes prohibitively inefficient.
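For readers who have not seen FASTA: each record is a header line starting with `>` followed by one or more lines of sequence text, one ASCII character per base or residue. A minimal reader might look like the sketch below. The function name and the sample records are made up for illustration, and real-world files have quirks (comments, lowercase masking, very long records) that this does not handle.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical sample data, not taken from any real sequencing run.
sample = [">read1 demo", "ACGTAC", "GGTT", ">read2", "TTGACA"]
for name, seq in parse_fasta(sample):
    print(name, len(seq), seq)
```

Because every base is a plain byte, tools can slice and scan the text directly, which is the performance argument made above; the price is roughly four times the space of a 2-bit encoding.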

Big biotechnology purchases are typically hundreds of thousands of dollars though, so most labs are used to shelling out for this kind of price bracket.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Last post by NFN_NLN · 2011-12-02 08:18 · Score: 5, Funny

There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
Perhaps they can come up with a new type of storage mechanism modeled after nature. They could store this data in tight helical structures and instead of base 2 use base 4.
It's not the data by thisisauniqueid · 2011-12-02 09:04 · Score: 3, Insightful

It's not that there's too much data to store. There's too much to analyze. Storing 1M genomes is tractable today. Doing a pairwise comparison of 1M genomes requires half a trillion whole-genome comparisons. Even Google doesn't compute on that scale yet. (Disclaimer: I'm a postdoc in computational biology.)
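The "half a trillion" figure is just the number of unordered pairs, n(n-1)/2. A small sketch of the arithmetic follows; the one-second-per-comparison figure in the second function is a purely hypothetical placeholder, since the post gives no timing.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def pairwise_comparisons(num_genomes: int) -> int:
    """Number of unordered genome pairs: n * (n - 1) / 2."""
    return math.comb(num_genomes, 2)


def cpu_years(num_genomes: int, seconds_per_comparison: float) -> float:
    """Total single-core time in years, given an assumed cost per comparison."""
    return pairwise_comparisons(num_genomes) * seconds_per_comparison / SECONDS_PER_YEAR


print(pairwise_comparisons(1_000_000))  # 499999500000
# seconds_per_comparison=1.0 is an assumption made only for illustration.
print(round(cpu_years(1_000_000, 1.0)))
```

Even at an optimistic one second per whole-genome comparison, the total runs to thousands of CPU-years, so the work only becomes tractable with massive parallelism or with methods that avoid comparing every pair.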
Re:Last post by edremy · 2011-12-02 09:32 · Score: 4, Informative

The error rate is too high- data copying using that medium and the best available (naturally derived) technology makes an error roughly every 100,000 bases. There are existing correction routines, but far too much data is damaged on copy, even given the highly redundant coding tables.
Then again, it could be worse: you could use the single strand formulation. Error rates are far higher. This turns out to be a surprisingly effective strategy for organisms using it, although less so for the rest of us.

--
"Seven Deadly Sins? I thought it was a to-do list!"
TEDx Talk on the Subject by rockmuelle · 2011-12-02 10:00 · Score: 3, Informative

I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc
I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 M bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence data, which is roughly 45-90 GB of stored data for a single experiment.
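The coverage arithmetic above is simple enough to check in a few lines. The sketch assumes a 3-billion-base genome and one stored byte per base, which seems to be the rule of thumb behind the 45-90 GB estimate; formats that also carry quality scores and read names would need more. The function names are hypothetical.

```python
HUMAN_GENOME_BASES = 3_000_000_000  # "around 3 billion", per the post


def raw_bases(genome_size_bases: int, coverage: int) -> int:
    """Total bases sequenced when the genome is over-sampled coverage times."""
    return genome_size_bases * coverage


def stored_gigabytes(num_bases: int, bytes_per_base: float = 1.0) -> float:
    """Approximate storage in decimal GB; bytes_per_base is an assumption."""
    return num_bases * bytes_per_base / 1e9


for coverage in (15, 30):
    bases = raw_bases(HUMAN_GENOME_BASES, coverage)
    print(f"{coverage}x: {bases / 1e9:.0f} Gbases, ~{stored_gigabytes(bases):.0f} GB")
```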
The two common solutions I've noticed mentioned often in this thread, compression and clouds, are promising, but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (fasta, and more recently fastq), so it will be a while, if ever, before binary formats become standard.
The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.
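The cost comparison can be made explicit with a small sketch. The $80k per year and $60k total figures come from the paragraph above; the admin cost is not given there, so the code treats it as a hypothetical parameter and simply asks how much admin time per year the local option could absorb over a 3-year depreciation period before the cloud becomes cheaper.

```python
def cloud_cost(years: int, per_year: float = 80_000) -> float:
    """Cumulative cloud storage cost for 40 TB, using the post's yearly figure."""
    return years * per_year


def owned_cost(years: int, hardware: float = 60_000, admin_per_year: float = 0.0) -> float:
    """Cumulative cost of local storage; admin_per_year is a hypothetical input."""
    return hardware + years * admin_per_year


def break_even_admin_per_year(years: int, per_year: float = 80_000,
                              hardware: float = 60_000) -> float:
    """Yearly admin cost at which owned and cloud storage cost the same."""
    return (cloud_cost(years, per_year) - hardware) / years


print(cloud_cost(3))                 # 240000
print(owned_cost(3))                 # 60000.0, hardware only
print(break_even_admin_per_year(3))  # 60000.0
```

Under these figures, local storage stays ahead as long as the share of an admin's time attributable to those 40 TB costs less than $60k per year, which is consistent with the claim that owning the hardware is cheaper. Power, cooling and datacenter space are left out of this sketch.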
Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.
Needless to say, this is a very fun time to be a computer scientist working in the field.
-Chris