Genome Researchers Have Too Much Data
An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"
All previous posts have been purged due to too much data.
They don't have too much data, they have insufficient affordable storage.
No such thing as too much data on a scientific topic.
To offset political mods, replace Flamebait with Insightful.
Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.
Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and can distinguish between important stuff and noise.
...from CERN. Sure, the Grid was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.
Is it outpacing their ability to file patents on genome sequences?
A feeling of having made the same mistake before: Deja Foobar
As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most out of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8 core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.
I would figure most genomes are highly compressible. Especially if compressed against thousands of samples of a species and even across different species.
I have half my mother's genome and half my father's. I couldn't have that many mutations. To store all three genomes couldn't take more than 2.0001 times the size of a human genome.
Oh hey look you made another account to goatse /. with. Good job.
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
I was under the impression the complete DNA sequence for a human can be stored on an ordinary CD.
Given the amount of data mentioned in TFA, it begs the question: what the hell are they sequencing? The genome of everyone on the planet?
Wozniak. He is called Wozniak. But opportunity will have to wait, because Jobs is dead. Sorry to break it to you like this.
Come on, every story has an Apple angle, if you look at it the right way.. in fact, I bet those researchers could store all that data on an iPod if they wanted! You can plug it right in and sync with iTunes!
From the article "three billion bases of DNA in a set of human chromosomes". A base may hold 1 of 4 values A, C, G and T. So each base can be represented with 2 bits. 2 bits * 3 billion = 750MB.
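The 2-bits-per-base arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names `pack`, `unpack` and `packed_size_mb` are made up here, and it assumes a clean sequence containing nothing but A, C, G and T (no ambiguous bases, no quality scores).

```python
# Sketch: pack a DNA string into 2 bits per base (A, C, G, T only).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack bases four to a byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpack() sees a fixed layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Inverse of pack(); n_bases is needed because the last byte may be padded."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


def packed_size_mb(n_bases: int) -> float:
    """Size in megabytes (10**6 bytes) at 2 bits per base."""
    return n_bases * 2 / 8 / 1e6
```

Calling `packed_size_mb(3_000_000_000)` reproduces the 750MB figure for one set of human chromosomes.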
Someone needs to introduce these researchers to the 'diff' program.
The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).
I'll reference an earlier /. post about this:
http://science.slashdot.org/story/11/03/06/1533249/graphs-show-costs-of-dna-sequencing-falling-fast
There are some solutions to the storage problems such as Goby (http://campagnelab.org/software/goby/) but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.
ASCII storage of nucleotide and protein information is actually very standard. The most widespread format is called FASTA, named after the fast alignment program that introduced it. When you sequence a whole genome on a second-generation sequencing platform (like Illumina or SOLiD), there's a step in the process where you end up with a huge (10-100 GB) text file containing little puzzle pieces of DNA that must then be assembled by a specialized program. These files usually don't hang around very long, but the point of keeping them in this inefficient storage format is, simply, performance: CPUs are oriented toward byte-based computing at a minimum, and so frequent compression/decompression becomes prohibitively inefficient.
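For anyone who hasn't seen FASTA: each record is a header line starting with ">" followed by one or more lines of sequence text. A minimal reader might look like the sketch below; `parse_fasta` is a name invented for this example, and it loads everything into memory, which would not be practical for the 10-100 GB files mentioned above.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (text after '>') to its concatenated sequence."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence data may be wrapped across several lines.
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}
```

Because the format is plain text, the same loop works on an open file handle or on a list of strings.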
Big biotechnology purchases are typically hundreds of thousands of dollars though, so most labs are used to shelling out for this kind of price bracket.
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Bioinformatics is indeed a very lucrative profession, but few programmers have the willingness to memorize the huge canon of data while they're in college that is required to be proficient in it. The curriculum is about 70% computer science and 30% life sciences, including organic chemistry at some universities.
This seems like just the kind of problem that AI will help with, by narrowing the field of 'interesting' things to look at. Either that, or better ways to search through the available data, along with better ways to store it, will probably work.
Way back in 1993, I visited an atomic laboratory in Pennsylvania. On the tour, they showed us the 30,000 core computing machine they had purchased several years before. "We still can't program it".
30 seconds later he pointed to the next piece of metal.
This is our 120,000 core computer.
I raised my hand "Why did you buy a 120,000 core machine when you can't even program the 30,000 core machine!"
"Well it's faster."
One of my early lessons in big companies attacking the wrong problem.
God: "I don't leave footprints!"
A couple of researchers in Sydney think they've got a model for searching the genome much more efficiently. They're trying to fund their research and development with crowdsourcing: http://rockethub.com/projects/4065-a-gps-for-the-genome : "The PASTE project [is] based on a new number system we call Permutahedral Indexing - P.I. for short, an N-dimensional map that efficiently locates and interrelates complex datasets in the space of all possible data. P.I. does this efficiently even when the data has hundreds of independent dimensions and comes in petabytes and exabytes."
They don't seem to need much money in the scheme of things - I might just throw in $25.
Done: NCBI, DDBJ, and Ensembl all perform that role. The problem is what to do with all of it.
So, why can't they compress the data at the level of proteins? I mean it takes thousands of DNA base pairs to code for 1 protein, like hemoglobin, so instead of storing all that just say "here is the DNA sequence for protein X". Any exceptions, like mutations could then be indicated as "at position 758, the A is replaced by a G".
Of course if there is something REALLY novel, like a bioengineered virus that used different (non-standard) 3 base pair codons to encode the same amino acid, this kind of data compression wouldn't work but for 99.9999% of "natural" cases it would. (I saw this idea in the tv series "regenesis"). So for these (hopefully rare, it was for a bio-weapon!) cases a different type of compression would be used. "My" compression algorithm would, of course, break which would be a good indication this wasn't a natural DNA sequence.
I am neither a bio-expert nor a compression expert but this seems to me to be similar to the problem of compressing a vast library of books. Is it best to compress at the level of letters, words or even sentences? I'm only guessing what this entails because I'm not a linguist either! :(
(Then there's the whole business of introns, which "seem" to be content/protein free but I understand contain lots of regulatory information despite their repetitive nature. I would imagine these could be handled by some sort of pattern RLE.)
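The "here is the known sequence, plus a list of exceptions" idea in this comment can be sketched as a diff against a reference. The code below is a toy under strong assumptions: the function names are hypothetical, only single-base substitutions are handled, and the sample must be the same length as the reference. Real genomes also have insertions and deletions, which this does not cover.

```python
from typing import List, Tuple


def encode_against_reference(reference: str, sample: str) -> List[Tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (pos, s_base)
        for pos, (r_base, s_base) in enumerate(zip(reference, sample))
        if r_base != s_base
    ]


def decode(reference: str, variants: List[Tuple[int, str]]) -> str:
    """Rebuild the sample by applying the stored substitutions to the reference."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)
```

If two sequences are nearly identical, the variant list is tiny compared with the sequence itself, which is the whole appeal of the approach.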
It's not that there's too much data to store. There's too much to analyze. Storing 1M genomes is tractable today. Doing a pairwise comparison of 1M genomes requires half a trillion whole-genome comparisons. Even Google doesn't compute on that scale yet. (Disclaimer: I'm a postdoc in computational biology.)
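The "half a trillion" figure is just the number of unordered pairs among a million genomes, n(n-1)/2. A quick check, with `pairwise_comparisons` being a name chosen for this example:

```python
import math


def pairwise_comparisons(n_genomes: int) -> int:
    """Number of unordered pairs among n genomes, i.e. n choose 2."""
    return math.comb(n_genomes, 2)
```

For `n_genomes = 1_000_000` this comes to 499,999,500,000 comparisons, and the count grows quadratically as more genomes are added.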
I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc
I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 M bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence data, which is roughly 45-90 GB of stored data for a single experiment.
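The coverage arithmetic in that paragraph is genome size times coverage depth. A small sketch, where `raw_gigabases` is a made-up helper and the "one byte per base" conversion to GB is only the rough rule of thumb implied by the numbers above:

```python
def raw_gigabases(genome_size_bases: int, coverage: int) -> float:
    """Raw sequence needed, in Gbases, to cover a genome `coverage` times over."""
    return genome_size_bases * coverage / 1e9


HUMAN_GENOME_BASES = 3_000_000_000  # "around 3 billion", per the comment

low = raw_gigabases(HUMAN_GENOME_BASES, 15)   # lower end of 15-30x coverage
high = raw_gigabases(HUMAN_GENOME_BASES, 30)  # upper end of 15-30x coverage
```

At roughly one byte per base, `low` and `high` correspond to the 45-90 GB per experiment quoted above.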
The two common solutions I've noticed mentioned often in this thread, compression and clouds, are promising, but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (fasta, and more recently fastq), so it will be a while, if ever, before binary formats become standard.
The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.
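To put numbers on the own-hardware-versus-cloud claim, here is a back-of-the-envelope sketch using only the figures quoted above ($80k a year for 40 TB on Amazon, $60k total for 40 TB of HPC storage, 3 years' depreciation). The function name is hypothetical, and power, cooling and datacenter space are ignored.

```python
def admin_budget_at_break_even(cloud_per_year: float,
                               hardware_total: float,
                               years: int) -> float:
    """Yearly amount left over for an admin before owned storage costs as much as cloud."""
    cloud_total = cloud_per_year * years
    return (cloud_total - hardware_total) / years


# Figures from the comment above: 40 TB, depreciated over 3 years.
headroom = admin_budget_at_break_even(80_000, 60_000, 3)
```

With these inputs, owning the storage stays cheaper as long as the share of admin cost attributable to those 40 TB is under about $60k a year.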
Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.
Needless to say, this is a very fun time to be a computer scientist working in the field.
-Chris
Though, there is quite a lot of that being generated these days.
The problem is the *raw* data - the files that come directly off of the sequencing instruments.
When we sequenced the human genome, everything came off the instrument as a 'trace file' - 4 different color traces, one representing a fluorescent dye for each base. These files are larger than text, but you can store the data on your local hard drive and do the base calling and assembly on a desktop or beefy laptop by today's standards.
2nd gen sequencers (Illumina, 454, etc) take images, and a lot of them, generating many GB of data for even small runs. The information is lower quality, but there is a lot more of it. You need a nice storage solution and a workstation grade computer to realistically analyze this data.
3rd gen sequencers are just coming out, and they don't take pictures - they take movies with very high frame rates. Single molecule residence time frame rates. Typically, you don't store the rawest data - the instrument interprets it before the data gets saved out for analysis. You need high end network attached storage solutions to store even the 'interpreted' raw data, and you'd better start thinking about a cluster as an analysis platform.
This is what the article is really about - do you keep your raw 2nd and 3rd gen data? If you are doing one genome, sure! why not? If you are a genome center running these machines all the time, you just can't afford to do that, though. No one can really - the monetary value of the raw data is pretty low, you aren't going to get much new out of it once you've analyzed it, and your lab techs are gearing up to run the instrument again overnight...
The trick is that this puts you at odds with data retention policies that were written at a time when you could afford to retain all of your data...
-V-
Who can decide a priori? Nobody.
-Sartre
Yeah, back when Slashdot ran at 2400 bps, the comment limit was shorter than Twitter. But not to worry, like the Witnesses, the "great crowd" with seven-digit UIDs are relegated to a paradise on earth.
I have to say in 1981 making those decisions I felt like I was providing enough freedom for ten years, that is the move from 64K to 640K felt like something that would last a great deal of time.
The complaints as Gates recalls began in five years. He was off by a factor of two. I remember 1981 clear as day. There was hardly a baseline by which to judge the trajectory of the home computer. A monochrome 80 column display with mixed case was state of the art. By the end of 1982, the PC was selling a decimal order of magnitude faster than IBM projected, which put a whole different spin on enough. Volume drove down cost, and lower cost made eyes bigger sooner than almost anyone guessed.
I've read a lot from Gates over the years. Arrogant in most regards, but rarely stupid. Gates might have had the sentiment that a 0.33 MIPS processor didn't need 16MB of system memory, and figured that the memory limit would be addressed in a less anemic platform in the fullness of time. No-one in 1981 thought that 8088 byte code would still reign supreme thirty years later, any more than COBOL programmers in the 1960s worried about Y2K.
There's Plenty of Room at the Bottom as capiced already in 1959.
I don't really see a problem here. We have more than enough storage for the amount of analysis we're able to do. It's a short term nuisance that we have to invest some resources in being a little more selective in what we save, until storage or analysis catches up again.
There are some applications of genetics where the error component is the signal you're looking for. These methods are less forgiving of lossy synopsis. There might be room for some improvements to storage and compression algorithms in this space.