The DNA Data Deluge

Posted by samzenpus on Thursday June 27, 2013 @02:07PM from the too-many-letters dept.

the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
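The DVD-stack figures in the quote can be roughly checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 4.7 GB single-layer capacity, the 1.2 mm disc thickness, and the ~400 km ISS altitude are assumptions not stated in the article, and the helper names are made up for illustration.

```python
# Rough check of the DVD-stack comparison quoted above.
# Assumptions (not from the article): single-layer 4.7 GB DVDs,
# 1.2 mm per disc, ISS altitude of about 400 km.
DVD_CAPACITY_BYTES = 4.7e9
DVD_THICKNESS_M = 1.2e-3
METERS_PER_MILE = 1609.344
ISS_ALTITUDE_M = 400e3


def stack_height_m(data_bytes):
    """Height in meters of a stack of DVDs holding data_bytes."""
    return data_bytes / DVD_CAPACITY_BYTES * DVD_THICKNESS_M


def years_until_taller_than(height_m, target_m, growth_per_year):
    """Whole years of multiplicative growth until the stack exceeds target_m."""
    years = 0
    while height_m <= target_m:
        height_m *= growth_per_year
        years += 1
    return years


annual_bytes = 15e15  # "about 15 petabytes" per year, from the article
height = stack_height_m(annual_bytes)
height_miles = height / METERS_PER_MILE
years_slow = years_until_taller_than(height, ISS_ALTITUDE_M, 3)  # threefold growth
years_fast = years_until_taller_than(height, ISS_ALTITUDE_M, 5)  # fivefold growth

print(f"stack today: {height_miles:.1f} miles")
print(f"years to pass the ISS at 3x/year: {years_slow}, at 5x/year: {years_fast}")
```

Under these assumptions the stack comes out a little over 2 miles, and both growth rates pass the assumed ISS altitude within five years, which is consistent with the article's claim.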

At least they're not rolling their own. by The_Wilschon · 2013-06-27 14:10 · Score: 4, Interesting

In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.

--
SIGSEGV caught, terminating

wait... not that kind of sig.
1. Re:At least they're not rolling their own. by Samantha Wright · 2013-06-27 15:59 · Score: 4, Informative
  
  I can't comment on the physics data, but in the case of the bio data that the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up grub's menu.lst. We keep it around on the off chance that some of it might be useful in some other way eventually, although there are ongoing concerns that much of the data just won't be high enough quality for some stuff.
  On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and get their own information. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet. There's still vying for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing).
  Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Bogus units by vanzin · 2013-06-27 14:17 · Score: 5, Insightful

Everybody knows we should measure the pile height in Libraries of Congress. Or VW Beetles.
The problem will solve itself by Krishnoid · 2013-06-27 14:39 · Score: 5, Funny

To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.
Simple. Get the NSA to do it. by Anonymous Coward · 2013-06-27 14:46 · Score: 5, Funny

Publish a scientific paper stating that potential terrorists or other subversives can be identified via DNA sequencing. The NSA will then covertly collect DNA samples from the entire population, and store everyone's genetic profiles in massive databases. Government will spend the trillions of dollars necessary without question. After all, if you are against it, you want another 9/11 to happen.
Database Replication by VortexCortex · 2013-06-27 14:58 · Score: 4, Insightful

Bit rot is also a big problem with data. So, the data has to be replicated to keep entropy from destroying it, which means self-correcting metadata must be used. If only there were a highly compact, self-correcting, self-replicating data storage system with 1's and 0's the size of small molecules...
My greatest fear is that when we meet the aliens, they'll laugh, stick us in a holographic projector, and gather around to watch the vintage porn encoded in our DNA.
Storage Non-Problem - Sequences Compress to MBs by esten · 2013-06-27 15:12 · Score: 5, Informative

Storage is not the problem. Computational power is.
Each genetic sequence is ~3 GB, but since sequences between individuals are very similar, it is possible to compress them by only recording the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).
Now the real challenge is more in having enough computational power to read and process the 3-billion-letter genetic sequence, and in designing effective algorithms to process this data.
More info on compression of genomic sequences
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/
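To make the "record only the differences from a reference" idea concrete, here is a minimal Python sketch. It is not the method from the linked paper: it assumes equal-length sequences and handles single-letter substitutions only, whereas real tools also deal with insertions, deletions, and quality scores. The function names are hypothetical. The last few lines redo the storage arithmetic from the comment; the population figure is simply what the ~132 PB total implies.

```python
def encode_against_reference(reference, sample):
    """Return (position, base) pairs where sample differs from reference.

    Simplifying assumption: both sequences have the same length,
    so only substitutions can be represented.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decode_from_reference(reference, diffs):
    """Rebuild the sample by applying the recorded substitutions."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


# Toy sequences, invented for illustration.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = encode_against_reference(reference, sample)
restored = decode_from_reference(reference, diffs)

# Storage arithmetic using the figures in the comment above.
COMPRESSED_GENOME_BYTES = 20e6   # ~20 MB per genome
TOTAL_BYTES = 132e15             # ~132 PB for everybody
WORLD_STORAGE_BYTES = 295e18     # 295 exabytes
implied_people = TOTAL_BYTES / COMPRESSED_GENOME_BYTES
share_percent = TOTAL_BYTES / WORLD_STORAGE_BYTES * 100

print(diffs, restored == sample)
print(f"implied population: {implied_people:.2e}, share: {share_percent:.3f}%")
```

The toy sample differs from the reference in two places, so only two pairs need storing. The arithmetic implies about 6.6 billion genomes and a share of roughly 0.045%, which rounds to the 0.05% quoted.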
Oddly... I have a clue about this stuff lately by WaywardGeek · 2013-06-27 15:35 · Score: 5, Interesting

Please... entire DNA genomes are tiny... on the order of 1Gb, with no compression. Taking into account the huge similarities to published genomes, we can compress that by at least 1000X. What they are talking about is the huge amount of data spit out by the sequencing machines in order to determine your genome. Once determined, it's tiny.
That said, what I need is raw machine data. I'm having to do my own little exome research project. My family has a very rare form of X-linked color blindness that is most likely caused by a single gene defect on our X chromosome. It's no big deal, but now I'm losing central vision, with symptoms most similar to late-onset Stargardt's disease. My UNC ophthalmologist beat the experts at Johns Hopkins and Jacksonville's hospital, and made the correct call, directly refuting the other doctors' diagnosis of Stargardt's. She thought I had something else and that my DNA would prove it. She gave me the opportunity to have my exome sequenced, and she was right.
So, I've got something pretty horrible, and my ophthalmologist thinks it's most likely related to my unusual form of color blindness. My daughter carries this gene, as does my cousin and one of her sons. Gene research to the rescue?!? Unfortunately... no. There are simply too few people like us. So... being a slashdot sort of geek who refuses to give up, I'm running my own study. Actually, the UNC researchers wanted to work with me... all I'd have to do is bring my extended family from California to Chapel Hill a couple of times over a couple of years and have them see doctors at UNC. There's simply no way I could make that happen.
Innovative companies to the rescue... This morning, Axeq, a company headquartered in MD, received my family's DNA for exome sequencing at their Korean lab. They ran an exome sequencing special in April: $600 per exome, with an order size minimum of six. They have been great to work with, and accepted my order for only four. Bioserve, also in MD, did the DNA extraction from whole blood, and they have been even more helpful. The blood extraction labs were also incredibly helpful, once we found the right places (very emphatically not Labcorp or Quest Diagnostics). The Stanford clinic lab manager was unbelievably helpful, and in LA, the lab director at the San Antonio Hospital Lab went way overboard. So far, I have to give Axeq and Bioserve five stars out of five, and the blood draw labs deserve a six.
Assuming I get what I'm expecting, I'll get a library of matched genes, and also all the raw machine output data, for four relatives. The output data is what I really need, since our particular mutation is not currently in the gene database. Once I get all the data, I'll need to do a bit of coding to see if I can identify the mutation. Unfortunately, there are several ways that this could be impossible. For example, "copy number variations", or CNVs, cannot be detected with current technology if they go on for more than a few hundred base pairs. Ah... the life of a geek. This is yet another field I have to get familiar with...

--
Celebrate failure, and then learn from it - Nolan Bushnell
Re:Who uses DVDs? by Samantha Wright · 2013-06-27 16:00 · Score: 4, Funny

And we can double storage efficiency by using two stacks! Clearly, they need to hire one of us.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Digital DNA storage anyone ? by the gnat · 2013-06-27 16:30 · Score: 4, Informative

why aren't they storing it in digital DNA format
Because they need to be able to read it back quickly, and error-free. Add to that, it's actually quite expensive to synthesize that much DNA; hard drives are relatively cheap by comparison.
Yay, AdEnine & 1 click splicing by charlesjo488 · 2013-06-27 19:10 · Score: 4, Funny

Scientists who viewed this sequence also viewed these sequences...