The DNA Data Deluge
the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.
SIGSEGV caught, terminating
wait... not that kind of sig.
We've got bigger storage media than DVDs!
don't store it all on DVDs, then
Everybody knows we should measure the pile height in Libraries of Congress. Or VW Beetles.
If they put it all on hard drives, it would only be 600 feet tall.
Why aren't they storing it in digital DNA format? Seems like a pretty efficient data storage format to me! A couple of grams of the stuff should suffice.
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.
...what a shitty storage medium DVDs are these days.
A cheap 3TB disk drive burned to DVDs will produce a rather unwieldy tower of disks as well.
G.
Publish a scientific paper stating that potential terrorists or other subversives can be identified via DNA sequencing. The NSA will then covertly collect DNA samples from the entire population, and store everyone's genetic profiles in massive databases. Government will spend the trillions of dollars necessary without question. After all, if you are against it, you want another 9/11 to happen.
Bit rot is also a big problem with data. So the data has to be duplicated to keep entropy from destroying it, which means self-correcting metadata must be used. If only there were a highly compact, self-correcting, self-replicating data storage system with 1's and 0's the size of small molecules...
My greatest fear is that when we meet the aliens, they'll laugh, stick us in a holographic projector, and gather around to watch the vintage porn encoded in our DNA.
Pretty amazing amount of intellectual capacity, generated by evolution.
Problem solved.
It seems a little overly sensationalist to aggregate the devices together when determining the storage size to make such a dramatic 2 mile high tower of DVDs... If you look at them individually, it's not that much data:
(15 x 10^15 bytes) / (2000 devices) / (1 x 10^9 bytes/GB) = 7500 GB per device, or 7.5 TB
That's a stack of 4TB hard drives 2 inches high. Or if you must use DVDs, that's a stack of 1600 DVDs 2 meters high.
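For anyone who wants to check the arithmetic, here is a minimal Python sketch. The 15 PB and 2000-instrument figures come from the article; the DVD capacity (4.7 GB, single layer) and thickness (1.2 mm) are assumptions of mine, not numbers from the summary.

```python
# Per-instrument share of the article's 15 PB/year, and the DVD stacks it implies.
TOTAL_BYTES_PER_YEAR = 15 * 10**15   # 15 PB per year, from the article
INSTRUMENTS = 2000                   # from the article
DVD_BYTES = 4.7 * 10**9              # assumption: single-layer DVD capacity
DVD_THICKNESS_MM = 1.2               # assumption: standard disc thickness
METRES_PER_MILE = 1609.344


def dvd_stack(n_bytes):
    """Return (number of DVDs, stack height in metres) for n_bytes of data."""
    discs = n_bytes / DVD_BYTES
    return discs, discs * DVD_THICKNESS_MM / 1000


share = TOTAL_BYTES_PER_YEAR / INSTRUMENTS          # bytes per instrument per year
discs, height_m = dvd_stack(share)
all_discs, all_height_m = dvd_stack(TOTAL_BYTES_PER_YEAR)

print(f"{share / 1e12:.1f} TB per instrument per year")
print(f"{discs:.0f} DVDs per instrument, about {height_m:.1f} m tall")
print(f"all instruments: about {all_height_m / METRES_PER_MILE:.1f} miles of DVDs")
```

Under those assumptions the aggregate stack also comes out a little over 2 miles, which matches the article's claim.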
The LHC generates a shitton of data as well, but from what I've seen (something like this) they use extremely fast integrated circuits to skim the data. Perhaps geneticists could use a similar technique.
Storage is not the problem. Computational power is.
Each genetic sequence is ~3GB, but since sequences between individuals are very similar it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).
Now the real challenge is more in having enough computational power to read and process the 3 billion letters genetic sequence and designing effective algorithms to process this data.
More info on compression of genomic sequences
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/
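A quick sanity check of those storage numbers, as a sketch. The ~3 GB, ~20 MB and 295 exabyte figures are from the comment; the population of 6.6 billion is an assumption, back-solved from 132 PB divided by 20 MB.

```python
# Rough storage estimate for reference-delta genomes.
RAW_GENOME_BYTES = 3 * 10**9         # ~3 GB per genome, from the comment
DELTA_GENOME_BYTES = 20 * 10**6      # ~20 MB per genome, from the comment
WORLD_STORAGE_BYTES = 295 * 10**18   # 295 exabytes, from the comment
POPULATION = 6.6 * 10**9             # assumption: implied by 132 PB / 20 MB

ratio = RAW_GENOME_BYTES / DELTA_GENOME_BYTES      # compression factor
total = DELTA_GENOME_BYTES * POPULATION            # bytes for everybody
fraction = total / WORLD_STORAGE_BYTES             # share of worldwide storage

print(f"compression ratio: {ratio:.0f}x")
print(f"total: {total / 1e15:.0f} PB, {fraction:.3%} of worldwide storage")
```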
They should use a NoSQL multi-shard vertically integrated stack with a RESTful rails driven in-memory virtual multi-parallel JPython enabled solution.
Bingo!
putting the 'B' in LGBTQ+
"...At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station."
And another 10 years after that, the amount of DVDs used will have almost reached the number of AOL CDs sitting in landfills.
Sorry, couldn't help myself with the use of such an absurd metric. Not like we haven't moved on to other forms of storage the size of a human thumbnail that offer 15x the density of a DVD...
You mean ask the NSA how they've already done it.
Table-ized A.I.
Yes, but that took millions of years to develop the simplest versions.
It's astonishing that it took humans only a few millennia to get to that point on our own.
If there were only some way to store the information encoded in DNA in a molecular level storage device... oh wait, face palm.
Yes, storing the finished results of human genomic sequencing is a solved problem. The interim data, though, is massive. Also, not every sequencing experiment is of good ol' Homo sapiens.
Please... entire DNA genomes are tiny... on the order of 1Gb, with no compression. Taking into account the huge similarities to published genomes, we can compress that by at least 1000X. What they are talking about is the huge amount of data spit out by the sequencing machines in order to determine your genome. Once determined, it's tiny.
That said, what I need is raw machine data. I'm having to do my own little exome research project. My family has a very rare form of X-linked color blindness that is most likely caused by a single gene defect on our X chromosome. It's no big deal, but now I'm losing central vision, with symptoms most similar to late-onset Stargardt's disease. My UNC ophthalmologist beat the experts at Johns Hopkins and Jacksonville's hospital, and made the correct call, directly refuting the other doctors' diagnosis of Stargardt's. She thought I had something else and that my DNA would prove it. She gave me the opportunity to have my exome sequenced, and she was right.
So, I've got something pretty horrible, and my ophthalmologist thinks it's most likely related to my unusual form of color blindness. My daughter carries this gene, as does my cousin and one of her sons. Gene research to the rescue?!? Unfortunately... no. There are simply too few people like us. So... being a slashdot sort of geek who refuses to give up, I'm running my own study. Actually, the UNC researchers wanted to work with me... all I'd have to do is bring my extended family from California to Chapel Hill a couple of times over a couple of years and have them see doctors at UNC. There's simply no way I could make that happen.
Innovative companies to the rescue... This morning, Axeq, a company headquartered in MD, received my family's DNA for exome sequencing at their Korean lab. They ran an exome sequencing special in April: $600 per exome, with an order size minimum of six. They have been great to work with, and accepted my order for only four. Bioserve, also in MD, did the DNA extraction from whole blood, and they have been even more helpful. The blood extraction labs were also incredibly helpful, once we found the right places (very emphatically not Labcorp or Quest Diagnostics). The Stanford clinic lab manager was unbelievably helpful, and in LA, the lab director at the San Antonio Hospital Lab went way overboard. So far, I have to give Axeq and Bioserve five stars out of five, and the blood draw labs deserve a six.
Assuming I get what I'm expecting, I'll get a library of matched genes, and also all the raw machine output data, for four relatives. The output data is what I really need, since our particular mutation is not currently in the gene database. Once I get all the data, I'll need to do a bit of coding to see if I can identify the mutation. Unfortunately, there are several ways that this could be impossible. For example, "copy number variations", or CNVs, cannot be detected with current technology if they go on for more than a few hundred base pairs. Ah... the life of a geek. This is yet another field I have to get familiar with...
Celebrate failure, and then learn from it - Nolan Bushnell
Each genetic sequence is ~3GB, but since sequences between individuals are very similar it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB.
That's true, but the problem is that as a good scientist you are required by most journals and universities to keep the original sequence data coming off these high-throughput sequencers (aka the fastq files) so that you can show your work, so to speak, if it ever comes into question. These files often contain 30-40x coverage of your 3Gb reference sequence and even compressed are still several GB in size. Additionally, because these large-scale sequencing projects cost millions of dollars, the NIH isn't going to be happy if you lose the data due to drive failure, so you'll need that data duplicated using a RAID setup and offsite backup. So storage is actually a huge problem.
That's your own germline DNA. But it would be cool to get the distinct sequences of all the cells in your body. Most of those cells (by count, not mass) are various microorganisms, lots in your gut, or infections that are making you sick or wearing down your immune system, or latent conditions like HIV or HPV, and you would see the evolution of a few strains of precancerous / cancerous cells too. Taken altogether that would be a huge amount of DNA. But I guess a lot of the distinct genomes are localized and you couldn't sample them easily.
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
NO. This does not put anything "into perspective", except that it means "a lot of data" to the average Joe.
To put it into useful perspective, we should compare it with the large data volumes encountered in other sciences, such as 25PB per year from the LHC. And that's after aggressively discarding collisions that don't look promising in the first pass; it would be orders of magnitude bigger otherwise.
But now just 15PB per year doesn't look that newsworthy, eh?
Oliver.
Or, to put it in even better perspective, if you encoded each bit in a Rubik’s cube and stacked them end to end, it would stretch for half a light year. With the data increasing 5x every year, in less than 2 years the stack will reach to Alpha Centauri and back. In under 8 years, it will be the width of the galaxy. In 16 years, it will be as wide as the entire universe.
Or perhaps a more reasonable perspective is to realize that the entire genetic data collected each year by all sequencers in all labs and hospitals in the world, if stored on SATA disks, would fit in a Subaru Forester.
This is very much the case. I work as a bioinformatician at a sequencing center, and I would say we see around 50-100 GB of sequence data for the average run/experiment, which isn't really so bad, certainly not compared to the high energy physics crowd and given a decent network. The trick is what we want to do with the data: some of the processes are embarrassingly parallel, but many algorithms don't lend themselves to that sort of thing. We have a few 1TB RAM machines, and even those are limiting in some cases. Many of the problems are NP-hard, and even for the heuristics we'd ideally use superlinear algorithms, but we can't have that either. It's near linear time (and memory) or bust, which sucks.
I'm actually really looking forward to a vast reduction in dataset size and cost in the life sciences, so we can make use of and design better algorithmic methods and get back to answering questions. That's up to the engineers designing the sequencing machines though..
The snow doesn't give a soft white damn whom it touches. -- ee cummings
"At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station."
Use more than one stack. You're welcome.
A single finished genome is not the problem. It is the raw data.
The problem is that any time you sequence a new individual's genome for a species that already has a genome assembly, you need minimum 5x coverage across the genome to reliably find variation. Because of variation in coverage, that means you may have to shoot for >20x coverage to find all the variation. The problem is more complex when you are trying to de novo assemble a genome for a species that does NOT have a genome assembly. In this case, you often have to aim for at least 40x coverage (and in the 100x range may be better).
To get the data, we use next-gen sequencing. To give you an idea of the data output, a single Illumina HiSeq 2000 run produces 3 billion reads. Each "read" is a pair of genomic fragments 100 bases long. That means 600,000,000,000 bases are produced in a single run. The run is stored as a .fastq file, meaning that each base is stored as an ASCII character, and has an associated quality score stored as another ASCII character. So that's 1.2 trillion ASCII characters for a single run, or about 1.09 terabytes uncompressed. This does not include the storage for the (incompressible) images taken by the sequencing machine in order to call the bases. They can be an order of magnitude larger. A single experiment may involve dozens of such runs.
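The run-size arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the comment's own estimate: like the comment, it counts one character per base plus one per quality score and ignores FASTQ header lines and newlines. The 3-billion-base genome size used for the coverage figure is an assumption (roughly human-sized).

```python
# Back-of-the-envelope size of one sequencing run stored as FASTQ.
READ_PAIRS = 3 * 10**9       # reads per run, from the comment
BASES_PER_PAIR = 2 * 100     # each read is a pair of 100-base fragments
GENOME_BASES = 3 * 10**9     # assumption: human-sized genome

bases = READ_PAIRS * BASES_PER_PAIR   # total bases sequenced in the run
chars = 2 * bases                     # one base char + one quality char each
tib = chars / 2**40                   # "terabytes" in the binary (2^40) sense
coverage = bases / GENOME_BASES       # average depth over the assumed genome

print(f"{bases:,} bases, {chars:,} characters")
print(f"about {tib:.2f} TB uncompressed, {coverage:.0f}x coverage")
```

Note that the 1.09 figure only works out if a terabyte is taken as 2^40 bytes; in decimal units the same run is 1.2 TB.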
There is an expectation that these runs will be made available in a public repository when an analysis is published. That puts great stress on places like NIH, where 1.7 quadrillion raw bases have been uploaded in about the last four years:
http://www.ncbi.nlm.nih.gov/Traces/sra/
You are correct when you say that computational power is a bigger problem, but again, this is not related to the three billion bases of the genome, which is trivial in size. Once again, the problem is the raw data. When assembling a new species' genome from scratch, you somehow have to reassemble those 3 billion pairs of 100-base reads. The way that is done is by hashing every single read into pieces about 21 nucleotides long, then storing them all, creating a de Bruijn graph, and navigating through it. The amount of RAM required for this is absolutely insane.
Actually the larger problem at the moment is not so much computational power, it's the extremely large amount of memory (TB range) needed to assemble the genomes. 1000 core computers are fairly common; ones with 2 TB of memory that are available for weeks at a time are not.
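To make the k-mer idea concrete, here is a toy sketch in Python. It is not how any particular assembler is implemented: the reads are made up, k is 4 rather than about 21, and the walk simply follows unambiguous edges. What it does show is why memory blows up, since every distinct k-mer from every read ends up in the graph.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every overlapping substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes following it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow edges from start while the next node is unambiguous."""
    seq, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:        # stop if we loop back (a repeat)
            break
        seen.add(node)
        seq += node[-1]
    return seq


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]   # toy overlapping reads (hypothetical)
graph = de_bruijn(reads, k=4)
print(walk(graph, "ATG"))                   # prints ATGGCGTGCAAT
```

Real genomes have repeats and sequencing errors, so nodes with more than one outgoing edge are common and the simple walk above would stop early; that is where the hard part of assembly starts.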
Oddly enough this won't be a problem as newer machines make longer reads which can be assembled with less memory.
I never knew Moore's law (which applies to transistor count) now applies to the cost of DNA sequencing....
Here's hoping Moore's law applies to the housing market soon too!
If you have the genomes of your parents, and your own genome, yours is about 70 new spot mutations, about 60 crossovers, and you have to specify who your parents were. About 600 bytes of new information per person. You could store the genomes of the entire human race on a couple terabytes if you knew the family trees well enough. I tried to nail down the statistics for that in http://burtleburtle.net/bob/future/geninfo.html .
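One way to see where a number like 600 bytes could come from is the sketch below. The mutation and crossover counts are from the comment; everything else (a naive fixed-width encoding, 2 bits per replacement base, a 3-billion-position genome, 7 billion people) is my own assumption and may not match how the linked page counts.

```python
import math

GENOME_BASES = 3 * 10**9    # assumption: ~3 billion positions
POPULATION = 7 * 10**9      # assumption: rough world population
MUTATIONS = 70              # new spot mutations, from the comment
CROSSOVERS = 60             # crossovers, from the comment

pos_bits = math.ceil(math.log2(GENOME_BASES))    # bits to name one position
person_bits = math.ceil(math.log2(POPULATION))   # bits to name one parent

bits = (MUTATIONS * (pos_bits + 2)    # position plus a 2-bit replacement base
        + CROSSOVERS * pos_bits       # position of each crossover
        + 2 * person_bits)            # identifiers of the two parents
bytes_per_person = math.ceil(bits / 8)
total_tb = bytes_per_person * POPULATION / 1e12

print(f"{bytes_per_person} bytes per person, {total_tb:.1f} TB for everyone")
```

With these assumptions it lands a bit under 600 bytes per person and a few terabytes overall, the same ballpark as the comment.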
Scientists who viewed this sequence also viewed these sequences...
As an example, I used to build phylogenetic trees with an algorithm called RAxML. Along comes something called FastTree, which is just or nearly as accurate, but 1-2 orders of magnitude faster..
The amount of biological data doubles every 9 months, while processing power doubles every 18 months. We have already reached the crossover point.
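Taking those two doubling times at face value (they are the comment's figures, not something I have verified), a short Python sketch shows how fast the gap between data and compute widens. The function names are just illustrative.

```python
def growth(months, doubling_months):
    """Growth factor after `months` for a quantity with the given doubling time."""
    return 2 ** (months / doubling_months)


def gap(months, data_doubling=9, compute_doubling=18):
    """How much faster data has grown than processing power after `months`."""
    return growth(months, data_doubling) / growth(months, compute_doubling)


for years in (1.5, 3, 5):
    print(f"after {years} years the data/compute gap is {gap(years * 12):.1f}x")
```

With a 9-month versus 18-month doubling, the gap itself doubles every 18 months, so it is roughly tenfold after five years.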
15 discs is enough, according to the article on storing 1 petabyte on a single DVD we had just one week ago. Problem solved!
Why not have a reference genome? For everyone else, simply store deviations from the reference. Seems a possibility.
Each genetic sequence is ~3GB, but since sequences between individuals are very similar it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).
That's for a single delta to a reference, but there's probably lots of redundancy in the deltas. If you have a tree/set of variations (Base human + "typical" Asian + "typical" Japanese + "typical" Okinawa + encoding the diff) you can probably bring the world estimate down by a few orders of magnitude, depending on how much is systematic and how much is unique to the individual.
Live today, because you never know what tomorrow brings
The sequence read archives (such as the one hosted by NCBI) that serve as repositories for this sequencing data use "compression by reference," a highly efficient way to compress and store a lot of the data. The raw data that comes off these sequencers is often >99% homologous to the reference genome (such as human, etc.), so the most efficient way to compress and store this data is to record only what is different between the sequence output and the reference genome.
Does it say anything about also handling internal repeats this way?
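The basic idea can be sketched in a few lines of Python. This is a toy illustration, not the archive's actual format: it assumes the read has already been aligned to a known position, handles substitutions only (no insertions or deletions), and uses a made-up reference and read.

```python
def encode(read, reference, pos):
    """Store a read as its start position, length, and mismatching bases only."""
    window = reference[pos:pos + len(read)]
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), diffs


def decode(record, reference):
    """Rebuild the original read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"   # toy reference (hypothetical)
read = "CGTTAGACGA"                    # matches reference[5:15] except one base
record = encode(read, reference, 5)

print(record)                          # (5, 10, [(6, 'A')])
print(decode(record, reference) == read)
```

A read that matches the reference exactly collapses to just a position and a length, which is why data that is >99% identical to the reference compresses so well.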
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
I suggest storing it in molecular form as pairs of four different bases (guanine, thymine, adenine, cytosine) combined in an aesthetically pleasing, double helical molecule!
Can't we have more meaningful units?
How many Libraries of Congress is that?
To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.
This doesn't put it into perspective at all. What is a DVD and why would I put data on it?
This posting is provided 'AS IS' without warranty of any kind, implied or otherwise.
Whoever modded this "funny" isn't paranoid enough.
pronounced shi-TAWN ?
It's not like they can use the data, it all is or soon will be patented! Even the patent holders are SOL because anything their bit of patented gene interacts with is patented by someone else. What a lovely system we have!
In the future, DVD's will be made much thinner and won't stack up as high.
I believe that is a very efficient storage mechanism
NSA could get to do something nice. They already know how to deal with mountains of useless data. Scanning for cancer and potential terrorists? Mutations of genes and behaviour? Possibilities are endless.