Slashdot Mirror


The DNA Data Deluge

the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"
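
The stack-height figures quoted above can be roughly checked with a few lines of arithmetic. This is only a sketch, not the article's own calculation: the 4.7 GB single-layer DVD capacity, the 1.2 mm disc thickness, and the ~400 km ISS altitude are assumptions that do not appear in the summary.

```python
# Rough check of the DVD-stack figures quoted in the summary.
# Assumptions (not from the article): single-layer DVD = 4.7e9 bytes,
# disc thickness = 1.2 mm, ISS altitude ~ 400 km.
DATA_BYTES_PER_YEAR = 15e15      # 15 petabytes per year, from the article
DVD_BYTES = 4.7e9
DVD_THICKNESS_M = 1.2e-3
METERS_PER_MILE = 1609.344
ISS_ALTITUDE_M = 400e3


def stack_height_miles(data_bytes: float) -> float:
    """Height in miles of a stack of DVDs holding data_bytes."""
    discs = data_bytes / DVD_BYTES
    return discs * DVD_THICKNESS_M / METERS_PER_MILE


def years_until_stack_passes_iss(growth_per_year: float) -> int:
    """Years of growth before a single year's stack exceeds the ISS altitude."""
    height_m = stack_height_miles(DATA_BYTES_PER_YEAR) * METERS_PER_MILE
    years = 0
    while height_m <= ISS_ALTITUDE_M:
        height_m *= growth_per_year
        years += 1
    return years


if __name__ == "__main__":
    print(f"stack today: {stack_height_miles(DATA_BYTES_PER_YEAR):.2f} miles")
    for growth in (3, 5):
        years = years_until_stack_passes_iss(growth)
        print(f"{growth}x per year: passes the ISS after {years} years")
```

Under these assumptions the quoted "more than 2 miles" and "within the next five years" both appear to hold, even at the slower threefold growth rate.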

34 of 138 comments

  1. At least they're not rolling their own. by The_Wilschon · · Score: 4, Interesting

    In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.

    --
    SIGSEGV caught, terminating

    wait... not that kind of sig.
    1. Re:At least they're not rolling their own. by nan0 · · Score: 2

      a brief review of their documentation should shed some light. http://root.cern.ch/root/doc/RootDoc.html

    2. Re:At least they're not rolling their own. by 50000BTU_barbecue · · Score: 2
      You must have learned that early.

      http://en.wikipedia.org/wiki/IBM_1360

      --
      Mostly random stuff.
    3. Re:At least they're not rolling their own. by stox · · Score: 2

      Being the wake in front of the Bleeding Edge, HEP gets to learn all sorts of lessons before everyone else. As a result, you get to make all the mistakes that everyone else gets to learn from.

      --
      "To those who are overly cautious, everything is impossible. "
    4. Re:At least they're not rolling their own. by bdabautcb · · Score: 2

      I'm no techie. I programmed some in BASIC as a kid thanks to 3-2-1 Contact, and the last thing I did of note was to put a girl I liked in math's TI on an infinite loop printing 'I got drunk last weekend and couldn't derive' or some such. Been running Linux because I inherited a netbook with no disc drive and couldn't get Windows to install from USB and I can't afford a new computer, and I've been reading slash for years and read about USB installs. My question is, is there any movement to use compute cycles at publicly funded data centers like the one going up in Utah to crunch big data like this that would benefit the public? Is that even possible in the current vitriolic environment regarding data? I am young but old enough to remember people fighting over access to processing power just so they could try out new ideas. Often when someone had an idea good enough to warrant investigation, their colleagues would go above and beyond to make a run happen.

      --
      Koalas. They're telepathic. Plus, they control the weather. -Margaret
    5. Re:At least they're not rolling their own. by Samantha+Wright · · Score: 4, Informative

      I can't comment on the physics data, but in the case of the bio data that the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up grub's boot.lst. We keep it around on the off chance that some of it might be useful in some other way eventually, although there are ongoing concerns that much of the data just won't be high enough quality for some stuff.

      On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and get their own information. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet. There's still vying for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing.)

      Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    6. Re:At least they're not rolling their own. by Samantha+Wright · · Score: 2

      It's a neat thought, but it would never beat the basics. While there are a lot of genes that have common ancestors (called paralogues), the hierarchical history of these genes is often hard to determine or something that pre-dates human speciation; for example, there's only one species (a weird blob a little like a multi-cellular amoeba) that has a single homeobox gene.

      While building a complete evolutionary history of gene families is of great interest to science, it's pointless to try exploiting it for compression when we can just turn to standard string methods; as has been mentioned elsewhere on this story, gzip can be faster than the read/write buffer on standard hard drives. Having to replay an evolutionary history we can only guess at would be a royal pain.

      That being said, we can store individuals' genomes as something akin to diff patches, which brings 3.1 gigabytes of raw ASCII down to about 4 MB of high-entropy data, even before compression.
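
      A minimal sketch of the diff-patch idea, assuming only single-base substitutions (real variant formats also record insertions and deletions). The function names and the toy sequences are hypothetical, chosen only for illustration.

```python
# Sketch: store an individual's genome as differences from a reference.
# Only substitutions are handled; real variant formats also cover
# insertions and deletions. All names here are hypothetical.
from typing import List, Tuple

Variant = Tuple[int, str]  # (0-based position, replacement base)


def diff_against_reference(reference: str, sample: str) -> List[Variant]:
    """Record every position where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Rebuild the sample sequence from the reference plus its variants."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
sample = "ACGTACCTACGTACGA"
variants = diff_against_reference(reference, sample)
```

      Here `variants` holds two entries for a 16-base sequence; the saving grows with how similar the sample is to the reference.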

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    7. Re:At least they're not rolling their own. by The_Wilschon · · Score: 2
      Cycles are rarely the issue for us in HEP, and when they are, all we need is more nodes to split the problem into smaller pieces (wiki: embarrassingly parallel problem). The actual computational needs are (typically) pretty small. The main bottleneck is usually data throughput. We discard enormous amounts of data (that may or may not be useful, depending on who you ask) simply because we can't store it anywhere close to as fast as we can make it (many orders of magnitude difference between the data production rate and the data storage rate). And then, when we're analyzing the data we've taken, our CPUs tend to sit idle while they wait on the disk to read another block of events, which then take only a few cycles to add in to the necessary histograms. It only gets worse when the data is somewhere far away on the network. And it gets even worse when you want to select a subset of the data -- with our systems you have to make a full copy of the subset.

      There are two big wins that modern big data has developed that we could benefit greatly from if the switchover costs weren't too high. The first is distributing data over many disks on many nodes and bringing the code to the data instead of bringing the data to the code. The more disks your data is on, the less you have to wait on seek times. The second is storing the data in a way that is not strictly sequential in a single set of files, so that if you want to look at a subset of the data, you can effectively do that without having to make a copy of that subset.
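
      The first of those wins can be sketched in a few lines: each node histograms the shard it already holds, and only the small per-node histograms travel over the network. The shards, the binning function, and all names below are made up for illustration; this is not any particular framework's API.

```python
# Sketch: "bring the code to the data". Each node histograms its own
# local shard and only the small per-node histograms are merged.
# The shards and the binning function are made-up illustrations.
from collections import Counter
from typing import Callable, Iterable, List


def local_histogram(events: Iterable[float], bin_of: Callable[[float], int]) -> Counter:
    """Runs on the node that holds the shard; returns bin -> count."""
    return Counter(bin_of(e) for e in events)


def merge_histograms(partials: List[Counter]) -> Counter:
    """Runs on the coordinator; combines the per-node results."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def energy_bin(energy: float) -> int:
    """Hypothetical binning: 10-unit-wide bins."""
    return int(energy // 10)


# Three shards standing in for data spread over three nodes.
shards = [[1.0, 12.5, 14.0], [3.2, 27.9], [11.1, 25.0, 29.9]]
histogram = merge_histograms([local_histogram(s, energy_bin) for s in shards])
```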

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    8. Re:At least they're not rolling their own. by The_Wilschon · · Score: 2

      You should not write a C++ interpreter. You especially shouldn't write an interpreter of a language that looks almost just like C++, but is different from it in unpredictable ways, some of which contribute to bad coding habits and/or make normal C++ more difficult to learn.

      Strictly sequential files are a bad model for data if most of your time is spent constructing more-and-more elaborate subsets of that data. When we want to examine a subset, we practically have to make a complete copy of all the data falling into that subset. You want to make a small tweak to your selection? Make a new copy all over again.
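
      One way to avoid the copy, sketched under the assumption that events can be addressed by index: keep a single copy of the data and represent each selection as a list of row indices. The event fields and cut values are hypothetical.

```python
# Sketch: keep one copy of the events and represent each selection as a
# list of row indices into it, so tweaking a cut never copies event data.
# Event fields and cut values are hypothetical.
from typing import Callable, Dict, List, Optional

Event = Dict[str, float]

events: List[Event] = [
    {"pt": 12.0, "eta": 0.4},
    {"pt": 48.0, "eta": 2.9},
    {"pt": 55.0, "eta": 1.1},
    {"pt": 31.0, "eta": 0.2},
]


def select(predicate: Callable[[Event], bool],
           within: Optional[List[int]] = None) -> List[int]:
    """Return indices of events passing predicate, optionally refining a prior selection."""
    candidates = range(len(events)) if within is None else within
    return [i for i in candidates if predicate(events[i])]


high_pt = select(lambda e: e["pt"] > 30.0)
central_high_pt = select(lambda e: abs(e["eta"]) < 2.5, within=high_pt)
```

      Changing a cut then means recomputing a short index list rather than rewriting the selected events.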

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
  2. Bogus units by vanzin · · Score: 5, Insightful

    Everybody knows we should measure the pile height in Libraries of Congress. Or VW Beetles.

    1. Re:Bogus units by schivvers · · Score: 2

      I thought the standard was "Statue of Liberty" for height, and "Rhode Islands" for area.

      --
      Life's journey is not to arrive at the grave safely in a well-preserved body, but rather to skid in sideways, totally wo
  3. Re:Who uses DVDs? by schivvers · · Score: 2

    Who measures in feet? That's so archaic! Try using something more modern, like Empire State buildings...or Saturn V rockets.

    --
    Life's journey is not to arrive at the grave safely in a well-preserved body, but rather to skid in sideways, totally wo
  4. Digital DNA storage anyone ? by Anonymous Coward · · Score: 2, Insightful

    why aren't they storing it in digital DNA format?. Seems like a pretty efficient data storage format to me! A couple of grams of the stuff should suffice.

    1. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 3, Interesting

      Actually ASCII files are the easiest to process. And since we generally use a handful of ambiguity codes, it's more like ATGCNX. Due to repetitive segments GZIP actually works out better than your proposed 2-bit scheme. We do a lot of UNIX piping through GZIP which is still faster than a magnetic harddrive can retrieve data.
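
      A small sketch of the comparison being made, using Python's `zlib` (DEFLATE, the algorithm behind gzip) against fixed 2-bit packing. The test sequence is deliberately repetitive and synthetic, so the result only illustrates why repeats favour gzip; it says nothing about the ratio on real genomes.

```python
# Sketch: compare fixed 2-bit packing with DEFLATE (the algorithm behind
# gzip) on a deliberately repetitive sequence. Real genomes are far less
# regular, so the ratio here is illustrative only.
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(sequence: str) -> bytes:
    """Pack four bases per byte; cannot represent N or other ambiguity codes."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Inverse of pack_2bit; length is needed to drop the padding."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


repetitive = "ACGTTGCA" * 500
two_bit_size = len(pack_2bit(repetitive))
deflate_size = len(zlib.compress(repetitive.encode("ascii")))
```

      Note that the 2-bit scheme has no room for the ambiguity codes mentioned above, which is another practical reason to stay with ASCII plus gzip.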

    2. Re:Digital DNA storage anyone ? by the+gnat · · Score: 4, Informative

      why aren't they storing it in digital DNA format

      Because they need to be able to read it back quickly, and error-free. Add to that, it's actually quite expensive to synthesize that much DNA; hard drives are relatively cheap by comparison.

    3. Re:Digital DNA storage anyone ? by wezelboy · · Score: 3, Interesting

      When I had to get the first draft of the human genome onto CD, I used 2 bit substitution and run length encoding on repeats. gzip definitely did not cut it.
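
      One plausible reading of "run length encoding on repeats", sketched for single-base runs only. This is an assumption about the approach, not the commenter's actual code.

```python
# Sketch: run-length encode a base sequence as (base, count) pairs.
# This is one plausible reading of "run length encoding on repeats",
# not the commenter's actual code.
from itertools import groupby
from typing import Iterable, List, Tuple


def rle_encode(sequence: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical bases into (base, run_length)."""
    return [(base, len(list(run))) for base, run in groupby(sequence)]


def rle_decode(runs: Iterable[Tuple[str, int]]) -> str:
    """Expand (base, run_length) pairs back into the original sequence."""
    return "".join(base * count for base, count in runs)


runs = rle_encode("AAAACCGTTTTTT")
```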

  5. The problem will solve itself by Krishnoid · · Score: 5, Funny

    To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.

    Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.

  6. This just goes to show... by Gavin+Scott · · Score: 3, Informative

    ...what a shitty storage medium DVDs are these days.

    A cheap 3TB disk drive burned to DVDs will produce a rather unwieldy tower of disks as well.

    G.

  7. Simple. Get the NSA to do it. by Anonymous Coward · · Score: 5, Funny

    Publish a scientific paper stating that potential terrorists or other subversives can be identified via DNA sequencing. The NSA will then covertly collect DNA samples from the entire population, and store everyone's genetic profiles in massive databases. Government will spend the trillions of dollars necessary without question. After all, if you are against it, you want another 9/11 to happen.

  8. Database Replication by VortexCortex · · Score: 4, Insightful

    Bit rot is also a big problem with data. So, the data has to be reduplicated to keep entropy from destroying it, which means a self corrective meta data must be used. If only there were a highly compact self correcting self replicating data storage system with 1's and 0's the size of small molecules...

    My greatest fear is that when we meet the aliens, they'll laugh, stick us in a holographic projector, and gather around to watch the vintage porn encoded in our DNA.

    1. Re:Database Replication by __aasqbs9791 · · Score: 3, Funny

      I propose we call this new data method Data Neutral Assembly.

    2. Re:Database Replication by c0lo · · Score: 2

      Bit rot is also a big problem with data.

      Take a whiff from a piece of meat after 2 weeks at room temperature and compare it with how a DVD smells after the same time.
      Complaints about bit rot accepted only after the experiment.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    3. Re:Database Replication by chihowa · · Score: 2

      I know you were going for funny, but much of what you will be smelling in your experiment is from bacteria eating the protein and polysaccharides in the meat. The DNA is remarkably stable and even if some of it is fragmented, you have a massively redundant set in your pile of meat.

      We've sequenced DNA from nearly a million years ago and I regularly store DNA dried out and stuck to a piece of paper. DVDs won't last nearly that long before the dyes start to break down. For a long term archival system, we could do much worse than DNA.

      --
      If you want a vision of the future, imagine a youtube comments section scrolling - forever.
  9. 2000 devices make a lot of data by hawguy · · Score: 2

    It seems a little overly sensationalist to aggregate the devices together when determining the storage size to make such a dramatic 2 mile high tower of DVD's... If you look at them individually, it's not that much data:

    (15 x 10^15 bytes) / (2000 devices) / (1 x 10^9 bytes/GB) = 7500 GB, or 7.5 TB per device per year

    That's a stack of 4TB hard drives 2 inches high. Or if you must use DVD's, that's a stack of 1600 DVD's 2 meters high.
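
    The per-device arithmetic above can be checked with a short script. The 4.7 GB single-layer DVD capacity and 1.2 mm disc thickness are assumptions not stated in the comment.

```python
# Per-instrument share of the yearly sequencing output.
# Assumptions (not from the comment): 4.7e9-byte single-layer DVD,
# 1.2 mm disc thickness.
import math

TOTAL_BYTES = 15e15        # 15 PB per year across all instruments
INSTRUMENTS = 2000
DVD_BYTES = 4.7e9
DVD_THICKNESS_M = 1.2e-3

per_instrument_bytes = TOTAL_BYTES / INSTRUMENTS
per_instrument_tb = per_instrument_bytes / 1e12
dvds_needed = math.ceil(per_instrument_bytes / DVD_BYTES)
stack_height_m = dvds_needed * DVD_THICKNESS_M
```

    With these assumptions the script agrees with the comment: about 7.5 TB per instrument, roughly 1600 DVDs, and a stack a little under 2 meters.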

  10. Storage Non-Problem - Sequences Compresses to MBs by esten · · Score: 5, Informative

    Storage is not the problem. Computational power is.

    Each genetic sequence is ~3GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).

    Now the real challenge is more in having enough computational power to read and process the 3-billion-letter genetic sequence and designing effective algorithms to process this data.
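
    A quick check of the storage estimate, assuming decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes, 1 EB = 10^18 bytes). The variable names are illustrative only.

```python
# Back-of-the-envelope check of the "genome for everybody" estimate.
# Decimal units are an assumption; the comment does not say which it uses.
GENOME_DIFF_BYTES = 20e6       # ~20 MB per person, from the comment
TOTAL_STORAGE_BYTES = 132e15   # ~132 PB, from the comment
WORLD_STORAGE_BYTES = 295e18   # 295 exabytes, from the comment

implied_population = TOTAL_STORAGE_BYTES / GENOME_DIFF_BYTES
share_of_world_storage_percent = 100 * TOTAL_STORAGE_BYTES / WORLD_STORAGE_BYTES
```

    The 132 PB figure corresponds to about 6.6 billion people, and the share works out to roughly 0.045%, consistent with the rounded 0.05%.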

    More info on compression of genomic sequences
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/

  11. The answer is obvious! by plopez · · Score: 3, Funny

    They should use a NoSQL multi-shard vertically integrated stack with a RESTful rails driven in-memory virtual multi-parallel JPython enabled solution.

    Bingo!

    --
    putting the 'B' in LGBTQ+
  12. Re:Simple. Get the NSA to do it. by Tablizer · · Score: 2

    You mean ask the NSA how they've already done it.

  13. Re:Nice amount of intellectual capacity by evoluti by hedwards · · Score: 2

    Yes, but that took millions of years to develop the simplest versions.

    It's astonishing that it took humans only a few millennia to get to that point on our own.

  14. Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 5, Interesting

    Please... entire DNA genomes are tiny... on the order of 1Gb, with no compression. Taking into account the huge similarities to published genomes, we can compress that by at least 1000X. What they are talking about is the huge amount of data spit out by the sequencing machines in order to determine your genome. Once determined, it's tiny.

    That said, what I need is raw machine data. I'm having to do my own little exome research project. My family has a very rare form of X-linked color blindness that is most likely caused by a single gene defect on our X chromosome. It's no big deal, but now I'm losing central vision, with symptoms most similar to late-onset Stargardt's Disease. My UNC ophthalmologist beat the experts at Johns Hopkins and Jacksonville's hospital, and made the correct call, directly refuting the other doctors' diagnosis of Stargardt's. She thought I had something else and that my DNA would prove it. She gave me the opportunity to have my exome sequenced, and she was right.

    So, I've got something pretty horrible, and my ophthalmologist thinks it's most likely related to my unusual form of color blindness. My daughter carries this gene, as does my cousin and one of her sons. Gene research to the rescue?!? Unfortunately... no. There are simply too few people like us. So... being a slashdot sort of geek who refuses to give up, I'm running my own study. Actually, the UNC researchers wanted to work with me... all I'd have to do is bring my extended family from California to Chapel Hill a couple of times over a couple of years and have them see doctors at UNC. There's simply no way I could make that happen.

    Innovative companies to the rescue... This morning, Axeq, a company headquartered in MD, received my family's DNA for exome sequencing at their Korean lab. They ran an exome sequencing special in April: $600 per exome, with an order size minimum of six. They have been great to work with, and accepted my order for only four. Bioserve, also in MD, did the DNA extraction from whole blood, and they have been even more helpful. The blood extraction labs were also incredibly helpful, once we found the right places (very emphatically not Labcorp or Quest Diagnostics). The Stanford clinic lab manager was unbelievably helpful, and in LA, the lab director at the San Antonio Hospital Lab went way overboard. So far, I have to give Axeq and Bioserve five stars out of five, and the blood draw labs deserve a six.

    Assuming I get what I'm expecting, I'll get a library of matched genes, and also all the raw machine output data, for four relatives. The output data is what I really need, since our particular mutation is not currently in the gene database. Once I get all the data, I'll need to do a bit of coding to see if I can identify the mutation. Unfortunately, there are several ways that this could be impossible. For example, "copy number variations", or CNVs, if they go on for over a few hundred base pairs, are unable to be detected with current technology. Ah... the life of a geek. This is yet another field I have to get familiar with...

    --
    Celebrate failure, and then learn from it - Nolan Bushnell
  15. Re:Who uses DVDs? by Samantha+Wright · · Score: 4, Funny

    And we can double storage efficiency by using two stacks! Clearly, they need to hire one of us.

    --
    Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
  16. Re:Storage Non-Problem - Sequences Compresses to M by B1ackDragon · · Score: 3, Informative

    This is very much the case. I work as a bioinformatician at a sequencing center, and I would say we see around 50-100G of sequence data for the average run/experiment, which isn't really so bad, certainly not compared to the high energy physics crowd and given a decent network. The trick is what we want to do with the data: some of the processes are embarrassingly parallel, but many algorithms don't lend themselves to that sort of thing. We have a few 1TB RAM machines, and even those are limiting in some cases. Many of the problems are NP-hard, and even for the heuristics we'd ideally use superlinear algorithms, but we can't have that either; it's near linear time (and memory) or bust, which sucks.

    I'm actually really looking forward to a vast reduction in dataset size and cost in the life sciences, so we can make use of and design better algorithmic methods and get back to answering questions. That's up to the engineers designing the sequencing machines though..

    --
    The snow doesn't give a soft white damn whom it touches. -- ee cummings
  17. Re:Storage Non-Problem - Sequences Compresses to M by Anonymous Coward · · Score: 3, Interesting

    A single finished genome is not the problem. It is the raw data.

    The problem is that any time you sequence a new individual's genome for a species that already has a genome assembly, you need minimum 5x coverage across the genome to reliably find variation. Because of variation in coverage, that means you may have to shoot for >20x coverage to find all the variation. The problem is more complex when you are trying to de novo assemble a genome for a species that does NOT have a genome assembly. In this case, you often have to aim for at least 40x coverage (and in the 100x range may be better).

    To get the data, we use next-gen sequencing. To give you an idea of the data output, a single Illumina HiSeq 2000 run produces 3 billion reads. Each "read" is a pair of genomic fragments 100 bases long. That means 600,000,000,000 bases are produced in a single run. The run is stored as a .fastq file, meaning that each base is stored as an ASCII character, and has an associated quality score stored as another ASCII character. So that's 1.2 trillion ASCII characters for a single run, or about 1.09 terabytes uncompressed. This does not include the storage for the (incompressible) images taken by the sequencing machine in order to call the bases. They can be an order of magnitude larger. A single experiment may involve dozens of such runs.
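
    The run-size arithmetic can be reproduced directly. This sketch counts only base and quality characters and ignores FASTQ header lines and newlines, which is a simplifying assumption; the 1.09 figure comes out when dividing by 2^40 bytes.

```python
# Size of one sequencing run stored as FASTQ, using the figures above.
# Simplification: header lines and newlines in the file are ignored.
READS_PER_RUN = 3_000_000_000
FRAGMENTS_PER_READ = 2        # each read is a pair of fragments
BASES_PER_FRAGMENT = 100
CHARS_PER_BASE = 2            # one base character + one quality character

bases_per_run = READS_PER_RUN * FRAGMENTS_PER_READ * BASES_PER_FRAGMENT
chars_per_run = bases_per_run * CHARS_PER_BASE
size_tib = chars_per_run / 2**40   # one byte per ASCII character
```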

    There is an expectation that these runs will be made available in a public repository when an analysis is published. That puts great stress on places like NIH, where 1.7 quadrillion raw bases have been uploaded in about the last four years:
    http://www.ncbi.nlm.nih.gov/Traces/sra/

    You are correct when you say that computational power is a bigger problem, but again, this is not related to the three billion bases of the genome, which is trivial in size. Once again, the problem is the raw data. When assembling a new species' genome from scratch, you somehow have to reassemble those 3 billion pairs of 100-base reads. The way that is done is by hashing every single read into pieces about 21 nucleotides long, then storing them all, creating a de Bruijn graph, and navigating through it. The amount of RAM required for this is absolutely insane.
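
    A toy sketch of the k-mer graph approach described above, using k = 4 instead of ~21 so the example stays readable. The function names and reads are hypothetical, and real assemblers also handle sequencing errors, reverse complements, and branching, none of which is attempted here.

```python
# Toy de Bruijn graph assembly. Nodes are (k-1)-mers; every distinct
# k-mer found in the reads adds an edge from its prefix to its suffix.
# Real assemblers use k around 21 and far more elaborate graph cleanup.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Build prefix -> [suffix, ...] edges from the distinct k-mers."""
    graph: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen:
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk(graph: Dict[str, List[str]], start: str) -> str:
    """Follow unambiguous edges from start, spelling out a contig."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:   # stop on a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig


reads = ["ACGTG", "GTGCA", "GCATT"]
graph = build_de_bruijn(reads, k=4)
contig = walk(graph, "ACG")
```

    Even in this toy form, every distinct k-mer has to be held in memory at once, which hints at why the RAM requirement explodes for billions of reads.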

  18. Yay, AdEnine & 1 click splicing by charlesjo488 · · Score: 4, Funny

    Scientists who viewed this sequence also viewed these sequences...

  19. Miles of DVDs? by sanman2 · · Score: 2

    Can't we have more meaningful units?

    How many Libraries of Congress is that?