Slashdot Mirror


The DNA Data Deluge

the_newsbeagle writes "Fast, cheap genetic sequencing machines have the potential to revolutionize science and medicine--but only if geneticists can figure out how to deal with the floods of data their machines are producing. That's where computer scientists can save the day. In this article from IEEE Spectrum, two computational biologists explain how they're borrowing big data solutions from companies like Google and Amazon to meet the challenge. An explanation of the scope of the problem, from the article: 'The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively generate about 15 petabytes of compressed genetic data each year. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.'"

138 comments

  1. At least they're not rolling their own. by The_Wilschon · · Score: 4, Interesting

    In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.

    --
    SIGSEGV caught, terminating

    wait... not that kind of sig.
    1. Re:At least they're not rolling their own. by Anonymous Coward · · Score: 0

      Care to comment on lessons learned?

    2. Re:At least they're not rolling their own. by nan0 · · Score: 2

      a brief review of their documentation should shed some light. http://root.cern.ch/root/doc/RootDoc.html

    3. Re:At least they're not rolling their own. by 50000BTU_barbecue · · Score: 2
      You must have learned that early.

      http://en.wikipedia.org/wiki/IBM_1360

      --
      Mostly random stuff.
    4. Re:At least they're not rolling their own. by Anonymous Coward · · Score: 1

      I rolled my own but forgot what the results were...

    5. Re:At least they're not rolling their own. by stox · · Score: 2

      Being the wake in front of the Bleeding Edge, HEP gets to learn all sorts of lessons before everyone else. As a result, you get to make all the mistakes that everyone else gets to learn from.

      --
      "To those who are overly cautious, everything is impossible. "
    6. Re:At least they're not rolling their own. by bdabautcb · · Score: 2

      I'm no techie, I programmed some in BASIC as a kid thanks to 3-2-1 Contact, and the last thing I did of note was to put a girl I liked in math's TI on an infinite loop printing 'I got drunk last weekend and couldn't derive' or some such. Been running Linux because I inherited a netbook with no disc drive and couldn't get Windows to install from USB and I can't afford a new computer, and I've been reading slash for years and read about USB installs. My question is, is there any movement to use compute cycles at publicly funded data centers like the one going up in Utah to crunch big data like this that would benefit the public? Is that even possible in the current vitriolic environment regarding data? I am young but old enough to remember people fighting over access to processing power just so they could try out new ideas. Often when someone had an idea good enough to warrant investigation, their colleagues would go above and beyond to make a run happen.

      --
      Koalas. They're telepathic. Plus, they control the weather. -Margaret
    7. Re:At least they're not rolling their own. by Samantha+Wright · · Score: 4, Informative

      I can't comment on the physics data, but in the case of the bio data that the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up grub's boot.lst. We keep it around on the off chance that some of it might be useful in some other way eventually, although there are ongoing concerns that much of the data just won't be high enough quality for some stuff.

      On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and get their own information. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet. There's still vying for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing.)

      Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    8. Re:At least they're not rolling their own. by Anonymous Coward · · Score: 0

      It's obvious, at least to me, that the genome must be organized hierarchically like a tree, the tree of life. The tree is the most efficient way to organize data because it eliminates unnecessary duplication. Of course, wherever there is a tree, there are also branches and wherever there are branches there must be a branch control hierarchy to manage it all. This is where research should focus, IMO.

    9. Re:At least they're not rolling their own. by Samantha+Wright · · Score: 2

      It's a neat thought, but it would never beat the basics. While there are a lot of genes that have common ancestors (called paralogues), the hierarchical history of these genes is often hard to determine or something that pre-dates human speciation; for example, there's only one species (a weird blob a little like a multi-cellular amoeba) that has a single homeobox gene.

      While building a complete evolutionary history of gene families is of great interest to science, it's pointless to try exploiting it for compression when we can just turn to standard string methods; as has been mentioned elsewhere on this story, gzip can be faster than the read/write buffer on standard hard drives. Having to replay an evolutionary history we can only guess at would be a royal pain.

      That being said, we can store individuals' genomes as something akin to diff patches, which brings 3.1 gigabytes of raw ASCII down to about 4 MB of high-entropy data, even before compression.
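
      A minimal sketch of the diff-patch idea, assuming only single-base substitutions (real variant formats also have to handle insertions and deletions). The names `make_patch` and `apply_patch` are made up for illustration:

```python
def make_patch(reference: str, genome: str) -> list[tuple[int, str]]:
    """Record only the positions where genome differs from reference."""
    if len(reference) != len(genome):
        raise ValueError("this sketch only handles substitutions")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]


def apply_patch(reference: str, patch: list[tuple[int, str]]) -> str:
    """Rebuild the individual's genome from the reference plus the patch."""
    bases = list(reference)
    for pos, base in patch:
        bases[pos] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
individual = "ACGTACCTACGA"
patch = make_patch(reference, individual)
print(patch)  # [(6, 'C'), (11, 'A')]
```

      Only the patch needs to be kept per person; the reference is stored once and shared.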

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    10. Re:At least they're not rolling their own. by msevior · · Score: 1

      But it (mostly) works...

    11. Re:At least they're not rolling their own. by K.+S.+Kyosuke · · Score: 1

      In high energy physics, we rolled our own big data solutions (mostly because there was no big data other than us when we did so). It turned out to be terrible.

      But genetic data isn't particle physics data. It makes perfect sense to roll out a custom "big data" (whatever that crap means) solution because of the very nature of the data stored (at the very least, you will want DNA-specific compression algorithms because there's huge redundancy in the data spread horizontally across the sequenced individuals).

      --
      Ezekiel 23:20
    12. Re:At least they're not rolling their own. by K.+S.+Kyosuke · · Score: 1

      gzip can be faster than the read/write buffer on standard hard drives.

      Gzip of what? Chromosome-at-once? Isn't that the wrong way of traversing the data set, if you're aiming for actual compression? More to the point, gzip, if I'm not mistaken, is good for data with 8-bit boundaries. What if the data gets stored in base-4, six bits per triplet/codon? Finally, talking about string algorithms, I'd have thought that the best way of compressing the stuff would involve mapping the extant alleles and storing only references to them in the individual genomes.

      --
      Ezekiel 23:20
    13. Re:At least they're not rolling their own. by Samantha+Wright · · Score: 1

      Here's the lowdown on how BGZF works, as one example. In this case, there are many short, distinct reads of DNA being stored together, each with offset and quality information, many of which may be identical. The compression is localized to smaller blocks (I'm not sure if they're 4096-byte disk sectors or something else.) You're right that there's probably some performance lost due to the misalignment, but 6 and 8 line up every 24 bits, so at worst that means patterns of four codons or three bytes, and a step of four amino acids is ideal for alpha helix motifs, so it's not all a loss.
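
      Roughly, block-wise compression looks like the sketch below. This is not the real on-disk layout of any format, just independent zlib blocks plus an offset index, with a block size picked arbitrarily for the example. The point is that one block can be read back without inflating the whole file:

```python
import zlib

BLOCK_SIZE = 4096  # arbitrary for this sketch, not taken from any spec


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each block independently and remember where it starts."""
    blob = bytearray()
    index = []  # (offset into blob, compressed length) per block
    for start in range(0, len(data), block_size):
        packed = zlib.compress(data[start:start + block_size])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_block(blob: bytes, index, n: int) -> bytes:
    """Decompress only block n, leaving the rest untouched."""
    offset, length = index[n]
    return zlib.decompress(blob[offset:offset + length])


data = b"ACGTTGCA" * 2000  # 16,000 bytes of toy sequence
blob, index = compress_blocks(data)
```

      The trade-off is that patterns spanning a block boundary are not shared between blocks, so the ratio is a little worse than compressing everything in one stream.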

      And, yes, regarding individual genomes: I'm pretty sure that'd be all anyone stored if they didn't have to hold onto the FASTQ files for auditability.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    14. Re:At least they're not rolling their own. by The_Wilschon · · Score: 2
      Cycles are rarely the issue for us in HEP, and when they are, all we need is more nodes to split the problem into smaller pieces (wiki: embarrassingly parallel problem). The actual computational needs are (typically) pretty small. The main bottleneck is usually data throughput. We discard enormous amounts of data (that may or may not be useful, depending on who you ask) simply because we can't store it anywhere close to as fast as we can make it (many orders of magnitude difference between the data production rate and the data storage rate). And then, when we're analyzing the data we've taken, our CPUs tend to sit idle while they wait on the disk to read another block of events, which then take only a few cycles to add in to the necessary histograms. It only gets worse when the data is somewhere far away on the network. And it gets even worse when you want to select a subset of the data -- with our systems you have to make a full copy of the subset.

      There are two big wins that modern big data has developed that we could benefit greatly from if the switchover costs weren't too high. The first is distributing data over many disks on many nodes and bringing the code to the data instead of bringing the data to the code. The more disks your data is on, the less you have to wait on seek times. The second is storing the data in a way that is not strictly sequential in a single set of files, so that if you want to look at a subset of the data, you can effectively do that without having to make a copy of that subset.
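
      A toy sketch of the first win, "bring the code to the data". The nested lists stand in for events sitting on different nodes' disks, and the function names are invented for the example; a real system would run `local_histogram` on each node and ship only the result:

```python
from collections import Counter


def local_histogram(events, bin_width):
    """Runs where the data lives; returns only a small summary."""
    return Counter(int(e // bin_width) for e in events)


def merge(histograms):
    """Runs on the coordinator; only the summaries cross the network."""
    total = Counter()
    for h in histograms:
        total += h
    return total


# Each inner list stands in for the events stored on one node's disks.
partitions = [[1.2, 3.4, 3.9], [0.5, 3.1], [7.7, 1.0, 1.9]]
result = merge(local_histogram(p, bin_width=2.0) for p in partitions)
print(dict(result))
```

      The raw events never move; what travels is a handful of bin counts per node.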

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    15. Re:At least they're not rolling their own. by The_Wilschon · · Score: 2

      You should not write a C++ interpreter. You especially shouldn't write an interpreter of a language that looks almost just like C++, but is different from it in unpredictable ways, some of which contribute to bad coding habits and/or make normal C++ more difficult to learn.

      Strictly sequential files are a bad model for data if most of your time is spent constructing more-and-more elaborate subsets of that data. When we want to examine a subset, we practically have to make a complete copy of all the data falling into that subset. You want to make a small tweak to your selection? Make a new copy all over again.
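
      For contrast, here is a sketch of what a selection could look like if it were stored as a list of indices into the data rather than as a copy of it. The event fields and the `select` helper are hypothetical, not anything from an actual HEP framework:

```python
events = [
    {"id": 0, "energy": 12.5, "tracks": 2},
    {"id": 1, "energy": 80.0, "tracks": 4},
    {"id": 2, "energy": 55.0, "tracks": 1},
    {"id": 3, "energy": 91.3, "tracks": 3},
]


def select(events, predicate, base=None):
    """Return indices of matching events; `base` narrows an earlier selection."""
    candidates = range(len(events)) if base is None else base
    return [i for i in candidates if predicate(events[i])]


# First cut: keep only the index list, not the events themselves.
high_energy = select(events, lambda e: e["energy"] > 50)

# Tweaking the selection reuses the earlier indices instead of recopying data.
tweaked = select(events, lambda e: e["tracks"] >= 3, base=high_energy)
print(high_energy, tweaked)  # [1, 2, 3] [1, 3]
```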

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    16. Re:At least they're not rolling their own. by cthulhu11 · · Score: 1

      I would think that sequence data would reduce especially well with directed compression and especially deduplication.

    17. Re:At least they're not rolling their own. by Samantha+Wright · · Score: 1

      Yeah, it can be compressed pretty aggressively, as we've discussed elsewhere in this comment thread. However, compression performance has to be balanced with IO speed.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    18. Re:At least they're not rolling their own. by delt0r · · Score: 1

      Well in your defense. You [high energy physics] process far more data in an hour than they are talking about producing in years.

      I shouldn't say they. I work with next gen DNA data on a daily basis. The main problem is everyone in biology uses awful flat ASCII files for so many things. And databases... well most are so badly done because they are literally done by someone reading the "SQL for Dummies" book as he does it.

      The last but not least of the problems is experimental design. Too often things are sequenced because you just sequence it, a bit like a machine that goes ping.

      --
      If information wants to be free, why does my internet connection cost so much?
  2. Does it have to be said? by Anonymous Coward · · Score: 0

    We've got bigger storage media than DVDs!

  3. obvious solution by Anonymous Coward · · Score: 1

    don't store it all on DVDs, then

  4. Bogus units by vanzin · · Score: 5, Insightful

    Everybody knows we should measure the pile height in Libraries of Congress. Or VW Beetles.

    1. Re:Bogus units by schivvers · · Score: 2

      I thought the standard was "Statue of Liberty" for height, and "Rhode Islands" for area.

      --
      Life's journey is not to arrive at the grave safely in a well-preserved body, but rather to skid in sideways, totally wo
    2. Re:Bogus units by Anonymous Coward · · Score: 0

      I liked using Texases for area, but Billy Bob Thornton ruined it for everyone.

    3. Re:Bogus units by Anonymous Coward · · Score: 0

      I thought Billy Bob Thorntons were measurements of insanity?

    4. Re:Bogus units by Anonymous Coward · · Score: 0

      Or we could just do the math on real world storage (fucking crazy I know). A 2TB drive is about $80, so let's just round up and say $50/1TB * 1024TB/1PB * 15PB = $768000. So less than a million for the entire world's worth of data storage. Pretty insignificant compared to the cost of the sequencing equipment really.
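
      The arithmetic above, spelled out. It only covers raw drive purchases at the comment's rounded-up price; redundancy, servers, power and staff are not included:

```python
PRICE_PER_TB_USD = 50   # the comment's rounded-up price per terabyte
TB_PER_PB = 1024
DATA_PB = 15            # yearly output quoted in the summary

cost = PRICE_PER_TB_USD * TB_PER_PB * DATA_PB
print(f"${cost:,}")  # $768,000
```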

    5. Re:Bogus units by Anonymous Coward · · Score: 0

      You thought wrong. Apparently you're new around here.

  5. Who uses DVDs? by Anonymous Coward · · Score: 0

    If they put it all on hard drives, it would only be 600 feet tall.

    1. Re:Who uses DVDs? by schivvers · · Score: 2

      Who measures in feet? That's so archaic! Try using something more modern, like Empire State buildings...or Saturn V rockets.

      --
      Life's journey is not to arrive at the grave safely in a well-preserved body, but rather to skid in sideways, totally wo
    2. Re:Who uses DVDs? by Anonymous Coward · · Score: 0

      Or AC penises.

    3. Re:Who uses DVDs? by Samantha+Wright · · Score: 4, Funny

      And we can double storage efficiency by using two stacks! Clearly, they need to hire one of us.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    4. Re:Who uses DVDs? by Anonymous Coward · · Score: 1

      Or AC penises.

      We're talking about big size measurements not micro measurements.

  6. Digital DNA storage anyone ? by Anonymous Coward · · Score: 2, Insightful

    Why aren't they storing it in digital DNA format? Seems like a pretty efficient data storage format to me! A couple of grams of the stuff should suffice.

    1. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 1

      That brings up an interesting point. I wonder how they ARE storing it? With 4 possible bases, you should only need two bits per. So, 4 per byte, with no compression. I hope they aren't just writing out ASCII files or something ...
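
      A minimal sketch of the two-bits-per-base packing described above. The bit order (first base in the high bits) and the zero padding of the last byte are arbitrary choices for the example, and there is no room in this scheme for anything other than A, C, G and T:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zeros
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack; length is needed because of the padding."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(packed.hex())  # 8f10
```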

    2. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 3, Interesting

      Actually ASCII files are the easiest to process. And since we generally use a handful of ambiguity codes, it's more like ATGCNX. Due to repetitive segments GZIP actually works out better than your proposed 2-bit scheme. We do a lot of UNIX piping through GZIP, which is still faster than a magnetic hard drive can retrieve data.
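
      A quick way to see both sides of this claim on toy data. The two inputs are deliberate extremes invented for the example (a perfectly repetitive sequence and a pseudo-random one); real genomic data sits somewhere in between, so this illustrates the mechanism rather than measuring anything:

```python
import random
import zlib

random.seed(0)  # fixed seed so the toy data is reproducible

repetitive = ("ACGTTGCA" * 5000).encode()  # 40,000 bases
shuffled = "".join(random.choice("ACGT") for _ in range(40000)).encode()

two_bit_size = 40000 // 4  # bytes needed at 2 bits per base

for name, data in (("repetitive", repetitive), ("random", shuffled)):
    print(name, len(zlib.compress(data, 9)), "vs", two_bit_size)
```

      On the repetitive input the compressed size falls far below the fixed 2-bit size; on the random input it does not, since there is no redundancy for a general-purpose compressor to exploit.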

    3. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      Correct, much of the work (especially experimental "poking around" sort of work) is done with ASCII files. I regularly check out data in vim, transform it with sed or awk, and send it back to gzip when I'm done.
      But for certain tasks, or for archival purposes, we do have more advanced compression methods. For example, BAM and CRAM for alignments/assemblies, and SRA for unmapped reads. VCF files efficiently summarize known differences from reference strains, collapsing multi megabases of information into tens of lines. There exist specialized database solutions which allow quick searching/retrieval and somewhat tight storage for several kinds of biological data.

    4. Re:Digital DNA storage anyone ? by the+gnat · · Score: 4, Informative

      why aren't they storing it in digital DNA format

      Because they need to be able to read it back quickly, and error-free. Add to that, it's actually quite expensive to synthesize that much DNA; hard drives are relatively cheap by comparison.

    5. Re:Digital DNA storage anyone ? by wezelboy · · Score: 3, Interesting

      When I had to get the first draft of the human genome onto CD, I used 2 bit substitution and run length encoding on repeats. gzip definitely did not cut it.
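
      The comment doesn't say how the run-length step was done, so the following is only a guess at the general idea: collapsing runs of a single repeated base. The "repeats" in question may well have been longer motifs, which this sketch does not handle:

```python
from itertools import groupby


def rle_encode(seq: str) -> list[tuple[str, int]]:
    """Collapse each run of identical bases into (base, run_length)."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def rle_decode(pairs) -> str:
    """Expand (base, run_length) pairs back into the original sequence."""
    return "".join(base * count for base, count in pairs)


encoded = rle_encode("AAAAAAAACGTTTTTT")
print(encoded)  # [('A', 8), ('C', 1), ('G', 1), ('T', 6)]
```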

    6. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      ASCII is fast and error correcting now? I have been long gone!

    7. Re:Digital DNA storage anyone ? by mapkinase · · Score: 1

      I did a first draft of insulin sequence on punch cards.

      --
      I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
    8. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      If your high throughput operation is at the level of piping an inefficient textual format through UNIX pipes with GZIP... if you are using GZIP for data at rest while having grave issues with storage size and the data you are working on has a lot of higher-level redundancy that GZIP could not possibly catch... then you are simply not solving your needs with appropriate tools. It sounds like you might be using several orders of magnitude more storage and computing resources than you would really need to. Which I guess might be just the way it has to be if hiring high-cost programmers is not acceptable to the people offering the funding while buying high cost super computers is.

    9. Re:Digital DNA storage anyone ? by the+gnat · · Score: 1

      ASCII is fast and error correcting now?

      Relative to genome sequencing, hell yes it is! For the original sequencing, a relatively high error rate isn't a huge deal because there is massive redundancy in the fragment reads, which is also required to actually assemble all those bits and pieces. But you can see why it's even more inefficient this way...

    10. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      I work for one of the larger labs and in fact that's how almost all the data is stored (ASCII files) - do a Google search on FASTQ files. (FASTQ is just FASTA with quality data with it). People, particularly on Slashdot, assume all kinds of things about DNA data and how it compresses. The facts are simple - just about every program out there expects textual representation. And without the quality data (metadata) often the DNA itself is useless.
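
      For anyone who hasn't seen one, a FASTQ record is four lines of text: a header starting with `@`, the bases, a `+` separator, and one quality character per base. A minimal reader, assuming the common Phred+33 quality encoding and no line wrapping (both are assumptions; older files differ):

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        # Phred+33: quality score = ASCII code minus 33 (assumed here)
        yield header[1:], seq, [ord(c) - 33 for c in qual]


sample = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(sample))
print(records[0])  # ('read1', 'GATTACA', [40, 40, 40, 40, 40, 40, 2])
```

      The quality line is as long as the sequence line, which is part of why the files are so much bigger than the bases alone.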

    11. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      Pretty much this. But I know for a fact that there are more than 2K next gen sequencers in the wild currently. Add in the old tech and it's a heck of a lot more than 15 petabytes being generated.

      Heck, one single cancer sequencing center in Canada has published that it is generating 9 petabytes every two years in raw ASCII seq files.

      I am more worried about large scale alignment of this data. It's a trivial thing to program it up but not trivial to actually run it across even 1% of the bank of data being generated.

    12. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      You really should look up "fast" and "error correcting". It does not mean what you think it means.

      Now some real info for you:
      Fast? Since ASCII was designed specifically to represent written letters and characters, it sucks at just about every other data type (and now also sucks at that, since the world has grown and the USA is not the only country making computers). Using ASCII is simple, slow and convenient.
      Error correcting? When someone says that data is error correcting they do not really mean that it is; they simply mean that the data has extra info in it that will allow it to be corrected when an algorithm is applied to it. (Reed-Solomon is pretty common https://en.wikipedia.org/wiki/Reed–Solomon_error_correction ) ASCII has no such functionality. You can apply this to ASCII, but the end result would require a special algorithm to read, and the only thing you would get from using ASCII is "bigger data"... but if you are trying to get a grant and want to be able to say "Oh, we have soo much data! We must have more and bigger machines!" then I guess there is a reason to inflate the data volume.

      I will give you three free pro tips (since I have worked with "big data" quite a lot)
      - Adapt the format of the data to the nature of the data.
      - Write software to validate and repair the data.
      - Write software to work the data. (If you want to work with the data in ASCII write software to check it in and out from your "big data storage")
      Ah well, here is a fourth tip... "Spend a little now or a lot later" applies to data too. Spend some time looking at your data and try to figure out every usage case and what can go wrong. Write the software you will need then, now.

      Here is also some real flame bait... Agile sux! It's meant to quickly get crap products to hungry teenagers before your stockholders tear you down. It has no place in a serious project.

    13. Re:Digital DNA storage anyone ? by the+gnat · · Score: 1

      The comparison is between conventional storage in ASCII format, versus storing information in DNA. Whether ASCII is optimally fast and error free on a purely objective scale, among other forms of conventional digital electronic storage, is irrelevant to the question I was answering.

    14. Re:Digital DNA storage anyone ? by Anonymous Coward · · Score: 0

      So, basically you are saying that you don't have a clue?

  7. The problem will solve itself by Krishnoid · · Score: 5, Funny

    To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.

    Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.

    1. Re:The problem will solve itself by c0lo · · Score: 1

      To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.

      Once that happens, they'll be able to stop storing it on DVDs and move it into the cloud.

      And before anyone knows, we would have a space elevator within the next 5 years instead of the eternal +25.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    2. Re:The problem will solve itself by Swampash · · Score: 1

      Please, do continue measuring the massless sizeless thing in units of things with mass and size. It makes lots of sense.

  8. This just goes to show... by Gavin+Scott · · Score: 3, Informative

    ...what a shitty storage medium DVDs are these days.

    A cheap 3TB disk drive burned to DVDs will produce a rather unwieldy tower of disks as well.

    G.

    1. Re:This just goes to show... by fonske · · Score: 1

      I remember the picture of Bruce Dickinson (of heavy metal fame) taking a big bite out of two compact discs filled like a (big) sandwich with all things that make you fat, meant to illustrate the robustness of CD's.
      I also remember the feeling when I had to face it that I lost data beyond repair on a CD "backup".

    2. Re:This just goes to show... by delt0r · · Score: 1

      It does not sound that impressive when you say a "box of hard drives" after a year, and a whopping 5 boxes after 5 years.

      Of course it would sound more impressive if they used a stack of punched cards....

      --
      If information wants to be free, why does my internet connection cost so much?
  9. Simple. Get the NSA to do it. by Anonymous Coward · · Score: 5, Funny

    Publish a scientific paper stating that potential terrorists or other subversives can be identified via DNA sequencing. The NSA will then covertly collect DNA samples from the entire population, and store everyone's genetic profiles in massive databases. Government will spend the trillions of dollars necessary without question. After all, if you are against it, you want another 9/11 to happen.

  10. Database Replication by VortexCortex · · Score: 4, Insightful

    Bit rot is also a big problem with data. So, the data has to be reduplicated to keep entropy from destroying it, which means self-correcting metadata must be used. If only there were a highly compact self correcting self replicating data storage system with 1's and 0's the size of small molecules...

    My greatest fear is that when we meet the aliens, they'll laugh, stick us in a holographic projector, and gather around to watch the vintage porn encoded in our DNA.

    1. Re:Database Replication by __aasqbs9791 · · Score: 3, Funny

      I propose we call this new data method Data Neutral Assembly.

    2. Re:Database Replication by phriot · · Score: 1

      If only there were a highly compact self correcting self replicating data storage system with 1's and 0's the size of small molecules...

      In the future, if sequencing becomes extremely fast and cheap, it might make sense to discard sequencing data after analysis and leave DNA in its original format for storage. That said, if the colony of (bacteria/yeast/whatever you are maintaining your library in) that you happen to pick when you grow up a new batch to maintain the cell line happened to pick up a mutation in your gene of interest, you won't know until you sequence it again. I'm a graduate student in a small academic lab and if I want to "access my stored gene data" in the way you suggest, I need to: 1) Grow an overnight culture from my freezer stock of E. coli carrying a plasmid with my gene of interest inserted in it. 2) Isolate the plasmid DNA. 3) Take a reading on a spectrometer to determine DNA concentration. 4) Prepare a sample for sequencing at the concentration the Core Facility prefers. 5) Fill out an order form for sequencing. 6) Walk the sample over to the Core Facility. 7) Wait 1 to 3 days to get my sequence data back. I can pull up the FASTA file I have from the last time I got this gene sequenced in about 15 seconds.

    3. Re:Database Replication by c0lo · · Score: 2

      Bit rot is also a big problem with data.

      Take a whiff from a piece of meat after 2 weeks at room temperature and compare it with how a DVD smells after the same time.
      Complaints about bit rot accepted only after the experiment.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    4. Re:Database Replication by chihowa · · Score: 2

      I know you were going for funny, but much of what you will be smelling in your experiment is from bacteria eating the protein and polysaccharides in the meat. The DNA is remarkably stable and even if some of it is fragmented, you have a massively redundant set in your pile of meat.

      We've sequenced DNA from nearly a million years ago and I regularly store DNA dried out and stuck to a piece of paper. DVDs won't last nearly that long before the dyes start to break down. For a long term archival system, we could do much worse than DNA.

      --
      If you want a vision of the future, imagine a youtube comments section scrolling - forever.
  11. Nice amount of intellectual capacity by evolution by Anonymous Coward · · Score: 0

    Pretty amazing amount of intellectual capacity, generated by evolution.

  12. Just use DNA to store the data. by Anonymous Coward · · Score: 0

    Problem solved.

  13. 2000 devices make a lot of data by hawguy · · Score: 2

    It seems a little overly sensationalist to aggregate the devices together when determining the storage size to make such a dramatic 2 mile high tower of DVD's... If you look at them individually, it's not that much data:

    (15 x 10^15 bytes/device) / (2000 devices) / (1 x 10e9 bytes/gb) = 7500GB, or 7.5TB

    That's a stack of 4TB hard drives 2 inches high. Or if you must use DVD's, that's a stack of 1600 DVD's 2 meters high.
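
    The per-device figures can be checked in a few lines. The DVD capacity (4.7 GB, single layer) and thickness (1.2 mm) are the usual nominal values and are assumptions here, since the article does not say which it used:

```python
TOTAL_BYTES = 15e15        # 15 PB per year, from the summary
DEVICES = 2000
DVD_BYTES = 4.7e9          # single-layer DVD capacity (assumed)
DVD_THICKNESS_MM = 1.2     # nominal disc thickness (assumed)

per_device_gb = TOTAL_BYTES / DEVICES / 1e9
dvds = per_device_gb * 1e9 / DVD_BYTES
stack_m = dvds * DVD_THICKNESS_MM / 1000

# Same calculation for the whole 15 PB, in kilometres.
total_stack_km = TOTAL_BYTES / DVD_BYTES * DVD_THICKNESS_MM / 1e6

print(round(per_device_gb), round(dvds), round(stack_m, 2))  # 7500 1596 1.91
print(round(total_stack_km, 2))  # 3.83
```

    With those assumptions the aggregate stack comes out a bit over 3.8 km, which is consistent with the article's "more than 2 miles".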

    1. Re:2000 devices make a lot of data by Anonymous Coward · · Score: 0

      Let's look at your units.

      bytes / device / devices / (bytes/gb) = gb/devices^2

      Oops! So the math SHOULD be 15e15*2e3/1e9 = 30e9. GB, not Gb, incidentally. That's actually kind of a lot.

    2. Re:2000 devices make a lot of data by hawguy · · Score: 1

      ---
      Let's look at your units.

      bytes / device / devices / (bytes/gb) = gb/devices^2

      Oops!  So the math SHOULD be 15e15*2e3/1e9 = 30e9.  GB, not Gb, incidentally.  That's actually kind of a lot.
      ---

      Thanks for the critique of the typos in my hastily typed out formula, but it would have meant more if you were correct.

      I spent 10 minutes trying to type out a real formula that would pass Slashdot's "junk" filter, but it kept telling me I had too many junk characters, so here's the closest I could get to a real formula that shows you how the units cancel out (and I had to switch over to "code" format, since neither the <code> nor <pre> tags preserved the spaces). And note that the article said 15PB aggregated across all 2000 devices, not 15PB for each.

      15 petabytes    1 x 10^15 Bytes         GigaByte          7500 GigaBytes
      oooooooooooo  X oooooooooooooooooooo X ooooooooooooooo  = ooooooooooooo
      2000 devices       1 PetaByte           1 x 10^9 Bytes     device

      Here's the math: https://duckduckgo.com/?q=15+%2F+2000+*+10%5E15+%2F+10%5E9

  14. LHC by Azure+Flash · · Score: 1

    The LHC generates a shitton of data as well, but from what I've seen (something like this) they use extremely fast integrated circuits to skim the data. Perhaps geneticists could use a similar technique.

    1. Re:LHC by delt0r · · Score: 1

      We don't in fact produce that much data. The reason they have a lot of data they don't know what to do with is because they didn't do any experimental design in the first place. And it's cheap to keep intermediate data around you probably don't need to keep anyway.

      --
      If information wants to be free, why does my internet connection cost so much?
  15. Storage Non-Problem - Sequences Compresses to MBs by esten · · Score: 5, Informative

    Storage is not the problem. Computational power is.

    Each genetic sequence is ~3GB, but since sequences between individuals are very similar it is possible to compress them by only recording the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).
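
    Working backwards from the figures in the paragraph above shows what they imply. All three inputs are taken from the comment itself; the population number is derived from them, not independently sourced:

```python
GENOME_DIFF_MB = 20          # per-person diff size from the comment
WORLD_TOTAL_PB = 132         # the comment's figure for everybody
WORLD_STORAGE_EB = 295       # the comment's worldwide storage figure

implied_people = WORLD_TOTAL_PB * 1e15 / (GENOME_DIFF_MB * 1e6)
share_percent = WORLD_TOTAL_PB / (WORLD_STORAGE_EB * 1000) * 100

print(f"{implied_people:.2e} people, {share_percent:.3f}% of storage")
```

    So the 132 PB figure corresponds to about 6.6 billion people at 20 MB each, and to a little under 0.05% of the quoted worldwide total.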

    Now the real challenge is more in having enough computational power to read and process the 3 billion letters genetic sequence and designing effective algorithms to process this data.

    More info on compression of genomic sequences
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166/

  16. The answer is obvious! by plopez · · Score: 3, Funny

    They should use a NoSQL multi-shard vertically integrated stack with a RESTful rails driven in-memory virtual multi-parallel JPython enabled solution.

    Bingo!

    --
    putting the 'B' in LGBTQ+
    1. Re:The answer is obvious! by Tablizer · · Score: 1

      They should use a NoSQL multi-shard vertically integrated stack with a RESTful rails driven in-memory virtual multi-parallel JPython enabled solution.

      Brog, that tech stack is like soooo month-ago

    2. Re:The answer is obvious! by Anonymous Coward · · Score: 0

      Is it web-scale?

      http://www.youtube.com/watch?v=b2F-DItXtZs

    3. Re:The answer is obvious! by Samantha+Wright · · Score: 1

      Wait, I've heard this one before.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    4. Re:The answer is obvious! by K.+S.+Kyosuke · · Score: 1

      They should use a NoSQL multi-shard vertically integrated stack with a RESTful rails driven in-memory virtual multi-parallel JPython enabled solution.

      Sounds like the technological equivalent of the human body => sounds about right!

      --
      Ezekiel 23:20
    5. Re:The answer is obvious! by wvmarle · · Score: 1

      Come on, those are yesterday's buzzwords! You can do better, I'm sure.

  17. AO-Hell metrics... by geekmux · · Score: 1

    "...At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station."

    And another 10 years after that, the amount of DVDs used will have almost reached the number of AOL CDs sitting in landfills.

    Sorry, couldn't help myself with the use of such an absurd metric. Not like we haven't moved on to other forms of storage the size of a human thumbnail that offer 15x the density of a DVD...

    1. Re:AO-Hell metrics... by Samantha+Wright · · Score: 1

      I think you mean "exciting and hitherto unleveraged microwaveable coaster opportunities."

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
  18. Re:Simple. Get the NSA to do it. by Tablizer · · Score: 2

    You mean ask the NSA how they've already done it.

  19. Re:Nice amount of intellectual capacity by evoluti by hedwards · · Score: 2

    Yes, but that took millions of years to develop the simplest versions.

    It's astonishing that it took humans only a few millennia to get to that point on our own.

  20. I have a solution, molecular storage by Proudrooster · · Score: 1

    If there were only some way to store the information encoded in DNA in a molecular level storage device... oh wait, face palm.

  21. Re:Storage Non-Problem - Sequences Compresses to M by Anonymous Coward · · Score: 0

    Yes, storing the finished results of human genomic sequencing is a solved problem. The interim data, though, is massive. Also, not every sequencing experiment is of good ol' Homo sapiens.

  22. Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 5, Interesting

    Please... entire DNA genomes are tiny... on the order of 1Gb, with no compression. Taking into account the huge similarities to published genomes, we can compress that by at least 1000X. What they are talking about is the huge amount of data spit out by the sequencing machines in order to determine your genome. Once determined, it's tiny.

    That said, what I need is raw machine data. I'm having to do my own little exome research project. My family has a very rare form of X-linked color blindness that is most likely caused by a single gene defect on our X chromosome. It's no big deal, but now I'm losing central vision, with symptoms most similar to late-onset Stargardt's disease. My UNC ophthalmologist beat the experts at Johns Hopkins and Jacksonville's hospital, and made the correct call, directly refuting the other doctors' diagnosis of Stargardt's. She thought I had something else and that my DNA would prove it. She gave me the opportunity to have my exome sequenced, and she was right.

    So, I've got something pretty horrible, and my ophthalmologist thinks it's most likely related to my unusual form of color blindness. My daughter carries this gene, as does my cousin and one of her sons. Gene research to the rescue?!? Unfortunately... no. There are simply too few people like us. So... being a slashdot sort of geek who refuses to give up, I'm running my own study. Actually, the UNC researchers wanted to work with me... all I'd have to do is bring my extended family from California to Chapel Hill a couple of times over a couple of years and have them see doctors at UNC. There's simply no way I could make that happen.

    Innovative companies to the rescue... This morning, Axeq, a company headquartered in MD, received my family's DNA for exome sequencing at their Korean lab. They ran an exome sequencing special in April: $600 per exome, with an order size minimum of six. They have been great to work with, and accepted my order for only four. Bioserve, also in MD, did the DNA extraction from whole blood, and they have been even more helpful. The blood extraction labs were also incredibly helpful, once we found the right places (very emphatically not Labcorp or Quest Diagnostics). The Stanford clinic lab manager was unbelievably helpful, and in LA, the lab director at the San Antonio Hospital Lab went way overboard. So far, I have to give Axeq and Bioserve five stars out of five, and the blood draw labs deserve a six.

    Assuming I get what I'm expecting, I'll get a library of matched genes, and also all the raw machine output data, for four relatives. The output data is what I really need, since our particular mutation is not currently in the gene database. Once I get all the data, I'll need to do a bit of coding to see if I can identify the mutation. Unfortunately, there are several ways that this could be impossible. For example, "copy number variations", or CNVs, cannot be detected with current technology if they go on for more than a few hundred base pairs. Ah... the life of a geek. This is yet another field I have to get familiar with...

    --
    Celebrate failure, and then learn from it - Nolan Bushnell
    1. Re:Oddly... I have a clue about this stuff lately by Samantha+Wright · · Score: 1

      CNVs actually can be detected if you have enough read depth; it's just that most assemblers are too stupid (or, in computer science terms, "algorithmically beautiful") to account for them. SAMTools can generate a coverage/pileup graph without too much hassle, and it should be obvious where significant differences in copy number occur.

      (Also, the human genome is about 3.1 gigabases, so about 3.1 GB in FASTA format. De novo assemblies will tend to be smaller because they can't deal with duplications.)

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    2. Re:Oddly... I have a clue about this stuff lately by B1ackDragon · · Score: 1

      Dang, this is cool. In a post above I mentioned I work as a bioinformatician at a sequencing center--if you need guidance or advice, contact me and I'll see if I can point you in the right direction!

      --
      The snow doesn't give a soft white damn whom it touches. -- ee cummings
    3. Re:Oddly... I have a clue about this stuff lately by amacbride · · Score: 1

      As this sort of thing is my day job, I find this sort of thing really cool, and I'd be happy to help if I can. I'd recommend looking into snpEff, it's pretty straightforward to use, and is available on SourceForge. (Feel free to track me down and message me, I think we overlapped at Cal.)

    4. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      That sucks man. I'd suggest finding a bioinformatics grad student to help you. It's much trickier than you think and you'll go blind before you figure it out.

      Someone who knows what they are doing could do it in weeks; it would take you much longer just to begin figuring it out.

    5. Re:Oddly... I have a clue about this stuff lately by the+biologist · · Score: 1

      I agree, CNVs are really easy to detect if you have the read depth. I've been using the samtools pileup output to show CNVs in my study organism. However, to make the results mean anything to most people, I've got to do a few more steps of processing to get all that data in a nice visual format.

      If you don't have the read depth, you lose the ability to discriminate small CNVs from noise. Large CNVs, such as for whole chromosomes, are readily observed even in datasets with minimal coverage.
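
      The depth-based idea described in this sub-thread can be sketched in a few lines. This is a toy illustration only: it assumes you already have per-position depths for a sample and a control as plain lists (for example, parsed from a pileup), and the window size and the 1.5x / 0.5x thresholds are arbitrary assumptions, not values from any real CNV caller.

```python
def window_means(depths, window):
    """Average depth over consecutive non-overlapping windows."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means

def copy_ratio_calls(sample, control, window=4, gain=1.5, loss=0.5):
    """Label each window by the sample/control depth ratio (thresholds are assumptions)."""
    calls = []
    for s, c in zip(window_means(sample, window), window_means(control, window)):
        if c == 0:
            calls.append("no-data")   # cannot form a ratio without control coverage
        elif s / c >= gain:
            calls.append("gain")
        elif s / c <= loss:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls

sample = [30, 31, 29, 30, 60, 62, 58, 60, 15, 14, 16, 15]
control = [30] * 12
print(copy_ratio_calls(sample, control))  # ['normal', 'gain', 'loss']
```

      With shallow coverage the window means get noisy, which is why small events become hard to tell apart from noise.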

    6. Re:Oddly... I have a clue about this stuff lately by mapkinase · · Score: 1

      >I need is raw machine data

      Too bad genome centers disagree with you (I, au contraire, agree with you). We need raw NMR data for structures as well.

      --
      I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
    7. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      The problem is not compressing the sequence data efficiently. The problem is compressing the quality values. (DNA sequencing is not 100% perfect; each base that comes off the machine comes with an error estimate.) The error rates tend to be quite random from base to base, and so are hard to compress. They also make up the bulk of the data (typically 5-8 bytes per base of sequence).

      The quality values are really important for many applications. Without them, you can't tell whether your rare SNP is real, or simply sequencing error...
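
      For readers who have not met quality values before, here is a small sketch of how a FASTQ quality string maps to error estimates. It assumes the common Phred+33 encoding, where each character's ASCII code minus 33 is a score Q and the error probability is 10^(-Q/10); older files may use a different offset, so treat the default as an assumption.

```python
def phred_to_error_probs(quality_string, offset=33):
    """Convert a FASTQ quality string to per-base error probabilities."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]

# 'I' is Q40, '5' is Q20, '+' is Q10 and '!' is Q0 under Phred+33.
probs = phred_to_error_probs("I5+!")
for ch, p in zip("I5+!", probs):
    print(ch, p)
```

      Because neighbouring scores vary almost independently, a general-purpose compressor finds little repetition to exploit in them.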

    8. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      Final assembly might end up being something like 1 GB, but nowadays it's made from short reads that easily amount to 500 GB.

    9. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Thanks! I certainly will need some guidance, so if you don't mind, I'll ping you when I get the data. Same thing for the guy below who also offered to help.

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    10. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Thanks! I will check out snpEff. I certainly will need some help, so if you don't mind, I will contact you when I get the data. Same thing for the guy above who offered to help.

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    11. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Thanks, guys, for the CNV info. I'm doing only 30-deep sequencing, but I will get 3 exome sequences all probably having the same defect on the X chromosome. Combining the data should give me some reasonable CNV detection ability.

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    12. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      You might want to contact researchers at the Greenwood Genetic Center, South Carolina (ggc.org). They specialize in X-linked disorders. Recently they have developed expertise in genome sequencing as well.

      Hope it helps.

    13. Re:Oddly... I have a clue about this stuff lately by heuermh · · Score: 1

      I also do this for my day job, from the side of downstream variant analysis and population genetics, so please feel free to contact me as well.

    14. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Will do! I wasn't expecting so much generosity in reply to my post, but thanks! I'm no dummy, but all I have is a Wikipedia level of knowledge of genetics, so any help I can get will be very much appreciated.

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    15. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      Please... entire DNA genomes are tiny... on the order of 1Gb, with no compression.

      Some salamander genomes can be 90Gb. I imagine some plant genomes can get pretty large too.

      Most mammalian genomes (that have been sequenced so far) fall in the 1-4Gbp range for a single individual, but the majority of the sequence is derived from repeat elements (around 67% of the human genome can be positively identified as repeat-derived sequence). But certainly, if we're just sequencing more and more human genomes, we can compress the additional data.

      Except when something is published, one needs to hang on to the original data in case someone wishes to replicate the results. So, even though you've taken those sequencing reads, mapped them to the genome, and come up with all the SNPs and CNVs, indels, and rearrangements with respect to the reference sequence, you still need to save those original files of raw reads. You might only ever release the filtered results of your analysis to the world, but you still need to have that data on hand in case someday you might need it.

      To add to that, you never just sequence 1x coverage of a genome. Because of the way genomic DNA libraries are generated, you sequence 10x, 20x, 50x, etc. depending on the depth of coverage needed for your application. So while the genome might be 3Gb, you might have 150Gb of data. But wait, there's more. Sequencers also generate quality scores for each base. This quality score represents how confident the sequencer was that it identified the base correctly. This information is also stored along with the raw reads. So, for every 3Gb of sequence you have, you also have 3Gb of associated quality scores. For that 50x coverage of a genome, you now have 300Gb of data. Granted, there's nothing stopping you from running standard compression algorithms on the data to store it... Oh, and I forgot to mention, you might have a copy of your data (which you've hopefully backed up somewhere), but the sequencing lab likely also has a copy of your data in backup somewhere.

      But yes, as we approach the $1000/genome (to sequence), the question becomes how do we handle the rapid influx and storage of data. There is such a deluge of data that you almost need to be part computer scientist just to have the skillset to deal with it all. So what do we do? Do we change old habits and throw away all the parts but what we need? Do we sequence less (e.g. the parent poster's exome sequencing) and hope that what we're looking for is in the parts of the genome we've deemed 'important' enough to examine? Or do we just hope the computing industry can keep up with our needs?

      (I run a genome sequencer.)
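
      The coverage arithmetic in the comment above, as a sketch. It assumes one unit of storage per base and one more per quality score, and it ignores read headers and compression; the function name is made up for illustration.

```python
def raw_data_gb(genome_gb, coverage, with_quality=True):
    """Estimate raw read volume, in the same units as genome_gb."""
    bases = genome_gb * coverage
    # Each base carries one quality score of roughly the same size.
    return bases * 2 if with_quality else bases

print(raw_data_gb(3, 50, with_quality=False))  # 150
print(raw_data_gb(3, 50))                      # 300
```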

    16. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      Er... DNA sequencer, I meant to say.

    17. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Well... I hope I can get all the read data. Since I'm doing exome sequencing, rather than genome sequencing, shouldn't the raw data be something like 100X less, or around 5Gb per exome?

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    18. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Yeah... I know you're right. I'm a fast learner, and I jump into all sorts of fields and make waves. One thing I've learned is that being smart gets you only so far. There's no substitute for real-world experience.

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    19. Re:Oddly... I have a clue about this stuff lately by WaywardGeek · · Score: 1

      Ok, I see how this is a real issue now. If it costs $50 to store a genome ($100 for a 1 TB drive, and 500 gigabytes per genome of machine data), and the lab wants a copy as well as the user, and a backup somewhere, that's $150, which is significant when we imagine the entire process dropping to $1,000. The guys drawing blood and extracting DNA need their money, too, which frankly should be $100 to $200. Even shipping isn't cheap. Dry ice all the way to Korea has to cost a ton. My package was 17 lbs! When trying to get the overall cost below $1,000, every dollar counts.

      On the positive side, Moore's Law still applies to data storage, which will get cheaper every year. On the other hand, our genome is not likely to require more storage over time. This might be a short-term problem.

      --
      Celebrate failure, and then learn from it - Nolan Bushnell
    20. Re:Oddly... I have a clue about this stuff lately by Anonymous Coward · · Score: 0

      Target selection really mucks up the depth of coverage, and depending on the CNV detection algorithm, small events may be undetectable. For the last chapter of my thesis, I inserted synthetic CNVs into human exome data, and if they perturb depths by less than 2x (say 1.5x, because hybrid capture may be non-linear), then many events are not detectable. Also, high copy numbers are pretty much impossible to determine. IMO, CNVs are the best reason to go for a genome instead of an exome.

    21. Re:Oddly... I have a clue about this stuff lately by blach · · Score: 1
      If I could add a little bit about CNV -- yes, it can be detected from ExomeSeq, and yes, you can infer it to some extent from sequencing depth, given adequate depth. BUT there are a few caveats. First, ExomeSeq is typically amplicon based, and not all amplicons have uniform amplification. Second, while you could make gross calls (heterozygous deletion, 3x or greater amplification) from exome data alone, it would be hard to say that one area had 1.6x (for example) amplification without really massive sequencing depth. To make better CNV calls with exome data, it is useful to have control DNA (as in the arrayCGH technique, which has heretofore been the standard for detection of CNV) sequenced under the exact same conditions at the exact same time, in order to do a better genome-wide* circular binary segmentation procedure.

      * actually you would probably only want to simulate probes at the center of each exon target region in your whole-exome sequencing kit; this should be available from the kit vendor

      Best of luck to you and your family.

  23. Re:Storage Non-Problem - Sequences Compresses to M by Anonymous Coward · · Score: 1

    Each genetic sequence is ~3GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB.

    That's true, but the problem is that as a good scientist you are required by most journals and universities to keep the original sequence data coming off these high-throughput sequencers (aka the fastq files) so that you can show your work, so to speak, if it ever comes into question. These files often contain 30-40x coverage of your 3Gb reference sequence and even compressed are still several GB in size. Additionally, because these large-scale sequencing projects cost millions of dollars, the NIH isn't going to be happy if you lose the data due to drive failure, so you'll need that data duplicated using a RAID setup and offsite backup. So storage is actually a huge problem.

  24. Re:Storage Non-Problem - Sequences Compresses to M by timeOday · · Score: 1

    That's your own germline DNA. But it would be cool to get the distinct sequences of all the cells in your body. Most of those cells (by count, not mass) are various microorganisms, lots in your gut, or infections that are making you sick or wearing down your immune system, or latent conditions like HIV or HPV, and you would see a few strains of precancerous / cancerous cells evolving too. Taken altogether that would be a huge amount of DNA. But I guess a lot of the distinct genomes are localized and you couldn't sample them easily.

  25. To put this into perspective by khchung · · Score: 1

    To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.

    NO. This does not put anything "into perspective", except it meant "a lot of data" for the average Joe.

    To put it into useful perspective, we should compare with large data encountered in other sciences, such as 25PB per year from the LHC. And that's after aggressively discarding collisions that don't look promising in the first pass; it would be orders of magnitude bigger otherwise.

    But now just 15PB per year doesn't look that newsworthy, eh?

    --
    Oliver.
    1. Re:To put this into perspective by Samantha+Wright · · Score: 1

      Well, if you really need to have that kind of contest...

      The data files being discussed are text files generated as summaries of the raw sensor data from the sequencing machine. In the case of Illumina systems, the raw data consists of a huge high-resolution image; different colours in the image are interpreted as different nucleotides, and each pixel is interpreted as the location of a short fragment of DNA. (Think embarrassingly parallel multithreading.)

      If we were to keep and store all of this raw data, the storage requirements would probably be a thousand to a million times what they currently are, to say nothing of the other kinds of biological data that are captured on a regular basis, like raw microarray images.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    2. Re:To put this into perspective by khchung · · Score: 1

      Not really trying to turn it into a contest, but just "to put this into perspective". More or less, the point is other science projects have been dealing with similar data volume for a few years already, if there is anything newsworthy about this "DNA Data Deluge", it better be something more than just the data volume.

      --
      Oliver.
    3. Re:To put this into perspective by Samantha+Wright · · Score: 1

      Even within biology this is pretty stale news. I'm pretty sure this story is technically a shill piece for the products mentioned: Hadoop and Amazon EC2.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
  26. Yet more perspective by Anonymous Coward · · Score: 0

    Or, to put it in even better perspective, if you encoded each bit in a Rubik’s cube and stacked them end to end, it would stretch for half a light year. With the data increasing 5x every year, in less than 2 years the stack will reach to Alpha Centauri and back. In under 8 years, it will be the width of the galaxy. In 16 years, it will be as wide as the entire universe.

    Or perhaps a more reasonable perspective is to realize that the entire genetic data collected each year by all sequencers in all labs and hospitals in the world, if stored on SATA disks, would fit in a Subaru Forester.

  27. Re:Storage Non-Problem - Sequences Compresses to M by B1ackDragon · · Score: 3, Informative

    This is very much the case. I work as a bioinformatician at a sequencing center, and I would say we see around 50-100G of sequence data for the average run/experiment, which isn't really so bad, certainly not compared to the high energy physics crowd and given a decent network. The trick is what we want to do with the data: some of the processes are embarrassingly parallel, but many algorithms don't lend themselves to that sort of thing. We have a few 1 TB RAM machines, and even those are limiting in some cases. Many of the problems are NP-hard, and even for the heuristics we'd ideally use superlinear algorithms, but we can't have that either; it's near-linear time (and memory) or bust, which sucks.

    I'm actually really looking forward to a vast reduction in dataset size and cost in the life sciences, so we can make use of and design better algorithmic methods and get back to answering questions. That's up to the engineers designing the sequencing machines though.

    --
    The snow doesn't give a soft white damn whom it touches. -- ee cummings
  28. a straightforward solution by mtrachtenberg · · Score: 1

    "At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station."

    Use more than one stack. You're welcome.

  29. Re:Storage Non-Problem - Sequences Compresses to M by Anonymous Coward · · Score: 3, Interesting

    A single finished genome is not the problem. It is the raw data.

    The problem is that any time you sequence a new individual's genome for a species that already has a genome assembly, you need minimum 5x coverage across the genome to reliably find variation. Because of variation in coverage, that means you may have to shoot for >20x coverage to find all the variation. The problem is more complex when you are trying to de novo assemble a genome for a species that does NOT have a genome assembly. In this case, you often have to aim for at least 40x coverage (and in the 100x range may be better).

    To get the data, we use next-gen sequencing. To give you an idea of the data output, a single Illumina HiSeq 2000 run produces 3 billion reads. Each "read" is a pair of genomic fragments 100 bases long. That means 600,000,000,000 bases are produced in a single run. The run is stored as a .fastq file, meaning that each base is stored as an ASCII character, and has an associated quality score stored as another ASCII character. So that's 1.2 trillion ASCII characters for a single run, or about 1.09 terabytes uncompressed. This does not include the storage for the (incompressible) images taken by the sequencing machine in order to call the bases. They can be an order of magnitude larger. A single experiment may involve dozens of such runs.
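
    The run-size arithmetic above, as a short sketch. It counts only base and quality characters (read headers and newlines in a real .fastq file are ignored), and it treats a "terabyte" as 2^40 bytes, which is the convention that reproduces the 1.09 figure.

```python
reads = 3_000_000_000      # read pairs per run (from the comment)
bases_per_read = 2 * 100   # each read is a pair of 100-base fragments
chars_per_base = 2         # one base character plus one quality character

total_bases = reads * bases_per_read
total_chars = total_bases * chars_per_base
tebibytes = total_chars / 2**40

print(total_bases)          # 600000000000
print(round(tebibytes, 2))  # 1.09
```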

    There is an expectation that these runs will be made available in a public repository when an analysis is published. That puts great stress on places like NIH, where 1.7 quadrillion raw bases have been uploaded in about the last four years:
    http://www.ncbi.nlm.nih.gov/Traces/sra/

    You are correct when you say that computational power is a bigger problem, but again, this is not related to the three billion bases of the genome, which is trivial in size. Once again, the problem is the raw data. When assembling a new species' genome from scratch, you somehow have to reassemble those 3 billion pairs of 100-base reads. The way that is done is by hashing every single read into pieces about 21 nucleotides long, then storing them all, creating a de Bruijn graph, and navigating through it. The amount of RAM required for this is absolutely insane.
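
    To make the k-mer step concrete, here is a toy sketch of building a de Bruijn graph from reads. It is a simplified illustration, not how any particular assembler does it: real tools use compact encodings and also handle reverse complements and sequencing errors. The example uses k=4 on two made-up reads, where the comment describes k of about 21 on billions of reads.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTAC", "GTACGG"]  # made-up reads for illustration
graph = build_de_bruijn(reads, 4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

    Every distinct k-mer becomes an edge that has to sit in memory at once, which is where the RAM goes.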

  30. Re:Storage Non-Problem - Sequences Compresses to M by Anonymous Coward · · Score: 0

    Actually the larger problem at the moment is not so much computational power; it's the extremely large amount of memory (TB range) needed to assemble the genomes. 1000-core computers are fairly common; ones with 2 TB of memory that are available for weeks at a time are not.

    Oddly enough this won't be a problem as newer machines make longer reads which can be assembled with less memory.

  31. Moores Law? by Anonymous Coward · · Score: 0

    I never knew Moore's law (which applies to transistor count) now applies to the cost of DNA sequencing....

    Here's hoping Moore's law applies to the housing market soon too!

  32. 600 bytes per person by bob_jenkins · · Score: 1

    If you have the genomes of your parents, and your own genome, yours is about 70 new spot mutations, about 60 crossovers, and you have to specify who your parents were. About 600 bytes of new information per person. You could store the genomes of the entire human race on a couple terabytes if you knew the family trees well enough. I tried to nail down the statistics for that in http://burtleburtle.net/bob/future/geninfo.html .
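
    A back-of-envelope sketch of where a figure of that size could come from. The encoding here is my own assumption, not necessarily the one used on the linked page: a fixed-width genome position for each event, 2 bits for the new base of each mutation, and a fixed-width ID for each parent.

```python
import math

GENOME_POSITIONS = 3_100_000_000   # assumption: about 3.1 gigabases
WORLD_POPULATION = 7_000_000_000   # assumption

position_bits = math.ceil(math.log2(GENOME_POSITIONS))   # bits to address one base
parent_id_bits = math.ceil(math.log2(WORLD_POPULATION))  # bits to name one person

mutations = 70    # new spot mutations (from the comment)
crossovers = 60   # crossover points (from the comment)

total_bits = (mutations * (position_bits + 2)   # position plus 2-bit new base
              + crossovers * position_bits      # position only
              + 2 * parent_id_bits)             # who the parents were
total_bytes = total_bits / 8
print(position_bits, parent_id_bits, total_bytes)
```

    Under these assumptions the total lands in the mid-500s of bytes, the same ballpark as the figure above.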

  33. Heroic acts along with create the globe by rs3gold · · Score: 1

    Within your character's lifestyle as being a main character, you can enterprise in a few quests to perform heroic acts along with create the globe a greater spot for a are in.

  34. Yay, AdEnine & 1 click splicing by charlesjo488 · · Score: 4, Funny

    Scientists who viewed this sequence also viewed these sequences...

    1. Re:Yay, AdEnine & 1 click splicing by blach · · Score: 1

      [ LIKE ] Be the first of your friends to like this gene!

  35. Good thing there are new algorithms by Anonymous Coward · · Score: 0

    As an example, I used to build phylogenetic trees with an algorithm called RAxML. Along comes something called FastTree, which is just as accurate, or nearly so, but 1-2 orders of magnitude faster.

    1. Re:Good thing there are new algorithms by K.+S.+Kyosuke · · Score: 1

      That's probably because it doesn't have "XML" in its name!

      --
      Ezekiel 23:20
  36. doubling doubling by mlush · · Score: 1

    The amount of biological data doubles every 9 months and processing power doubles every 18 months; we have already reached the crossover point.
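
    A quick sketch of how fast that gap widens, taking the two doubling periods above at face value (they are the commenter's figures, not measured here).

```python
def growth(months, doubling_period_months):
    """Growth factor after a number of months with a fixed doubling period."""
    return 2 ** (months / doubling_period_months)

for years in (1, 3, 5):
    months = 12 * years
    gap = growth(months, 9) / growth(months, 18)
    print(years, "years: data outgrows compute by a factor of", round(gap, 1))
```

    The ratio itself doubles every 18 months, so after three years data has grown four times as much as processing power.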

  37. Actually, it's only 15 discs per year by Anonymous Coward · · Score: 0

    15 discs is enough, according to the article on storing 1 petabyte on a single DVD we had just one week ago. Problem solved!

  38. I'll send my invoice later by govett · · Score: 1

    Why not have a reference genome? For everyone else, simply store deviations from the reference. Seems a possibility.

  39. Re:Storage Non-Problem - Sequences Compresses to M by Kjella · · Score: 1

    Each genetic sequence is ~3GB, but since sequences between individuals are very similar, it is possible to compress them by recording only the differences from a reference sequence, making each genome ~20 MB. This means you could store a sequence for everybody in the world in ~132 PB, or 0.05% of total worldwide data storage (295 exabytes).

    For a single delta to a reference, but there's probably lots of redundancy in the deltas. If you have a tree/set of variations (Base human + "typical" Asian + "typical" Japanese + "typical" Okinawa + encoding the diff) you can probably bring the world estimate down by a few orders of magnitude, depending on how much is systematic and how much is unique to the individual.

    --
    Live today, because you never know what tomorrow brings
  40. Compression by Reference by sowalsky · · Score: 1

    Sequence read archives (such as the one hosted by NCBI), which serve as repositories for this sequencing data, use "compression by reference," a highly efficient way to compress and store a lot of the data. The raw data that comes off these sequencers is often >99% homologous to the reference genome (such as human, etc.), so the most efficient way to compress and store this data is to record only what is different between the sequence output and the reference genome.
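
    A toy sketch of the idea, for illustration only. It handles substitutions between equal-length sequences and nothing else; a real archive format also has to deal with insertions, deletions, read alignment and quality scores, and none of that is shown here. The function names and sequences are made up.

```python
def diff_against_reference(reference, sample):
    """Record (position, base) for every substitution relative to the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def reconstruct(reference, diffs):
    """Rebuild the sample by applying the recorded substitutions to the reference."""
    seq = list(reference)
    for pos, base in diffs:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = diff_against_reference(reference, sample)
print(diffs)  # [(4, 'T'), (9, 'A')]
print(reconstruct(reference, diffs) == sample)  # True
```

    The more closely a sample matches the reference, the shorter the diff list, which is why highly similar genomes compress so well this way.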

  41. Re:Storage Non-Problem - Sequences Compresses to M by mapkinase · · Score: 1

    Does it say anything about handling internal repeats this way as well?

    --
    I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
  42. We need to find new approaches ... by Ihlosi · · Score: 1
    ... to store all this data.

    I suggest storing it in molecular form as pairs of four different bases (guanine, thymine, adenine, cytosine) combined in an aesthetically pleasing, double helical molecule!

    1. Re:We need to find new approaches ... by byrtolet · · Score: 1

      If it is easy to sequence, then since DNA is easy to replicate, this is indeed the best storage!

  43. Miles of DVDs? by sanman2 · · Score: 2

    Can't we have more meaningful units?

    How many Libraries of Congress is that?

    1. Re:Miles of DVDs? by sarysa · · Score: 1

      IEEE wrote that? Watch out...if Jesse Ventura runs for president, the prophecy may be fulfilled...

      --
      Charisma is the measure of someone's ability to lie with a straight face.
  44. Perspective? by jbmartin6 · · Score: 1

    To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall.

    This doesn't put it into perspective at all. What is a DVD and why would I put data on it?

    --
    This posting is provided 'AS IS' without warranty of any kind, implied or otherwise.
    1. Re:Perspective? by N0Man74 · · Score: 1

      For God's sake! Give it to me in a useful unit that is normally provided by journalists... Give it to me in Library of Congresses!

  45. Re:Simple. Get the NSA to do it. by DoctorBonzo · · Score: 1

    Whoever modded this "funny" isn't paranoid enough.

  46. Hmmm, shitton... by DoctorBonzo · · Score: 1

    pronounced shi-TAWN ?

  47. Why Bother by morgauxo · · Score: 1

    It's not like they can use the data, it all is or soon will be patented! Even the patent holders are SOL because anything their bit of patented gene interacts with is patented by someone else. What a lovely system we have!

  48. In the future... by stkpogo · · Score: 1

    In the future, DVDs will be made much thinner and won't stack up as high.

  49. Store it in a double helix by Anonymous Coward · · Score: 0

    I believe that is a very efficient storage mechanism

  50. We have the means! by juliuszs · · Score: 1

    NSA could get to do something nice. They already know how to deal with mountains of useless data. Scanning for cancer and potential terrorists? Mutations of genes and behaviour? Possibilities are endless.