Slashdot Mirror


Sequencing a Human Genome In a Week

blackbearnh writes "The Human Genome Project took 13 years to sequence a single human's genetic information in full. At Washington University's Genome Center, they can now do one in a week. But when you're generating that much data, just keeping track of it can become a major challenge. David Dooling is in charge of managing the massive output of the Center's herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He'll be talking about his work at OSCON, and gave O'Reilly Radar a sense of where the state of the art in genome sequencing is heading. 'Now we can run these instruments. We can generate a lot of data. We can align it to the human reference. We can detect the variants. We can determine which variants exist in one genome versus another genome. Those variants that are cancerous, specific to the cancer genome, we can annotate those and say these are in genes. ... Now the difficulty is following up on all of those and figuring out what they mean for the cancer. ... We know that they exist in the cancer genome, but which ones are drivers and which ones are passengers? ... [F]inding which ones are actually causative is becoming more and more the challenge now.'"

21 of 101 comments

  1. DNA GATC by sakdoctor · · Score: 4, Funny

    Functions that don't do anything, no comments, worst piece of code ever!

    I say we fork and refactor the entire project.

    1. Re:DNA GATC by RDW · · Score: 4, Interesting

      'I say we fork and refactor the entire project.'

      You mean like this?:

      http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16729053

    2. Re:DNA GATC by K. S. Kyosuke · · Score: 2, Funny

      You thought God can't spell "job security"? Mind you, he's omnipotent!

      --
      Ezekiel 23:20
  2. Here's what I want to know... by HotNeedleOfInquiry · · Score: 2, Insightful

    Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?

    --
    "Eve of Destruction", it's not just for old hippies anymore...
    1. Re:Here's what I want to know... by QuantumG · · Score: 3, Informative

      Typically they sequence every base at least 30 times.

      --
      How we know is more important than what we know.
    2. Re:Here's what I want to know... by blackbearnh · · Score: 3, Informative

      I wondered the same thing, so I asked. From the article: "And between two cells, one cell right next to the other, they should be identical copies of each other. But sometimes mistakes are made in the process of copying the DNA. And so some differences may exist. However, we're not at present sequencing single cells. We'll collect a host of cells and isolate the DNA from a host of cells. So what you end up with when you read the sequence out on these things is, essentially, an average of this DNA sequence. Well, I mean it's digital in that eventually you get down to a single piece of DNA. But once you align these things back, if you see 30 reads that all align to the same region of the genome and only one of them has an A at the position and all of the others have a T at that position, you can't say whether that A was actually some small change between one cell and its 99 closest neighbors or whether that was just an error in the sequencing. So it's hard to say cell-to-cell how much difference there is. But, of course, that difference does exist, otherwise that's mutation and that's what eventually leads to cancer and other diseases."
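
      The "30 reads, 29 say T and one says A" situation can be sketched in a few lines of Python. This is only an illustration: `call_base` and the 0.8 threshold are made-up names and numbers, not what the Genome Center's pipeline does, and a real caller would also weigh per-base quality scores.

```python
from collections import Counter


def call_base(pileup, min_fraction=0.8):
    """Majority-vote one position from the bases of all reads covering it.

    Returns (base, fraction). base is None when no single base reaches
    min_fraction, i.e. the position is too ambiguous to call.
    """
    counts = Counter(pileup)
    base, n = counts.most_common(1)[0]
    fraction = n / len(pileup)
    return (base if fraction >= min_fraction else None), fraction


# 30 reads cover this position: 29 report T, one reports A.
pileup = "T" * 29 + "A"
base, fraction = call_base(pileup)
print(base, round(fraction, 3))  # T 0.967
```

      The vote settles on T, but as the quote says, it cannot tell you whether the lone A was a sequencing error or a real difference in one cell.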

    3. Re:Here's what I want to know... by K. S. Kyosuke · · Score: 2, Interesting

      "Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?"

      Not necessarily. ;-)

      --
      Ezekiel 23:20
  3. How about storing it in analog format? by Anonymous Coward · · Score: 5, Funny

    Just store all that data as a chemical compound. Maybe a nucleic acid of some kind? Using two long polymers made of sugars and phosphates? I bet the whole thing could be squeezed into something smaller than the head of a pin!

  4. Money well spent by momerath2003 · · Score: 4, Insightful

    "We pissed away $3 billion dollars and 13 years of time, when we could have waited a few more years and got it done in a week, and much, much cheaper. What a waste of time and money that was...."

    I know I'm being trolled, but you're an idiot. It's pretty obvious that the ability to sequence the genome in a week could only result from techniques developed and information gathered in the original Human Genome project.

    --
    I had but a simple dream, to destroy all humans.
  5. Re:Moore's law at work? by blackbearnh · · Score: 3, Interesting

    It wasn't the computing power that was the holdup, it was the sequencing throughput. Also, as noted in the article, they can do it in a week now partially because they have the completed human genome to use as a template to match things up against. As I analogized in the interview, it's like the difference between putting together a jigsaw puzzle with the cover image available, and doing one without.

  6. Re:We pissed away $3 billion dollars by QuantumG · · Score: 4, Insightful

    What's funny is that there are actually people who think like that. Apparently if we just sit around and wait, things will get better. I call this the dark side of the "invisible hand" of the market: because it is invisible, people forget how it comes about. In order to get improvement in technology you need a market for that technology. And, typically, you need some loss-leader to create the market in the first place. Government funding serves this purpose well.

    --
    How we know is more important than what we know.
  7. Data analysis a rapidly growing problem in Biology by SlashBugs · · Score: 5, Informative

    Data handling and analysis is becoming a big problem for biologists generally. Techniques like microarray (or exon array) analysis can tell you how strongly a set of genes (tens of thousands, with hundreds of thousands of splice variants) are being expressed under given conditions. But actually handling this data is a nightmare, especially as a lot of biologists ended up there because they love science but aren't great at maths. Given a list of thousands of genes, teasing out the statistically significantly different genes from the noise is only the first step. Then you have to decide what's biologically important (e.g. what's the prime mover and what's just a side-effect), and then you have a list of genes which might have known functions but more likely have just a name or even a tag like "hypothetical ORF #3261", for genes that are predicted by analysis of the genome but have never been proved to actually be expressed. After this, there's the further complication that these techniques only tell you what's going on at the DNA or RNA level. The vast majority of genes only have effects when translated into protein and, perhaps, further modified, meaning that you can't be sure that the levels you're detecting by the sequencing (DNA level) or expression analysis chips (RNA level) actually reflect what's going on in the cell.

    One of the big problems studying expression patterns in cancer specifically is the paucity of samples. The genetic differences between individuals (and tissues within individuals) mean there's a lot of noise underlying the "signal" of the putative cancer signatures. This is especially true because there are usually several genetic pathways that a given tissue can take to becoming cancerous: you might only need mutations in a small subset of a long list of genes, which is difficult to spot by sheer data mining. While cancer is very common, each type of cancer is much less so; therefore the paucity of available samples of a given cancer type in a given stage makes reaching statistical significance very difficult. There are some huge projects underway at the moment to collate all cancer labs' samples for meta-analysis, dramatically increasing the statistical power of the studies. A good example of this is the Pancreas Expression Database, which some pancreatic cancer researchers are getting very excited about.
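
    For the "first step" mentioned above (picking the significantly different genes out of thousands of tests), one widely used approach is to control the false discovery rate. Below is a minimal Python sketch of the Benjamini-Hochberg procedure. It is my choice of technique for illustration, not something named in the comment, and the p-values are invented.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of tests kept at false discovery rate alpha.

    Sort the p-values, find the largest rank k (1-based) such that
    p_(k) <= (k / m) * alpha, and keep the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])


# Invented p-values for ten genes; only the first two survive correction.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))  # [0, 1]
```

    Note that five of the ten raw p-values are below 0.05, but only two survive the correction, which is the kind of noise filtering the comment is describing.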

  8. Buttload of data by virgil Lante · · Score: 2, Interesting

    Illumina's Solexa sequencing produces around 7 TB of data per genome sequencing. It's a feat just to move the data around, let alone analyze it. It's amazing how far sequencing technology has come, but how little our knowledge of biology as a whole has advanced. 'The Cancer Genome' does not exist. No tumor is the same and in cancer, especially solid tumors, no two cells are the same. Sequencing a mixture of cells from a tumor only gives you the average, which may or may not give any pertinent information about the tumor. Vogelstein's group has shown this quite convincingly but hardly anyone truly looks at what the data really says.

  9. DNA is digital by EndoplasmicRidiculus · · Score: 2, Informative

    Four bases and not much in between.

  10. Humans have ~810.6 MiB of DNA by izomiac · · Score: 2, Interesting

    The human genome is approximately 3.4 billion base pairs long. There are four bases, so this would correspond to 2 bits of information per base. 2 * 3,400,000,000 /8 /1024 /1024 = 810.6 MiB of data per sequence. That doesn't seem like it'd be too difficult. With a little compression it'd fit on a CD. Now, I suppose each section is sequenced multiple times and you'd want some parity, but it still seems like something that'd easily fit on a DVD (especially if alternate sequences are all diff'd from the first). Perhaps throw in another disc for pre-computed analysis results and that ought to be it.

    So, what's going on here? Are the file formats used to store this data *that* bloated? Or are they trying to include structural information beyond sequence? What am I missing that makes this an unwieldy amount of data?

    (I have to laugh at how Vista is apparently 20 times more complex than the people that use it...)
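
    The 2-bits-per-base arithmetic above can be checked with a small Python sketch. The `pack`/`unpack` helpers are illustrative names only, and the sketch assumes a clean A/C/G/T sequence (no N calls, no quality scores), which is exactly the part real sequencer output does not satisfy.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_mib(n_bases):
    """Size in MiB of n_bases stored at 2 bits each."""
    return n_bases * 2 / 8 / 1024 / 1024


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))            # 2 GATTACA
print(round(packed_size_mib(3_400_000_000), 1))  # 810.6
```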

    1. Re:Humans have ~810.6 MiB of DNA by izomiac · · Score: 2, Interesting

      Interesting, I was assuming that it was more of the former method since I hadn't studied the latter. Correct me if I'm wrong, but as I remember it that method involves supplying only one type of fluorescently labeled nucleotide at a time during in vitro DNA replication and measuring the intensity of flashes as nucleotides are added (e.g. brighter flash means two bases were added, even brighter if it's three, etc.). Keeping track of four sensors at 200 bytes per base would imply sensors that could detect 133 levels of brightness or 8 measurements per base at 16 levels of brightness. That seems like a lot higher resolution than the example data sheets I've seen, but maybe that's what current technology can do. Still though, most bases are fairly unambiguous so the bulk of the sequence could likely be stored as results only.

      The new method sounds like they're doing a microarray or something and just storing high resolution jpegs. I could see why that would require oodles of image processing power. It does seem like an odd storage format for what's essentially linear data.

      I suppose my point is more that they're storing a lot of useless information. I could see storing a ton of info about a sequence back when graduate students were adding nucleotides and interpreting graphs by hand, but in this day and age you'd just redundantly sequence until you got to the desired accuracy. I couldn't imagine that it'd be cheaper to have technicians manually tweak the entire sequence.

      BTW, I'm not arguing against you, more against some of the design decisions of automated sequencers. You clearly know a lot more about the subject than my undergrad degree allows me to even think about refuting.

  11. Re:Passing this data back to the scientist by goombah99 · · Score: 2, Interesting

    a whole human genome will fit on a CD.

    if you just transmit the diffs from the generic human you could put it in an e-mail
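
    A toy Python version of "just transmit the diffs", assuming the sample is already aligned to the reference and differs only by single-base substitutions. Real genomes also have insertions, deletions and rearrangements, which this sketch deliberately refuses to handle. The function names and sequences are made up.

```python
def diff_against_reference(reference, sample):
    """List (position, ref_base, sample_base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles substitutions, not indels")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


def apply_diff(reference, diff):
    """Rebuild the sample sequence from the reference plus its diff."""
    seq = list(reference)
    for pos, _ref_base, alt in diff:
        seq[pos] = alt
    return "".join(seq)


reference = "GATTACAGATTACA"
sample = "GATTACAGCTTACA"
diff = diff_against_reference(reference, sample)
print(diff)                                   # [(8, 'A', 'C')]
print(apply_diff(reference, diff) == sample)  # True
```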

    --
    Some drink at the fountain of knowledge. Others just gargle.
  12. Re:Passing this data back to the scientist by goombah99 · · Score: 3, Insightful

    I suppose it's worth noting that the intermediate (raw) data sets can get pretty large. They are actually getting larger as the trend goes towards shorter, less informative "reads" that require more of them to recover the connective information and to recover from errors and duplications. However, that's a trend that has a stopping point. While more reads is better, at some point there is almost no added value from more reads. So at that point that's the maximum amount of data you need to collect. It won't ever increase. Meanwhile hard drive and network speeds will go up by factors of ten.

    Thus the storage issues here are well tolerated at present and soon will become trivial.

    --
    Some drink at the fountain of knowledge. Others just gargle.
  13. I also manage a Next-gen Sequencing Machine by Anonymous Coward · · Score: 3, Interesting

    Next-gen sequencing eats up huge amounts of space. Every run on our Illumina Genome Analyzer II machine takes up 4 terabytes of intermediate data, most of which comes from something like 100,000+ 20 MB bitmap picture files taken from the flowcells. That much data is an ass load of work to process. Just today I got a little lazy with my Perl programming and let the program go unsupervised... and it ate up 32 GB of RAM and froze up the server. It took Red Hat 3 full hours to decide it had had enough of the swapping and kill the process.

    For people not familiar with current generation sequencing machines, they produce reads of 30-80 bp and use alignment programs to match up the reads to species databases. The reaction/imaging takes 2 days, prep takes about a week, processing images takes another 2 days, alignment takes about 4. The Illumina machine achieves higher throughput than the ABI ones but gives shorter reads; we get about 4 billion nt per run if we do everything right. Keep in mind though, that 4 billion that they mention in the summary is misleading: the read coverage distribution is not uniform (i.e. you do not cover every nucleotide of the human's 3 billion nt genome). To ensure 95%+ coverage, you'd have to use 20-40 runs on the Illumina machine... in other words, about 6-10 months of non-stop work to get a reasonable degree of coverage over the entire human genome (at which point you can use programs to "assemble" the reads in a contiguous genome). WashU is very wealthy so they have quite a few of these machines available to work at any given time.
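
    To see why one 4 billion nt run does not cover a 3 billion nt genome, here is a Python sketch of the textbook idealized model (reads landing uniformly at random, Poisson approximation). The function names are made up. It gives a best-case floor only: the non-uniform coverage, errors and depth requirements described above are why the practical figure is so much higher than what this model predicts.

```python
import math


def expected_fraction_covered(total_bases, genome_size):
    """Expected fraction of the genome hit at least once.

    Idealized model: with mean depth c = total_bases / genome_size and
    uniformly random reads, a base is missed with probability exp(-c).
    """
    depth = total_bases / genome_size
    return 1.0 - math.exp(-depth)


def runs_for_coverage(target_fraction, bases_per_run, genome_size):
    """Minimum number of runs to reach target_fraction in the same model."""
    depth_needed = -math.log(1.0 - target_fraction)
    return math.ceil(depth_needed * genome_size / bases_per_run)


one_run = expected_fraction_covered(4e9, 3e9)
print(round(one_run, 3))                   # 0.736
print(runs_for_coverage(0.95, 4e9, 3e9))   # 3 (best case, at 1x depth only)
```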

    The main problem these days is that processing that much data requires a huge amount of computer know-how (writing software, algorithms, installing software, using other people's poorly documented programs), and a good understanding of statistics and algorithms, especially when it comes to efficiency. Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly that indicated the first 1/3 of all our reads was absolute crap (usually only the last few bases are unreliable); it turned out our slight modification of the Illumina protocol to tailor it to studying epigenomic effects had quite large effects on the sequencing reactions later on. Even for good reads, a lot of the bases can be suspect, so you have to do a huge amount of averaging, filtering, and statistical analysis to make sure your results/graphs are accurate.

  14. Re:Moore's law at work? by cbailster · · Score: 2, Informative

    Fingerprinting doesn't rely on DNA sequencing, but does rely on the DNA sequence being different between people. Everyone's DNA contains subtle differences (particularly in the non-coding DNA regions). These differences can be exploited by various laboratory techniques to produce small pieces of DNA which will be of different sizes because of these differences. When these fragments of DNA are run down a suitable gel (usually agarose, a substance derived from seaweed) under an electric current the fragments will separate by size. The pattern of fragments formed will be unique for each individual.

    Several fingerprinting techniques rely on what most programmers would best recognise as regular expression matching. For example there are enzymes in biology which will recognise certain DNA sequences but not others, and will cut the DNA in two wherever this sequence is matched. (in Perl:

    my @dna_fragments = split /GAATTC/, $my_dna;

    is the equivalent of what an enzyme called EcoRI does). Not everyone will have the same numbers of this sequence in their DNA, and nor will they be in the same place, thus the number and size of fragments will differ. By using a suitable range of such enzymes you can generate a pattern of DNA fragments which is sufficiently unique as to identify a single person amongst a population of several billion.

    For more information, take a look at DNA Profiling on Wikipedia.
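
    A slightly more literal Python take on the EcoRI digest described above. One difference from the Perl `split`: `split` throws away the matched GAATTC, whereas the enzyme cuts inside the site (after the G), so those bases stay on the fragments. The sketch works on a single strand only, and the two "people" are invented sequences.

```python
import re


def ecori_digest(dna):
    """Cut a single-strand sequence after the G of every GAATTC site."""
    cut_points = [m.start() + 1 for m in re.finditer("GAATTC", dna)]
    fragments, start = [], 0
    for cut in cut_points:
        fragments.append(dna[start:cut])
        start = cut
    fragments.append(dna[start:])
    return fragments


# Two invented individuals: person_b has one base changed in the second site.
person_a = "CCGAATTCTTTTGAATTCAA"
person_b = "CCGAATTCTTTTGAGTTCAA"

print([len(f) for f in ecori_digest(person_a)])  # [3, 10, 7]
print([len(f) for f in ecori_digest(person_b)])  # [3, 17]
```

    The differing lists of fragment lengths are what would show up as different band patterns on the gel.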

  15. wow. read a book. by CFD339 · · Score: 2, Insightful

    First, kinds of cancers were known to exist a century ago. Tumors and growths were not unheard of. Most childhood cancers killed quickly and were undiagnosed as specific disease other than "wasting away". When the average lifespan was 30-40 years, a great many other cancers were not present because people didn't live long enough to die from them.

    As we cure "other" diseases, cancers become more likely causes of death. Cells fail to divide perfectly; some may go cancerous, others simply don't produce as healthy a replacement specialized cell. Your arteries harden, muscles don't repair as well, other tissues don't work as well (you get weaker, more wrinkled, easier to fall ill). Eventually either something fails that can't be repaired or enough cells go cancerous. Until we either figure out how to replace the body (seems unlikely as the brain and body are more tied together than sf movies like to present) or we figure out how to make cells repair/refresh themselves without shortening their telomeres -- I have no idea how likely that actually is.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln