Sequencing a Human Genome In a Week
blackbearnh writes "The Human Genome Project took 13 years to sequence a single human's genetic information in full. At Washington University's Genome Center, they can now do one in a week. But when you're generating that much data, just keeping track of it can become a major challenge. David Dooling is in charge of managing the massive output of the Center's herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He'll be talking about his work at OSCON, and gave O'Reilly Radar a sense of where the state of the art in genome sequencing is heading. 'Now we can run these instruments. We can generate a lot of data. We can align it to the human reference. We can detect the variants. We can determine which variants exist in one genome versus another genome. Those variants that are cancerous, specific to the cancer genome, we can annotate those and say these are in genes. ... Now the difficulty is following up on all of those and figuring out what they mean for the cancer. ... We know that they exist in the cancer genome, but which ones are drivers and which ones are passengers? ... [F]inding which ones are actually causative is becoming more and more the challenge now.'"
has reverted to sneakernet. They literally bring you an external hard drive.
Also, Illumina will sequence your genome for $48,000.
Functions that don't do anything, no comments, worst piece of code ever!
I say we fork and refactor the entire project.
Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?
"Eve of Destruction", it's not just for old hippies anymore...
Just store all that data as a chemical compound. Maybe a nucleic acid of some kind? Using two long polymers made of sugars and phosphates? I bet the whole thing could be squeezed into something smaller than the head of a pin!
and 13 years of time, when we could have waited a few more years and got it done in a week, and much, much cheaper. What a waste of time and money that was....
13 years of doubling computer speeds every 18 months would bring this to about 18 days by the time they were done with the first sequence.
So while a staggering improvement, not surprising from a CPU processing standpoint. There could be many other factors involved though.
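For anyone who wants to check that estimate, here is a rough sketch of the arithmetic. It assumes, as the comment does, that sequencing speed simply doubles every 18 months, which is an analogy to CPU speed and not a measured property of sequencers. The function name is just for illustration. The 18-day figure appears to correspond to eight whole doublings; counting the fractional doubling as well gives a somewhat smaller number.

```python
# Back-of-envelope check of the doubling estimate above.
# Assumption: throughput doubles every 18 months, nothing else changes.
PROJECT_YEARS = 13
DOUBLING_MONTHS = 18
DAYS_PER_YEAR = 365


def projected_days(doublings: float) -> float:
    """Days needed for a job that originally took PROJECT_YEARS."""
    return PROJECT_YEARS * DAYS_PER_YEAR / 2 ** doublings


exact_doublings = PROJECT_YEARS * 12 / DOUBLING_MONTHS  # about 8.67
whole_doublings = int(exact_doublings)                   # 8

print(f"{whole_doublings} whole doublings: "
      f"{projected_days(whole_doublings):.1f} days")
print(f"{exact_doublings:.2f} doublings: "
      f"{projected_days(exact_doublings):.1f} days")
```

Either way the result is the same order of magnitude as "a week", which is the point the comment is making.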
We pissed away $3 billion dollars and 13 years of time, when we could have waited a few more years and got it done in a week, and much, much cheaper. What a waste of time and money that was....
I know I'm being trolled, but you're an idiot. It's pretty obvious that the ability to sequence the genome in a week could only result from techniques developed and information gathered in the original Human Genome project.
I had but a simple dream, to destroy all humans.
Data handling and analysis is becoming a big problem for biologists generally. Techniques like microarray (or exon array) analysis can tell you how strongly a set of genes (tens of thousands, with hundreds of thousands of splice variants) are being expressed under given conditions. But actually handling this data is a nightmare, especially as a lot of biologists ended up there because they love science but aren't great at maths. Given a list of thousands of genes, teasing out the statistically significantly different genes from the noise is only the first step. Then you have to decide what's biologically important (e.g. what's the prime mover and what's just a side-effect), and then you have a list of genes which might have known functions but more likely have just a name or even a tag like "hypothetical ORF #3261", for genes that are predicted by analysis of the genome but have never been proved to actually be expressed. After this, there's the further complication that these techniques only tell you what's going on at the DNA or RNA level. The vast majority of genes only have effects when translated into protein and, perhaps, further modified, meaning that you can't be sure that the levels you're detecting by the sequencing (DNA level) or expression analysis chips (RNA level) actually reflect what's going on in the cell.
One of the big problems studying expression patterns in cancer specifically is the paucity of samples. The genetic differences between individuals (and tissues within individuals) mean there's a lot of noise underlying the "signal" of the putative cancer signatures. This is especially true because there are usually several genetic pathways that a given tissue can take to becoming cancerous: you might only need mutations in a small subset of a long list of genes, which is difficult to spot by sheer data mining. While cancer is very common, each type of cancer is much less so; therefore the paucity of available samples of a given cancer type in a given stage makes reaching statistical significance very difficult. There are some huge projects underway at the moment to collate all cancer labs' samples for meta-analysis, dramatically increasing the statistical power of the studies. A good example of this is the Pancreas Expression Database, which some pancreatic cancer researchers are getting very excited about.
Illumina's Solexa sequencing produces around 7 TB of data per genome sequencing. It's a feat just to move the data around, let alone analyze it. It's amazing how far sequencing technology has come, but how little our knowledge of biology as a whole has advanced. 'The Cancer Genome' does not exist. No tumor is the same and in cancer, especially solid tumors, no two cells are the same. Sequencing a gemisch of cells from a tumor only gives you the average, which may or may not give any pertinent information about the tumor. Vogelstein's group has shown this quite convincingly but hardly anyone truly looks at what the data really says.
That is not all.
Functions that don't do anything, no comments, worst piece of code ever!
Most of it doesn't code for proteins or any of the other things that have been reverse-engineered so far. How do you know it's NOT comments?
(And if terrestrial life was engineered and it IS comments, do they qualify as "holy writ"?)
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
...we've now learned that the epigenome controls which parts of the genome manifest themselves.
http://en.wikipedia.org/wiki/Epigenetics
You have to be very careful about what findings at different levels actually mean, and how the various levels correlate.
For example, when looking at duplications/expansions in cancer, an expansion of a locus results in about a 50% correlation between DNA level change and expression level change. Protein and gene expression levels correlate 50 to 60% of the time (or less, depending on whose data you look at). So, being gracious and assuming a 60% correlation at the two levels, you are already below a 40% correlation. Add in post-translational modifications, sub-cellular localization and the requirement for other players within a functional pathway to exhibit a specific behavior, and what you have is a tangled mess from which you can spin almost any story about a favorite gene. But does it have meaning for diagnosis and treatment? I'd definitely hedge my bet.
It's that big of a mess, and that isn't even considering the vast population heterogeneity of each tumor.
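The "already below 40%" step above can be reproduced with a small sketch. It assumes the percentages can be read as independent agreement rates that multiply from one level to the next; that is a simplification made for illustration, since real correlation coefficients do not chain this way in general. The function and variable names are made up for the example.

```python
# Rough sketch: how agreement decays across DNA -> RNA -> protein.
# Assumption: each level "agrees" with the previous one independently,
# so the rates quoted in the comment simply multiply.
def chained_agreement(*rates: float) -> float:
    """Multiply per-level agreement rates into an end-to-end rate."""
    result = 1.0
    for rate in rates:
        result *= rate
    return result


# Rates quoted in the comment above.
quoted = chained_agreement(0.50, 0.60)    # DNA->RNA 50%, RNA->protein 60%
generous = chained_agreement(0.60, 0.60)  # "being gracious": 60% at both

print(f"quoted rates:   {quoted:.0%}")
print(f"generous rates: {generous:.0%}")
```

Each extra layer (modification, localization, pathway partners) would multiply in another factor below 1, which is why the end-to-end link gets weak so quickly under this reading.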
A good example of this is the Pancreas Expression Database, which some pancreatic cancer researchers are getting very excited about.
Kim Jong-il will be ecstatic to hear that. Dear Leader can't very well put the Grim Reaper into political prison....
Four bases and not much in between.
The human genome is approximately 3.4 billion base pairs long. There are four bases, so this would correspond to 2 bits of information per base. 2 * 3,400,000,000 /8 /1024 /1024 = 810.6 MiB of data per sequence. That doesn't seem like it'd be too difficult. With a little compression it'd fit on a CD. Now, I suppose each section is sequenced multiple times and you'd want some parity, but it still seems like something that'd easily fit on a DVD (especially if alternate sequences are all diff'd from the first). Perhaps throw in another disc for pre-computed analysis results and that ought to be it.
So, what's going on here? Are the file formats used to store this data *that* bloated? Or are they trying to include structural information beyond sequence? What am I missing that makes this an unwieldy amount of data?
(I have to laugh at how Vista is apparently 20 times more complex than the people that use it...)
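The 2-bits-per-base arithmetic above can be written out as a minimal sketch. It assumes a sequence made only of A, C, G and T (no ambiguity codes such as N, no quality scores), and the helper names are invented for the example. It covers the finished sequence only; as other comments in this thread describe, the bulk of the data is intermediate images and redundant reads.

```python
# Minimal 2-bit packing of a DNA string, four bases per byte.
# Assumption: input contains only A, C, G, T.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack bases into bytes, lowest two bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 3]
        for i in range(length)
    )


def packed_mib(n_bases: int) -> float:
    """Size in MiB of n_bases at 2 bits per base."""
    return n_bases * 2 / 8 / 1024 / 1024


seq = "GATTACAGATTACA"
assert unpack(pack(seq), len(seq)) == seq
print(f"{packed_mib(3_400_000_000):.1f} MiB")  # genome size from the comment
```

The length has to be stored separately because the last byte may be only partly used.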
The vast majority of genes only have effects when translated into protein
That depends on your definition. If you define a gene as "stretch of DNA that is translated into protein," which until fairly recently was the going definition, then of course your statement is tautologically true (replacing "the vast majority of" with "all.") But if you define it as "a stretch of DNA that does something biologically interesting," then it's no longer at all clear. Given the number of regulatory elements not directly associated with genes, sections of DNA that code for RNAzymes, etc., it may well be that the majority of "genes" are not protein-coding at all. Going back to the Mendelian definition of a gene as a unit of inheritance, this looks more and more likely.
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Next-gen sequencing eats up huge amounts of space. Every run on our Illumina Genome Analyzer II machine takes up 4 terabytes of intermediate data, most of which comes from something like 100,000+ 20 MB bitmap picture files taken from the flowcells. That much data is an ass load of work to process. Just today I got a little lazy with my Perl programming and let the program go unsupervised...and it ate up 32 GB of RAM and froze up the server. It took Red Hat 3 full hours to decide it had had enough of the swapping and kill the process.
For people not familiar with current-generation sequencing machines: they produce reads of 30-80 bp and use alignment programs to match the reads against species databases. The reaction/imaging takes 2 days, prep takes about a week, processing images takes another 2 days, and alignment takes about 4. The Illumina machine achieves higher throughput than the ABI ones but gives shorter reads; we get about 4 billion nt per run if we do everything right. Keep in mind, though, that the 4 billion that they mention in the summary is misleading: the read coverage distribution is not uniform (i.e. you do not cover every nucleotide of the human's 3 billion nt genome). To ensure 95%+ coverage, you'd have to use 20-40 runs on the Illumina machine...in other words, about 6-10 months of non-stop work to get a reasonable degree of coverage over the entire human genome (at which point you can use programs to "assemble" the reads into a contiguous genome). WashU is very wealthy so they have quite a few of these machines available to work at any given time.
The main problem these days is that processing that much data requires a huge amount of computer know-how (writing software, algorithms, installing software, using other people's poorly documented programs) and a good understanding of statistics and algorithms, especially when it comes to efficiency. Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly indicating that the first 1/3 of all our reads was absolute crap (usually only the last few bases are unreliable); it turned out our slight modification of the Illumina protocol, made to tailor it to studying epigenomic effects, had quite large effects on the sequencing reactions later on. Even for good reads, a lot of the bases can be suspect, so you have to do a huge amount of averaging, filtering, and statistical analysis to make sure your results/graphs are accurate.
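To put the coverage point in perspective, here is a sketch of the idealized textbook model, in which reads land uniformly at random and the number of reads over any position is Poisson-distributed with mean equal to the average depth. This is an assumption for illustration, not how the commenter's lab computes coverage. The genome size and per-run yield are the figures quoted above; the function names are made up.

```python
import math

# Idealized coverage model: reads land uniformly at random, so the
# number of reads over a position is Poisson with mean = average depth.
GENOME_NT = 3_000_000_000    # genome size quoted in the comment
NT_PER_RUN = 4_000_000_000   # best-case yield per run quoted above


def mean_depth(runs: float) -> float:
    """Average number of reads covering a position after `runs` runs."""
    return runs * NT_PER_RUN / GENOME_NT


def fraction_at_depth(runs: float, min_depth: int = 1) -> float:
    """Fraction of positions covered by at least `min_depth` reads."""
    c = mean_depth(runs)
    below = sum(
        math.exp(-c) * c ** k / math.factorial(k) for k in range(min_depth)
    )
    return 1.0 - below


for runs in (1, 3, 10, 20):
    print(
        f"{runs:>2} runs: depth {mean_depth(runs):5.1f}x, "
        f">=1 read {fraction_at_depth(runs):.3f}, "
        f">=10 reads {fraction_at_depth(runs, 10):.3f}"
    )
```

Even in this best case, a single run leaves a large share of the genome untouched. Real coverage is non-uniform and individual bases are error-prone, as described above, so needing each position covered many times over pushes the number of runs well beyond what the uniform model suggests.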
Finding which ones are actually causative is becoming more and more the challenge now
Ah, the simple magic bullet solution. Better tell Mother Nature that there will be bullets with her name on them.
Y'know what I think would be interesting? If, when this becomes possible, the gov't offered it so anyone can go and drop off a cheek-swab, be handed an unrecorded number on a slip of paper, and can check back or log in and get their results. No ID required, no idea who you are, just a swab and a number. That would, it would seem to me, avoid all that pesky gov't knowing your genome thing.
Thoughts?
I suppose it would be useful for studies to have some data on the person, age, sex, race perhaps, so broad studies could be done, but yea, why can't they just do a swab-number-off-you-go kind of thing?
Well, how about pollution, processed food, and all that trash being the main reason we get cancer?
Cancer was not even a known disease, a century ago, because nobody had it. (And if people get cancer now, way before the average age of death a century ago, then it can't be that it is because we now get older.)
But I guess there is no money in that. Right?
Any sufficiently advanced intelligence is indistinguishable from stupidity.
...used against me for anything without violating the DMCA. The act of decoding it by some forensics lab paternity test or future insurance company medical cost profile would become unlawful and I'm sure the RIAA would help me with the cost of prosecuting the lawsuit.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Check the vending machines !
Squirrel!
First, kinds of cancers were known to exist a century ago. Tumors and growths were not unheard of. Most childhood cancers killed quickly and were undiagnosed as a specific disease other than "wasting away". When the average lifespan was 30-40 years, a great many other cancers were not present because people didn't live long enough to die from them.
As we cure "other" diseases, cancers become more likely causes of death. Cells fail to divide perfectly; some may go cancerous, others simply don't produce as healthy a replacement specialized cell. Your arteries harden, muscles don't repair as well, other tissues don't work as well (you get weaker, more wrinkled, easier to fall ill). Eventually either something fails that can't be repaired or enough cells go cancerous. Until we either figure out how to replace the body (seems unlikely as the brain and body are more tied together than sf movies like to present) or we figure out how to make cells repair/refresh themselves without shortening their telomeres -- I have no idea how likely that actually is.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Very good points. I think I've been using a very sloppy definition of gene, just a vague idea that it's only DNA>RNA>protein>action or DNA>RNA>action. I've never really got deeply into thinking about regulatory elements, etc. It's compounded by the fact that, while I'm interested in cancer, most of my actual work is with a DNA-based virus that only produces a very few non-translated RNAs that we're aware of. I have a tough time convincing some people that even those are biologically relevant.
I sometimes think that RNAs and various epigenetic factors (I'm including DNA secondary and tertiary structures here) fall into the same trap as a lot of post-translational protein modifications: They're hard to study so not much is written or understood about them, so most non-specialists basically ignore them and decide they can't be too important. It's changing now as techniques evolve to do the experiments, but I'm still shocked how often I see someone basically say "well we don't understand this so we'll assume it's not affecting our system".
The human genome project cost a lot of money because the technology was being developed at the same time and the genome was unknown. What WashU is doing (as clearly stated in the article) is resequencing human genomes. They are sequencing from a genome, collecting data from a given genome, but the genome is produced by aligning short reads against the human genome reference, not by creating another reference and comparing it. As of today, a single new human genome sequenced to the accuracy and completeness of the original would still cost 60-80 million dollars. You could potentially cut the cost by 30% using 454, but you would also sacrifice some serious accuracy. Oh, and the cost is so low because we already know how to do it and have the investment in the technological infrastructure :) Also, if you started from scratch it would cost about 30M in equipment to accomplish this task in 1 year (okay, so it's probably not actually doable in a year, so say 15M for equipment to complete the sequencing in 2 years), which brings the cost to around 100M for an equivalent de novo human genome sequence. The original project doesn't seem like such a bad deal, does it?
See, I think we need articles like this on Slashdot (or at least the comment turned into an article). Sometimes just random 'not in the spotlight' news articles.
just my two cents.
Splice too much of that bad, useless, convoluted code into a "new" human and we might end up with a G-Gnome or GNOME (Gratuitous, Nascent, Ogreous, Mechanised Entity). Call it... "G-UNIT", and give it a uniform and a mission. Or, give it a script and a part and call it Smeegul/Smigel...
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
While a single human genome is a lot of information, storing thousands shouldn't add much to the requirements; one can simply store a diff from the first.
[]'s Victor Bogado da Silva Lins
^[:wq
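The store-a-diff idea from the comment above can be sketched in a few lines. This toy version assumes the sample and the reference have the same length and differ only by single-base substitutions; real genomes also have insertions, deletions and rearrangements, which need a richer format. The function names are invented for the example.

```python
# Toy diff-based storage: keep one reference, store only the positions
# where a sample differs. Assumption: equal length, substitutions only.
def diff(reference: str, sample: str) -> dict[int, str]:
    """Map each differing position to the sample's base there."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return {
        pos: s_base
        for pos, (r_base, s_base) in enumerate(zip(reference, sample))
        if r_base != s_base
    }


def apply_diff(reference: str, variants: dict[int, str]) -> str:
    """Rebuild the sample from the reference plus its stored variants."""
    bases = list(reference)
    for pos, base in variants.items():
        bases[pos] = base
    return "".join(bases)


reference = "GATTACAGATTACA"
sample = "GATTACAGCTTACA"

variants = diff(reference, sample)
print(variants)
assert apply_diff(reference, variants) == sample
```

Storage per extra genome then scales with the number of differences from the reference and not with the genome length.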