Sequencing a Human Genome In a Week

DNA GATC by sakdoctor · 2009-07-13 11:41 · Score: 4, Funny

Functions that don't do anything, no comments, worst piece of code ever!

I say we fork and refactor the entire project.

Re:DNA GATC by RDW · 2009-07-13 11:55 · Score: 4, Interesting

'I say we fork and refactor the entire project.'
You mean like this?:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16729053
Re:DNA GATC by ocularDeathRay · 2009-07-13 12:43 · Score: 1

I would like to announce publicly that my genome is released under the GPL

--
Obama is a twitter sock puppet
Re:DNA GATC by interkin3tic · 2009-07-13 13:40 · Score: 1

At least it's backed up well. 3 backups of almost everything ain't bad.
Two strands on each chromosome... I'm probably in the wrong crowd of nerds...
Re:DNA GATC by K.+S.+Kyosuke · 2009-07-13 16:37 · Score: 2, Funny

You thought God can't spell "job security"? Mind you, he's omnipotent!

--
Ezekiel 23:20
Re:DNA GATC by ComaVN · 2009-07-13 19:48 · Score: 1

I would like to announce publicly that my genome is released under the GPL
So you'll only allow your children to mate with other GPL'ed people?

--
Be wary of any facts that confirm your opinion.
Re:DNA GATC by Hurricane78 · 2009-07-13 22:38 · Score: 1

Actually it's just the arrogance of some scientists, who later found out that all those parts that seemingly did not do anything were in fact just as relevant. Just in a different way. Whoops!

--
Any sufficiently advanced intelligence is indistinguishable from stupidity.
Re:DNA GATC by darthpenguin · 2009-07-14 02:34 · Score: 1

Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly that indicated the first 1/3 of all our reads was absolute crap (usually only the last few bases are unreliable); it turned out our slight modification of the Illumina protocol to tailor it to studying epigenomic effects had quite large effects on the sequencing reactions later on.
Do you have any more details about this? I'm working on solexa sequencing of ChIP DNA with (modified) histone and transcription factor targets. These runs are expensive so it would be nice to avoid problems that someone else has already gone through.
Re:DNA GATC by hakey · 2009-07-14 03:23 · Score: 1

No, it's viral. When his children mate with non-GPL people, their code becomes GPL.
Re:DNA GATC by ioshhdflwuegfh · 2009-07-14 07:28 · Score: 1

You thought God can't spell "job security"? Mind you, he's omnipotent!
and also dead! So go know.

Re:Passing this data back to the scientist by QuantumG · 2009-07-13 11:41 · Score: 1

Illumina will sequence your genome for $48,000.

http://scienceblogs.com/geneticfuture/2009/06/illumina_launches_personal_gen.php

Details.

--
How we know is more important than what we know.

Here's what I want to know... by HotNeedleOfInquiry · 2009-07-13 11:43 · Score: 2, Insightful

Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?

--
"Eve of Destruction", it's not just for old hippies anymore...

Re:Here's what I want to know... by QuantumG · 2009-07-13 11:45 · Score: 3, Informative

Typically they sequence every base at least 30 times.

--
How we know is more important than what we know.
Re:Here's what I want to know... by blackbearnh · 2009-07-13 11:46 · Score: 3, Informative

I wondered the same thing, so I asked. From the article: And between two cells, one cell right next to the other, they should be identical copies of each other. But sometimes mistakes are made in the process of copying the DNA. And so some differences may exist. However, we're not at present sequencing single cells. We'll collect a host of cells and isolate the DNA from a host of cells. So what you end up with when you read the sequence out on these things is, essentially, an average of this DNA sequence. Well, I mean it's digital in that eventually you get down to a single piece of DNA. But once you align these things back, if you see 30 reads that all align to the same region of the genome and only one of them has an A at the position and all of the others have a T at that position, you can't say whether that A was actually some small change between one cell and its 99 closest neighbors or whether that was just an error in the sequencing. So it's hard to say cell-to-cell how much difference there is. But, of course, that difference does exist, otherwise that's mutation and that's what eventually leads to cancer and other diseases.
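The pileup situation in the quote (30 reads over one position, one of them disagreeing) can be sketched as a simple majority vote. This is a toy illustration only, not how any production base or variant caller works; the function name `consensus_base` and the `min_fraction` threshold are made up for the example.

```python
from collections import Counter

def consensus_base(pileup, min_fraction=0.8):
    """Return (base, fraction) for the most common base in a pileup.

    The base is reported as 'N' when no single base reaches
    min_fraction of the reads (an arbitrary, illustrative cutoff).
    """
    if not pileup:
        raise ValueError("empty pileup")
    counts = Counter(pileup)
    base, count = counts.most_common(1)[0]
    fraction = count / len(pileup)
    return (base if fraction >= min_fraction else "N", fraction)

# 30 reads covering one position: 29 say T, one says A.
reads_at_position = "T" * 29 + "A"
base, fraction = consensus_base(reads_at_position)
print(base, round(fraction, 3))
```

The vote reports T, but it cannot say whether the lone A was a sequencing error or a real difference in a minority of cells, which is exactly the ambiguity the quote describes.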
Re:Here's what I want to know... by nodrogluap · 2009-07-13 11:58 · Score: 1

It can get even more complicated too: if you have 10x coverage of a position, and 9 say T while 1 says G, it may be an allelic variation. There's a one in 16 chance this'll happen randomly instead of 5 Ts and 5 Gs as you expect.
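To put a number on the heterozygous case, here is a sketch under a deliberately simple model: each read samples one of the two alleles independently with probability 1/2, and sequencing error is ignored. The model and the function names are assumptions for illustration. Under them, a 9:1 split at 10x comes out at roughly 1 to 2 percent, so the exact figure depends heavily on the model and on which outcomes are counted.

```python
from math import comb

def prob_minor_count(n_reads, k_minor, p=0.5):
    """P(exactly k_minor of n_reads show a given allele), binomial model."""
    return comb(n_reads, k_minor) * p**k_minor * (1 - p) ** (n_reads - k_minor)

def prob_split_at_least_as_lopsided(n_reads, k_minor):
    """P(either allele is seen k_minor times or fewer) at a true 50/50 site.

    Doubling the one-sided tail is only valid while k_minor < n_reads / 2.
    """
    one_sided = sum(prob_minor_count(n_reads, k) for k in range(k_minor + 1))
    return 2 * one_sided

print(prob_minor_count(10, 1))                 # exactly one G among 10 reads
print(prob_split_at_least_as_lopsided(10, 1))  # 9:1 or 10:0, either allele
```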
Re:Here's what I want to know... by K.+S.+Kyosuke · 2009-07-13 12:02 · Score: 2, Interesting

"Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?"
Not necessarily. ;-)

--
Ezekiel 23:20
Re:Here's what I want to know... by damn_registrars · 2009-07-13 12:09 · Score: 1

Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?
They should be. An individual's genome does not change over time. Gene expression can change, which can itself lead to significant problems such as cancer.

--
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
Re:Here's what I want to know... by damn_registrars · 2009-07-13 13:19 · Score: 1

The genome sure as hell changes
Not necessarily. The genome refers specifically to the genes encoded by DNA; mutations can also occur in the non-coding regions. Indeed the non-coding regions are often the most critical for gene expression.

Hence a non-genomic mutation can have a profound effect on gene expression.

lots of mutations happening all the time in probably every cell of our body
Also not necessarily true. For example, a non-dividing cell has no reason to duplicate its own genome, hence it has almost no chance to acquire mutations.

That in turn causes your gene expressions to change since they're also, to a large extent, controlled by the genome.
As I already described, much of gene expression is regulated by non-coding upstream (or sometimes downstream) DNA sequences. Look up eukaryotic gene expression and you'll see how critical non-coding regions are; this is often where transcription factors bind.

Hence your initial claim is based on how you define the genome. As the genome is supposed to be the collection of genes, the assertion of the genome changing rapidly is not necessarily true.

--
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
Re:Here's what I want to know... by maxume · 2009-07-13 13:44 · Score: 1

Except cells do undergo the occasional survivable mutation, and then there are the people that integrated what would have been a twin, and so on.

--
Nerd rage is the funniest rage.
Re:Here's what I want to know... by interkin3tic · 2009-07-13 13:47 · Score: 1

Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same.
You're talking about different individuals? There will be differences, yes, but most of that difference should be in non-coding regions. The actual regions making proteins should be nearly identical. I only work with a few DNA sequences that code for proteins, so that's all I'd be interested in, but there are other applications in medicine for which the variation in non-coding regions would be important.
Re:Here's what I want to know... by Rakishi · 2009-07-13 15:22 · Score: 1

The genome refers specifically to the genes encoded by DNA;

Genome (gene + [chromos]ome) refers to all the genetic material, coding and non-coding, gene or not. It has meant that for the past 90 years, before non-coding regions were even thought of I'm guessing, and last I checked it hasn't been redefined. You have very nicely explained why having it refer to simply coding regions is stupid. If I'm wrong then I'd love to know but you'll need to provide me a reference.
The definition of a gene on the other hand may be changing to include both coding and non-coding regions but that's probably still up in the air.

Also not necessarily true. For example, a non-dividing cell has no reason to duplicate its own genome, hence it has almost no chance to acquire mutations.
I believe there are still mutations; I think they come from failures of DNA repair mechanisms (there is constant genetic damage happening) rather than DNA replication mechanisms.
Re:Here's what I want to know... by Rakishi · 2009-07-13 16:09 · Score: 1

Just to add so we're all clear on definitions, as I understand it we're talking about these regions of dna here:
a) rna/protein coding region
b) transcribed non-coding regions (introns)
c) regulatory region
d) unknown/junk/other non-coding region
I suspect there's some additional stuff I missed but it's been a while since I cared too much about this.
In my molecular biology classes "gene" was used to refer to a, b and c as they relate to a protein. To be honest that someone would think otherwise, aside from specialized research, just seemed beyond silly to me.
The only thing "genome" may not have referred to was d) but recent advances have probably stopped that trend. Difficult to ignore something when you keep finding ways in which it matters and can't be ignored.
There's a large number of mutations happening in all these regions. The rate of mutations surviving is higher in regions b and d due to the lack of impact of those mutations. At the same time you can have quite a few mutations in regions a and c with no or almost no detrimental effect.
Re:Here's what I want to know... by timeOday · 2009-07-13 16:38 · Score: 1

In other words, you don't have "a" genome. What you are is a big bunch of cells, closely enough related that their genomes are very similar.
Re:Here's what I want to know... by bradbury · 2009-07-13 21:25 · Score: 1

I would suggest that you spend some time studying the topic in more detail before you make comments on /.
At all the genome conferences I've been to the "genome" includes everything -- the chromosome number and architecture, the coding, regulatory & non-coding regions (tRNA, rRNA, miRNA/siRNA, telomere length, etc.). But the non-coding, highly variable parts of the genome can be considered part of the "big picture" because the amount of *really* junk DNA may function as a free radical "sink" which protects the critical DNA from more rapid mutation (and thus affects rates of cancer development and aging). The DNA composition and ultrastructure may also affect things like gene expression (DNA unwinding temperature, variable access to genes, etc.). It would be a gross simplification to divide a genome into simply coding vs. non-coding regions.
Books like DNA Replication, DNA Repair and Mutagenesis and Aging of the Genome (~2000+ pages) are good places to start on this topic. Bruce Ames demonstrated in the early '90s that all cells are receiving damage to the DNA (thousands of "hits" per cell per day) and usually repair it successfully. If you knew about the 5+ types of DNA repair (BER, NER, MMR, HR, NHEJ) involving 150+ proteins or had some knowledge of the types of DNA damage which have been discovered in genetic diseases (OMIM) and cancers (Cancer Genome Anatomy Project(s)) you would understand that mutational repair does occur within both coding and noncoding DNA and that such damage is probably the core cause of cancer, aging and arguably many other major causes of death (decline of the immune system, susceptibility to influenza or pneumonia, aging of the blood vessels, heart and other muscles, etc.). The accumulation of mutations *does* happen in non-dividing cells and is cumulative. Karanjawala & Lieber [1] have estimated that each cell of a 70-year-old individual may contain more than 100 mutations in the critical regions of genes. The accumulation of the proper set of "wrong" mutations (5-10) in dividing cells tends to steer towards cancer while mutations in less critical genes, or non-dividing cells, tend to result in general aging.
There are several million differences in the, esp. SNPs, between the genomes of each human, which is what makes each of us (excepting identical twins/triplets) "different". Speciation results when those differences become sufficient to effectively prevent breeding between population groups. However, the same mutation accumulation that drives differences in individuals and evolution among species can also occur within a single individual. There are many more cells in a single human body (which are derived from a single genome) than there are humans on the planet (~3 orders of magnitude difference), and I suspect more than have ever lived on the planet, so it is unlikely that even within a single individual all of the "genomes" are the same. A genome can best be considered an "average" (esp. if the DNA used to produce the sequence was derived from more than a single cell -- the typical case).
1. http://www.ncbi.nlm.nih.gov/pubmed/15272504
Re:Here's what I want to know... by damn_registrars · 2009-07-14 00:36 · Score: 1

I would suggest that you spend some time studying the topic in more detail before you make comments
Starting your response by insulting the other person? I ordinarily wouldn't bother responding, but since you took the time to provide a peer-reviewed article as a reference I'll give you a chance.

The DNA composition and ultrastructure may also affect things like gene expression (DNA unwinding temperature, variable access to genes, etc.).

Actually if you had read what I said earlier about gene regulation you would have seen that I already said that. Non-coding regions are where transcription factors for gene expression often bind.

If you knew about the 5+ types of DNA repair (BER, NER, MMR, HR, NHEJ) involving 150+ proteins or had some knowledge
You could be less arrogant and presumptuous in your statements. You seem to have taken what you wanted to see in my original post and drawn your own conclusions about me without having anything else to back up your assumptions. You don't seem to be following a very rigorous scientific method, here.

The accumulation of mutations *does* happen in non-dividing cells
I never asserted that it does not. You read what I said and somehow took it to say that. Why you made those assumptions I am not sure.

Any cell biologist can tell you that there are significant numbers of non-dividing cells in any higher eukaryote; one of the reasons why the number of dividing cells is kept low in comparison to non-dividing cells is of course to protect the genome of the organism. While higher eukaryotes have error-checking mechanisms in their DNA replication mechanisms, keeping DNA duplication to a minimum is a good way to prevent mistakes from happening in the first place.

There are several million differences in the, esp. SNPs, between the genomes of each human, which is what makes each of us (excepting identical twins/triplets) "different".
Please clarify that sentence, you either forgot a noun after "the", or the phrase "in the" was added erroneously. (I could insert a joke about your sentence structure undergoing mutagenesis but that would be unkind)

Speciation results when those differences become sufficient to effectively prevent breeding between population groups.
Is there some reason why you expected me to disagree with that statement?

However, the same mutation accumulation that drives differences in individuals and evolution among species can also occur within a single individual.
I do, however, disagree with this statement. Individuals do not evolve; populations evolve. If an individual acquires a mutation during life, the likelihood of them passing it on is nearly zero. While their own genes - or more likely gene expression - may change, it will likely not have any net effect on the population.

so it is unlikely that even within a single individual all of the "genomes" are the same
There is an important distinction to be made between sequencing "a human genome" and "the human genome (project)". We know that there are particular regions of "the human genome" that are particularly difficult to sequence with available technology due to a variety of factors. If "a human genome" were to be sequenced today, we would use much of what we know about "the human genome" to work around these difficult areas, accepting very low coverage of some areas in exchange for very high coverage of coding regions. And amongst the coding regions, the likelihood of finding variation within an individual should be quite low as many coding regions encode for critical cell machinery without which a cell would not be viable.

As for the paper you cited, the institution I am working at does not have that year available in print or electronic. I will have to try to find it the next time I am at a larger university. I will wait to comment on how a paper titled "DNA damage and aging" could relate to genomic variation until after I get a chance to look at it.

--
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
Re:Here's what I want to know... by bradbury · 2009-07-14 06:44 · Score: 1

The Non-Homologous End Joining (NHEJ) DNA double strand break repair process can produce mutagenic deletions (and sometimes insertions) in the DNA sequence. Both the Werner's Syndrome (WRN) and Artemis (DCLRE1) proteins involved in that process have exonuclease activity in order to process the DNA ends into forms which can be reunited. The Homologous Recombination (HR) pathway, which is more active during cell replication, is more likely to produce "gene conversion" which can involve copying of formerly masked mutated DNA (from the homologous chromosome) or sometimes large scale deletions or duplications as well as inversions if the "homology" detection proteins perform their job poorly. The other DNA repair processes generally involve only a single base and tend to be less destructive to the genome from a pathology standpoint (in part due to the redundancy in the genetic code in converting mRNA into proteins).
Re:Here's what I want to know... by bogado · 2009-07-14 07:02 · Score: 1

Each person has a different sequence; the first time, they sequenced just one of the billions of "human genomes". Sequencing different people could help find what makes one person different from another and, on the other hand, what makes us similar. :-)

--
[]'s Victor Bogado da Silva Lins
^[:wq
Re:Here's what I want to know... by ioshhdflwuegfh · 2009-07-14 07:34 · Score: 1

"Suppose they sequence a specific human's genome. Now they do it again. Will the two sequences be the same?"

Not necessarily. ;-)
[Reference needed]

How about storing it in analog format? by Anonymous Coward · 2009-07-13 11:43 · Score: 5, Funny

Just store all that data as a chemical compound. Maybe a nucleic acid of some kind? Using two long polymers made of sugars and phosphates? I bet the whole thing could be squeezed into something smaller than the head of a pin!

Re:How about storing it in analog format? by Anonymous Coward · 2009-07-13 11:58 · Score: 1, Funny

I bet the whole thing could be squeezed into something smaller than the head of a pin!
That's what she said!
Re:How about storing it in analog format? by hamisht · 2009-07-13 17:03 · Score: 1

I bet the whole thing could be squeezed into something smaller than the head of a pin!
or, indeed, smaller than the point of a pin

Money well spent by momerath2003 · 2009-07-13 11:58 · Score: 4, Insightful

We pissed away $3 billion dollars and 13 years of time, when we could have waited a few more years and got it done in a week, and much, much cheaper. What a waste of time and money that was....

I know I'm being trolled, but you're an idiot. It's pretty obvious that the ability to sequence the genome in a week could only result from techniques developed and information gathered in the original Human Genome project.

--
I had but a simple dream, to destroy all humans.

Re:Money well spent by bill_mcgonigle · 2009-07-13 15:06 · Score: 1

It doesn't really look like a troll, more a facetious back-handed compliment.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Re:Money well spent by quenda · 2009-07-14 13:58 · Score: 1

We pissed away $3 billion dollars and 13 years of time, when we could have waited ...
It's pretty obvious that the ability to sequence the genome in a week could only result from techniques developed and information gathered in the original Human Genome project.
But that doesn't mean you needed to spend $13b! Maybe only $1b was spent on the vital R&D, and the other 12b spent on building the massive parallel infrastructure and processes to implement it full-scale. Plus patent lawyers.
Re:Money well spent by momerath2003 · 2009-07-15 16:54 · Score: 1

The problem with any real-world project is that you can't tell a priori what the "vital" part is and what isn't. If that were the case, we as humans wouldn't waste time on trial and error, and it would instead be trial and success.

--
I had but a simple dream, to destroy all humans.
Re:Money well spent by Adam+Hazzlebank · 2009-07-17 21:17 · Score: 1

Playing devil's advocate. What from the original project was useful in developing 2nd gen sequencing? Assuming that we can now assemble human genomes de novo using Illumina reads.

Re:Moore's law at work? by blackbearnh · 2009-07-13 11:58 · Score: 3, Interesting

It wasn't the computing power that was the holdup, it was the sequencing throughput. Also, as noted in the article, they can do it in a week now partially because they have the completed human genome to use as a template to match things up against. As I analogized in the interview, it's like the difference between putting together a jigsaw puzzle with the cover image available, and doing one without.
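The jigsaw analogy can be sketched in a few lines: with a finished reference (the "cover image"), placing a read is a lookup rather than a puzzle. The snippet below is a toy that uses exact string matching on made-up sequences; real aligners tolerate mismatches and use indexed data structures, and the name `place_reads` is hypothetical.

```python
def place_reads(reference, reads):
    """Map each read to its first exact-match offset in the reference (-1 if none)."""
    return {read: reference.find(read) for read in reads}

# Made-up reference and reads, purely for illustration.
reference = "GATTACAGATCCGTA"
reads = ["TTACA", "GATCC", "CCGTA", "AAAAA"]

for read, offset in place_reads(reference, reads).items():
    print(read, offset)
```

Without the reference, the same reads would have to be ordered by comparing them against each other for overlaps, which is the harder "no cover image" case.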

Re:We pissed away $3 billion dollars by QuantumG · 2009-07-13 11:59 · Score: 4, Insightful

What's funny is that there are actually people who think like that. Apparently if we just sit around and wait, things will get better. I call this the dark side of the "invisible hand" of the market: because it is invisible, people forget how it comes about. In order to get improvement in technology you need a market for that technology. And, typically, you need some loss-leader to create the market in the first place. Government funding serves this purpose well.

--
How we know is more important than what we know.

Data analysis a rapidly growing problem in Biology by SlashBugs · 2009-07-13 12:00 · Score: 5, Informative

Data handling and analysis is becoming a big problem for biologists generally. Techniques like microarray (or exon array) analysis can tell you how strongly a set of genes (tens of thousands, with hundreds of thousands of splice variants) are being expressed under given conditions. But actually handling this data is a nightmare, especially as a lot of biologists ended up there because they love science but aren't great at maths. Given a list of thousands of genes, teasing out the statistically significantly different genes from the noise is only the first step. Then you have to decide what's biologically important (e.g. what's the prime mover and what's just a side-effect), and then you have a list of genes which might have known functions but more likely have just a name or even a tag like "hypothetical ORF #3261", for genes that are predicted by analysis of the genome but have never been proved to actually be expressed. After this, there's the further complication that these techniques only tell you what's going on at the DNA or RNA level. The vast majority of genes only have effects when translated into protein and, perhaps, further modified, meaning that you can't be sure that the levels you're detecting by the sequencing (DNA level) or expression analysis chips (RNA level) actually reflect what's going on in the cell.

One of the big problems studying expression patterns in cancer specifically is the paucity of samples. The genetic differences between individuals (and tissues within individuals) mean there's a lot of noise underlying the "signal" of the putative cancer signatures. This is especially true because there are usually several genetic pathways that a given tissue can take to becoming cancerous: you might only need mutations in a small subset of a long list of genes, which is difficult to spot by sheer data mining. While cancer is very common, each type of cancer is much less so; therefore the paucity of available samples of a given cancer type in a given stage makes reaching statistical significance very difficult. There are some huge projects underway at the moment to collate all cancer labs' samples for meta-analysis, dramatically increasing the statistical power of the studies. A good example of this is the Pancreas Expression Database, which some pancreatic cancer researchers are getting very excited about.

Buttload of data by virgil+Lante · 2009-07-13 12:01 · Score: 2, Interesting

Illumina's Solexa sequencing produces around 7 TB of data per genome sequencing. It's a feat just to move the data around, let alone analyze it. It's amazing how far sequencing technology has come, but how little our knowledge of biology as a whole has advanced. 'The Cancer Genome' does not exist. No tumor is the same and in cancer, especially solid tumors, no two cells are the same. Sequencing a gamish of cells from a tumor only gives you the average, which may or may not give any pertinent information about the tumor. Vogelstein's group has shown this quite convincingly but hardly anyone truly looks at what the data really says.
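For a rough sense of why moving 7 TB is a feat, here is a back-of-the-envelope sketch. The link speeds are illustrative assumptions, not figures from this thread, and the estimate ignores protocol overhead and disk throughput, so real transfers would be slower.

```python
def transfer_hours(n_bytes, link_bits_per_second):
    """Idealized transfer time in hours at full, sustained link speed."""
    return n_bytes * 8 / link_bits_per_second / 3600

RUN_BYTES = 7 * 10**12  # ~7 TB per genome, assuming decimal terabytes

# Hypothetical network links, chosen only for comparison.
for label, bps in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label}: {transfer_hours(RUN_BYTES, bps):.1f} h")
```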

Re:Buttload of data by LokiSteve · 2009-07-13 14:37 · Score: 1

Working with Illumina in a busy lab turns into a battle of platters faster than most people want to believe.

The next thing to hit is supposed to be 3D DNA modeling, where the interactions within the genome are mapped. Meaning: if x and y are present, b will be produced but will only be active if x is at position #100 or position #105; if it's at another position, c will be created, etc. It differs from normal mapping because the code AND position within the code are taken into account, so there are conditions added to almost every code sequence. Won't cause much more disk overhead but will kill processor time, and I'm too young to retire before then...

--
END OF LINE.

How do you know it's NOT comments? by Ungrounded+Lightning · 2009-07-13 12:19 · Score: 1

Functions that don't do anything, no comments, worst piece of code ever!

Most of it doesn't code proteins or any of the other things that have been reverse-engineered so far. How do you know it's NOT comments?

(And if terrestrial life was engineered and it IS comments, do they qualify as "holy writ"?)

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way

Re:How do you know it's NOT comments? by dunkelfalke · 2009-07-13 16:30 · Score: 1

There was some SF book I read, where it was explained that the comments were "made by a demo version of creature editor" and that was the reason for humans to die after 100 years. Some hacker has then found a way to reset the demo counter and thus to make people live forever.

--
"It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
Re:How do you know it's NOT comments? by NewsWatcher · 2009-07-13 20:25 · Score: 1

Like this girl?

--
If the pattern goes 9am, 10am, 11am, why isn't noon 12am?
Re:How do you know it's NOT comments? by SlashWombat · 2009-07-13 21:14 · Score: 1

How do you know it's NOT comments?
Come on, how many programmers do you know that write comments, meaningful or not? I personally have a massive descriptive dialogue running down the side. "Real" programmers have told me that is excessive. Looking at their code I find one comment every 20 to 50 lines, and descriptive identifiers, like i, x or y. The genome will be just like that. (Also, any big project ends up with lots of dead code. Yes, I know the compiler identifies that, but ...)
Re:How do you know it's NOT comments? by ioshhdflwuegfh · 2009-07-14 07:23 · Score: 1

There was some SF book I read, where it was explained that the comments were "made by a demo version of creature editor" and that was the reason for humans to die after 100 years. Some hacker has then found a way to reset the demo counter and thus to make people live forever.
Hey, what if that's how DNA works? Would that not be like awesome and stuff?
Re:How do you know it's NOT comments? by dunkelfalke · 2009-07-14 07:29 · Score: 1

bullshit. the average western life expectancy is over 80 nowadays and even third world countries manage 60 years. and that is only because lots of people die young (cancer, car accidents and so on).

--
"It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
Re:How do you know it's NOT comments? by Ungrounded+Lightning · 2009-07-14 09:08 · Score: 1

Come on, how many programmers do you know that write comments, meaningful or not?
Plenty. And the ones that do tend to have more functional programs, too. B-)
(My own code is heavily commented - to the point of providing a second full description of the design. And a colleague once said I'm the only person he'd trust to program his pacemaker. B-) )

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way

Re:Data analysis a rapidly growing problem in Biol by virgil+Lante · 2009-07-13 12:21 · Score: 1

You have to be very careful about what findings at different levels actually mean, and how the various levels correlate.

For example, when looking at duplications/expansions in cancer, an expansion of a locus results in about a 50% correlation between DNA level change and expression level change. Protein and gene expression levels correlate 50 to 60% of the time (or less, depending on whose data you look at). So, being gracious and assuming a 60% correlation at the two levels, you are already below a 40% correlation. Add in post-translational modifications, sub-cellular localization and the requirement for other players within a functional pathway to exhibit a specific behavior, and what you have is a tangled mess in which you can spin almost any story about a favorite gene. But does it have meaning for diagnosis and treatment? I'd definitely hedge my bet.

It's that big of a mess and that isn't even considering the vast population heterogeneity of each tumor.
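The "below 40%" figure appears to come from multiplying the agreement at each level. A minimal sketch of that arithmetic follows; treating the levels as independent and simply multiplying them is a rough heuristic assumed here for illustration, not a rigorous statistical result, and `chained_agreement` is a made-up name.

```python
def chained_agreement(*levels):
    """Multiply per-level agreement fractions (assumes independence)."""
    result = 1.0
    for level in levels:
        result *= level
    return result

# The gracious case: 60% at both DNA->RNA and RNA->protein.
print(round(chained_agreement(0.60, 0.60), 2))
# The less gracious case: 50% then 60%.
print(round(chained_agreement(0.50, 0.60), 2))
```

Each further level (modification, localization, pathway context) would multiply in another factor below 1, which is why the combined figure shrinks so quickly.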

Re:Data analysis a rapidly growing problem in Biol by olsmeister · 2009-07-13 13:04 · Score: 1

A good example of this is the Pancreas Expression Database, which some pancreatic cancer researchers are getting very excited about.

Kim Jong-il will be ecstatic to hear that. Dear Leader can't very well put the Grim Reaper into political prison....

DNA is digital by EndoplasmicRidiculus · 2009-07-13 13:25 · Score: 2, Informative

Four bases and not much in between.

Re:DNA is digital by gringer · 2009-07-13 15:10 · Score: 1

Plus histones, methylation, imprinting, a few thousand proteins, and a few pieces of RNA to bootstrap the transcription/translation.

--
Ask me about repetitive DNA
Re:DNA is digital by EndoplasmicRidiculus · 2009-07-14 06:32 · Score: 1

You're using a very liberal definition of DNA.
Re:DNA is digital by ioshhdflwuegfh · 2009-07-14 07:39 · Score: 1

You're using a very liberal definition of DNA.
I'm telling you, liberals are destroying this great country of ours.

Humans have ~810.6 MiB of DNA by izomiac · 2009-07-13 13:27 · Score: 2, Interesting

The human genome is approximately 3.4 billion base pairs long. There are four bases, so this would correspond to 2 bits of information per base. 2 * 3,400,000,000 /8 /1024 /1024 = 810.6 MiB of data per sequence. That doesn't seem like it'd be too difficult. With a little compression it'd fit on a CD. Now, I suppose each section is sequenced multiple times and you'd want some parity, but it still seems like something that'd easily fit on a DVD (especially if alternate sequences are all diff'd from the first). Perhaps throw in another disc for pre-computed analysis results and that ought to be it.
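The arithmetic above, and the 2-bits-per-base encoding it relies on, can be sketched as follows. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names are arbitrary choices for illustration; the sketch also ignores ambiguity codes such as N, which a real format would need to handle.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # arbitrary assignment
BASES = "ACGT"

def pack(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack(data, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )

def genome_mib(n_bases, bits_per_base=2):
    """Size in MiB of n_bases stored at bits_per_base bits each."""
    return n_bases * bits_per_base / 8 / 1024 / 1024

packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
print(round(genome_mib(3_400_000_000), 1))
```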

So, what's going on here? Are the file formats used to store this data *that* bloated? Or are they trying to include structural information beyond sequence? What am I missing that makes this an unwieldy amount of data?

(I have to laugh at how Vista is apparently 20 times more complex than the people that use it...)

Re:Humans have ~810.6 MiB of DNA by blackbearnh · 2009-07-13 13:48 · Score: 1

Right, but that's the finished product. You start with a ton of fragments that you need to sequence and fit together, and there's overlap, and multiple reads (like 30?) of each sequence, so you end up with much much more data that gets refined down into the end sequence.
Re:Humans have ~810.6 MiB of DNA by rnaiguy · 2009-07-13 14:02 · Score: 1

You have to take into account that sequencing machines do not just spit out a pretty string of A, C, T, G. For the older sequencing method, the raw data from the sequencing machine consists of 4 intensity traces (one for each base), so you have to record 4 waves, which are then interpreted (sometimes imperfectly) by software to give you the sequence. The raw data does have to be stored and moved around for some period of time, and often needs to be stored for other analyses. This data is around 200 kilobytes for less than 1 kilobase of sequence. The newer methods collect data as a series of very high-resolution images (something like a black image with ~10 million colored spots), which take up TONS of space, and take substantial processing power to interpret and turn into nucleotide sequence. I don't have exact numbers though, since I haven't worked with them directly, only the preprocessed data (which is still several gigabytes for a gigabase of sequence, since it contains data on the quality/certainty of each base read and such)
Re:Humans have ~810.6 MiB of DNA by izomiac · 2009-07-13 14:49 · Score: 2, Interesting

Interesting, I was assuming that it was more of the former method since I hadn't studied the latter. Correct me if I'm wrong, but as I remember it that method involves supplying only one type of fluorescently labeled nucleotide at a time during in vitro DNA replication and measuring the intensity of flashes as nucleotides are added (e.g. brighter flash means two bases were added, even brighter if it's three, etc.). Keeping track of four sensors at 200 bytes per base would imply sensors that could detect 133 levels of brightness or 8 measurements per base at 16 levels of brightness. That seems like a lot higher resolution than the example data sheets I've seen, but maybe that's what current technology can do. Still though, most bases are fairly unambiguous so the bulk of the sequence could likely be stored as results only.

The new method sounds like they're doing a microarray or something and just storing high resolution jpegs. I could see why that would require oodles of image processing power. It does seem like an odd storage format for what's essentially linear data.

I suppose my point is more that they're storing a lot of useless information. I could see storing a ton of info about a sequence back when graduate students were adding nucleotides and interpreting graphs by hand, but in this day and age you'd just redundantly sequence until you got to the desired accuracy. I couldn't imagine that it'd be cheaper to have technicians manually tweak the entire sequence.

BTW, I'm not arguing against you, more against some of the design decisions of automated sequencers. You clearly know a lot more about the subject than my undergrad degree allows me to even think about refuting.
Re:Humans have ~810.6 MiB of DNA by johannesg · 2009-07-13 19:19 · Score: 1

So, what's going on here? Are the file formats used to store this data *that* bloated?
<genome species="human">... ;-)
Re:Humans have ~810.6 MiB of DNA by RDW · 2009-07-14 04:03 · Score: 1

'The new method sounds like they're doing a microarray or something and just storing high resolution jpegs. I could see why that would require oodles of image processing power. It does seem like an odd storage format for what's essentially linear data.'
There's a good summary of the technology here:
http://seqanswers.com/forums/showthread.php?t=21
Millions of short sequences are analysed in a massively parallel way, and you need to take a new high resolution image for every cycle to see which 'polonies' the new base has been added to, so obviously there's a large amount of image data to store (you can throw it away afterwards, of course, but since it probably cost you several thousand $USD to generate, you might want to hang on to it for a bit until you're sure you've analysed it properly).
But the image series isn't the only large data set you have to deal with. A Solexa/Illumina sequencer can generate 20 gigabases of data from a single paired-end run. This seems like a lot, but you'll actually need multiple runs to assemble a genome with any confidence (to make sure you've got both alleles, and can distinguish variants from artefacts - 30-fold 'depth' is common). It's this raw short read data, rather than the images, that would typically be delivered to a scientific end user (algorithms to assemble and align it to a reference genome are still in active development, so you probably don't want to throw it away in case a better method of processing it comes along next month).
Even the final product, a complete assembled genome, takes up a bit more space than you might think. You'll be getting diploid data, so that's over 6 gigabases. And although 2-bit encoding is used for some purposes (like BLAT databases for fast pattern matching), finished sequences are typically stored as standard 8-bit ASCII text files for convenience, so the whole thing is over 6 GB uncompressed (a couple of GB gzipped). James Watson got his genome on a couple of (presumably single layer) DVDs. Commercial personal genomics services will ship yours on a fancy encrypted USB stick:
http://www.knome.com/service/genomekey.html
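
To make the 2-bit versus 8-bit comparison concrete, here is a minimal Python sketch of packing four bases into each byte. The base-to-code mapping and the function names are assumptions chosen for illustration; this is not the on-disk layout of BLAT's or any other real format, and it ignores ambiguity codes such as N.

```python
# Hypothetical mapping; real 2-bit formats may order the bases differently.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; length is needed to drop the padding bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACAGATC"
packed = pack_2bit(seq)
assert unpack_2bit(packed, len(seq)) == seq
print(len(seq.encode("ascii")), "bytes as ASCII,", len(packed), "bytes packed")

# Scaling up to the diploid figure quoted above (about 6 gigabases):
diploid_bases = 6_000_000_000
print(diploid_bases / 1e9, "GB as ASCII,", diploid_bases / 4 / 1e9, "GB at 2 bits per base")
```

The plain-text form costs four times the space but can be read with any text tool, which is presumably the "convenience" referred to above.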

Re:We pissed away $3 billion dollars by cupantae · 2009-07-13 13:31 · Score: 1

Well, EENterestingly, that's pretty much what people are saying when they complain that early paleontologists ruined priceless artifacts.

You learn as you go, like when you're learning to play Ghosts 'n' Goblins and you keep getting killed by the red gargoyle, but then you eventually learn that you have to jump away from him as he swoops and fire frantically towards him. I know other people have made similar responses, but I only understand things in terms of analogies. Particularly ones related to throwing lances at gargoyles.

--
--

Re:Data analysis a rapidly growing problem in Biol by Daniel+Dvorkin · 2009-07-13 14:00 · Score: 1

The vast majority of genes only have effects when translated into protein

That depends on your definition. If you define a gene as "stretch of DNA that is translated into protein," which until fairly recently was the going definition, then of course your statement is tautologically true (replacing "the vast majority of" with "all.") But if you define it as "a stretch of DNA that does something biologically interesting," then it's no longer at all clear. Given the number of regulatory elements not directly associated with genes, sections of DNA that code for RNAzymes, etc., it may well be that the majority of "genes" are not protein-coding at all. Going back to the Mendelian definition of a gene as a unit of inheritance, this looks more and more likely.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.

Re:Passing this data back to the scientist by goombah99 · 2009-07-13 14:12 · Score: 2, Interesting

a whole human genome will fit on a CD.

if you just transmit the diffs from the generic human you could put it in an e-mail

--
Some drink at the fountain of knowledge. Others just gargle.

Re:Passing this data back to the scientist by goombah99 · 2009-07-13 14:37 · Score: 3, Insightful

I suppose it's worth noting that the intermediate (raw) data sets can get pretty large. They are actually getting larger as the trend goes towards shorter, less informative "reads" that require more of them to recover the connective information and to recover from errors and duplications. However, that's a trend that has a stopping point. While more reads is better, at some point there is almost no added value from more reads. So at that point that's the maximum amount of data you need to collect. It won't ever increase. Meanwhile hard drive and network speeds will go up by factors of ten.

thus the storage issues here are well tolerated at present and soon will become trivial.

--
Some drink at the fountain of knowledge. Others just gargle.

Re:Passing this data back to the scientist by goombah99 · 2009-07-13 14:46 · Score: 1

This actually suggests that perhaps we should start transmitting into space or on spacecraft the genome of all the genes ever sequenced, even the ones hauled out of the ocean that we don't know what organism they belong to. You send that, plus the molecular composition of DNA, and the molecular structure of the ribosome and tRNA.

While there's more to a cell than just that, it's well known that in vitro you can get transcription of the DNA from just that. It won't be too long, I suspect, before you could come up with some way to bootstrap a primordial cell out of those expressed proteins. Once you have such a cell, bootstrapping to higher level organisms is not such a long leap.

You would be effectively preserving an approximation of the earth's ecosystem. maybe someone will find it.

--
Some drink at the fountain of knowledge. Others just gargle.

Re:Passing this data back to the scientist by Neil+Blender · 2009-07-13 14:57 · Score: 1

A single run on a Solexa next gen sequencer can generate over 200GB of data and half a million files. And that is for 8 samples only. You get into the terabyte range very quickly.

That's why data is delivered on hard drives.

Re:Moore's law at work? by Daniel+Dvorkin · 2009-07-13 15:07 · Score: 1

What does "sequence a genome" actually mean. The name "sequence" suggests that it has something to do with the "order" of something. Your post makes it sound like sequencing is something done before the computer gets ahold of the data. Can you explain for us genetics laypersons what the heck "sequencing" is? Tnx.

Very simply: Your DNA is stored in chromosomes. Each chromosome contains DNA in tight bundles with lots of weird secondary and tertiary structure. Suppose that you took all the chromosomes from one of your cells -- i.e., your genome -- unwound the DNA into long threads, and laid those threads out. You'd then have the chemical equivalent of strings of characters, e.g. ACGTGCATT ..., one for each chromosome, where each character represents a particular base. (I'm not going to get into the biochemistry, and anyway, there are probably people here who can explain it better than I can -- I'm a bioinformaticist, but mainly a numbers guy.) This ordered set of strings of characters is what's known as "the sequence," and "to sequence" a genome is to obtain that set.

Unfortunately, the actual sequencing process is a hell of a lot more complicated than what I just described, and considerable computational power is required at all stages of the process. But really the number crunching isn't the bottleneck, it's the biochemistry. And that's been improving rapidly, so now we have the ability to do the "wet-lab" work necessary to get an entire human (or any other organism) genome sequence a lot faster than we used to.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.

I also manage a Next-gen Sequencing Machine by Anonymous Coward · 2009-07-13 15:22 · Score: 3, Interesting

Next gen sequencing eats up huge amounts of space. Every run on our Illumina Genome Analyzer II machine takes up 4 terabytes of intermediate data, most of which comes from the something like 100,000+ 20 MB bitmap picture files taken from the flowcells. All that much data is an ass load of work to process. Just today I got a little lazy with my Perl programming and let the program go unsupervised...and it ate up 32 GB of RAM and froze up the server. Took Red Hat 3 full hours to decide it had enough of the swapping and kill the process.

For people not familiar with current generation sequencing machines, they can scan between 30-80 bp reads and use alignment programs to match up the reads to species databases. The reaction/imaging takes 2 days, prep takes about a week, processing images takes another 2 days, alignment takes about 4. The Illumina machine achieves higher throughput than the ABI ones but gives shorter reads; we get about 4 billion nt per run if we do everything right. Keep in mind though, that 4 billion that they mention in the summary is misleading: the read coverage distribution is not uniform (i.e. you do not cover every nucleotide of the human's 3 billion nt genome). To ensure 95%+ coverage, you'd have to use 20-40 runs on the Illumina machine...in other words, about 6-10 months of non-stop work to get a reasonable degree of coverage over the entire human genome (at which point you can use programs to "assemble" the reads in a contiguous genome). WashU is very wealthy so they have quite a few of these machines available to work at any given time.
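
For comparison, here is what an idealised model would predict if reads did land uniformly at random. It is a sketch using the standard Poisson (Lander-Waterman style) approximation, which is an assumption brought in for illustration and not something from the comment; the function names are made up, and the 4 billion nt per run and 3 billion nt genome figures are taken from the paragraph above.

```python
import math


def expected_fraction_covered(total_bases_read: float, genome_size: float) -> float:
    """Expected fraction of positions hit at least once under uniform random sampling."""
    depth = total_bases_read / genome_size  # average coverage depth
    return 1.0 - math.exp(-depth)


def runs_for_coverage(target: float, bases_per_run: float, genome_size: float) -> int:
    """Whole runs needed to reach the target covered fraction in the same idealised model."""
    depth_needed = -math.log(1.0 - target)
    return math.ceil(depth_needed * genome_size / bases_per_run)


GENOME = 3e9   # nt, as quoted above
PER_RUN = 4e9  # nt per run, as quoted above

print(f"one run covers about {expected_fraction_covered(PER_RUN, GENOME):.1%}")
print("runs for 95% in the ideal model:", runs_for_coverage(0.95, PER_RUN, GENOME))
```

The ideal model asks for only a handful of runs, so the gap between that and the 20-40 runs reported above gives a rough sense of how much non-uniform coverage, unmappable reads and quality filtering cost in practice.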

The main problem these days is that processing all that much data requires a huge amount of computer knowhow (writing software, algorithms, installing software, using other people's poorly documented programs), and a good understanding of statistics and algorithms, especially when it comes to efficiency. Another problem they never mention is artifacts from the chemical protocol; just the other day we found a very unusual anomaly that indicated the first 1/3 of all our reads was absolutely crap (usually only the last few bases are unreliable); turned out our slight modification of the Illumina protocol to tailor it to studying epigenomic effects had quite large effects on the sequencing reactions later on. Even for good reads, a lot of the bases can be suspect so you have to do a huge amount of averaging, filtering, and statistical analysis to make sure your results/graphs are accurate.

Re:We pissed away $3 billion dollars by johannesg · 2009-07-13 19:18 · Score: 1

What's funny is that there are actually people who think like that. Apparently if we just sit around and wait, things will get better. I call this the dark side of the "invisible hand" of the market.. because it is invisible, people forget how it comes about. In order to get improvement in technology you need a market for that technology. And, typically, you need some loss-leader to create the market in the first place. Government funding serves this purpose well.

The sad thing is that this seems to be pretty much par for the course. If only we wait just a little while and skip all those annoying intermediate steps, we will soon have fantastically good rockets / fusion reactors / whatever else without having to pay anything...

Re:Moore's law at work? by quadrox · 2009-07-13 20:48 · Score: 1

If it has taken 13 years until recently to properly sequence the genome of a single human, how has it been possible to do DNA "fingerprinting" e.g. for crime investigations? Is the actual sequencing not required there?

Re:Passing this data back to the scientist by jacquesm · 2009-07-13 21:20 · Score: 1

yes, let's give those aliens something to experiment on, so they can figure out what bugs to send our way to exterminate us :)

--
MP3 Search Engine

Genome as a cause? by Hurricane78 · 2009-07-13 22:41 · Score: 1

Well, how about pollution, processed food, and all that trash being the main reason we get cancer?

Cancer was not even a known disease, a century ago, because nobody had it. (And if people get cancer now, way before the average age of death a century ago, then it can't be that it is because we now get older.)

But I guess there is no money in that. Right?

--
Any sufficiently advanced intelligence is indistinguishable from stupidity.

Re:Genome as a cause? by SlashBugs · 2009-07-14 01:25 · Score: 1

Cancer has been with us throughout recorded history. Ancient Egyptian, Greek, Roman and Chinese doctors described and drew tumours growing on their patients covering a span of about 2000-4000 years ago. There's also archeological evidence of cancers much older than that, e.g. in Bronze age fossils.

Cancer has become more common over the last hundred years or so. A huge part of that is simply the fact that we're living much longer, meaning that the odds of a given person developing cancer are much higher.

Of course you're right that environmental factors are important. Smoking and increased alcohol consumption are probably the biggest contributors, probably followed by poorly tested or controlled industrial synthetics like Asbestos. I've no idea what makes you think that no-one is researching this stuff. It's not exactly hard to find: cancer.org and cancerresearch.org.uk are great places to start reading about the known risk factors in modern life. Or, you know, there's google.

Probably the best source about risk factors is this huge meta-analysis of cancer papers. A science journalist's summary: In addition to the cancer risk associated with excess body fat, the WCRF-AICR study offered 10 lifestyle recommendations to help ward off cancer, including limiting red meat consumption and excessive drinking, exercising daily, avoiding processed meats such as bacon and ham, and eating a diet rich in fruits, vegetables and whole grains. The research synthesizes many individual reports that have found similar lifestyle-cancer connections for specific cancers.

But even with cancers caused by environmental factors, there's still good reason to sequence genomes. Cancer develops as a result of a cell's DNA becoming damaged in ways that constitutively activate its replication programmes and suppress its checkpoint and suicide programmes. So sequencing the genome of cancer cells gives a lot of information about exactly how those cells became cancerous (although we're not sure what we're looking for yet), which in turn suggests ways to treat that specific cancer. Alternatively, sequencing healthy cells from people can give us information about why some populations are at higher risk of developing cancer. For example, carriers of specific forms of the BRCA1, BRCA2 or BRIP1 gene are at higher risk of developing breast cancer than the rest of the population. These discoveries gave us insight into how this cancer develops, which hints at possible treatments. Also, if someone has their genome sequenced and discovers these faulty genes they can take steps to avoid other risk factors (alcohol, etc) to control their risk, and attend more regular screening than the general population.

Re:Moore's law at work? by cbailster · 2009-07-13 22:56 · Score: 2, Informative

Fingerprinting doesn't rely on DNA sequencing, but does rely on the DNA sequence being different between people. Everyone's DNA contains subtle differences (particularly in the non-coding DNA regions). These differences can be exploited by various laboratory techniques to produce small pieces of DNA which will be of different sizes because of these differences. When these fragments of DNA are run down a suitable gel (usually agarose, a substance derived from seaweed) under an electric current the fragments will separate by size. The pattern of fragments formed will be unique for each individual.

Several fingerprinting techniques rely on what most programmers would best recognise as regular expression matching. For example there are enzymes in biology which will recognise certain DNA sequences but not others, and will cut the DNA in two wherever this sequence is matched. (in Perl:

my @dna_fragments = split /GAATTC/, $my_dna;

is the equivalent of what an enzyme called EcoRI does). Not everyone will have the same numbers of this sequence in their DNA, and nor will they be in the same place, thus the number and size of fragments will differ. By using a suitable range of such enzymes you can generate a pattern of DNA fragments which is sufficiently unique as to identify a single person amongst a population of several billion.
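
To show how the fragment pattern differs between two people, here is a small Python sketch along the same lines as the Perl one-liner. Unlike a plain split, it keeps the recognition site in the fragments, cutting between the G and the AATTC as EcoRI does on one strand. The two toy sequences and the function names are invented for illustration, and real profiling does not work on strings this short.

```python
import re

ECORI_SITE = "GAATTC"


def digest(dna: str, site: str = ECORI_SITE, cut_offset: int = 1) -> list[str]:
    """Cut dna at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    for match in re.finditer(site, dna):
        cut = match.start() + cut_offset
        fragments.append(dna[start:cut])
        start = cut
    fragments.append(dna[start:])
    return fragments


def fragment_sizes(dna: str) -> list[int]:
    """Sorted fragment lengths: roughly what the bands on a gel would report."""
    return sorted(len(fragment) for fragment in digest(dna))


# Two made-up individuals differing by a single base inside the second site.
person_a = "TTGAATTCAAGGGAATTCCC"
person_b = "TTGAATTCAAGGGAGTTCCC"

print(fragment_sizes(person_a))  # two sites, so three fragments
print(fragment_sizes(person_b))  # one site lost, so two fragments
```

A single base change removes a cut site and merges two fragments into one longer one, which is the kind of size difference that shows up on the gel.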

For more information take a look at DNA Profiling on Wikipedia.

I want to copyright my dna. Then, it can't be.... by CFD339 · 2009-07-14 00:29 · Score: 1

...used against me for anything without violating the DMCA. The act of decoding it by some forensics lab paternity test or future insurance company medical cost profile would become unlawful and I'm sure the RIAA would help me with the cost of prosecuting the lawsuit.

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln

Where's Nedry ? by ciderVisor · 2009-07-14 00:31 · Score: 1

Check the vending machines !

--
Squirrel!

wow. read a book. by CFD339 · 2009-07-14 00:41 · Score: 2, Insightful

First, kinds of cancers were known to exist a century ago. Tumors and growths were not unheard of. Most childhood cancers killed quickly and were undiagnosed as specific disease other than "wasting away". When the average lifespan was 30-40 years, a great many other cancers were not present because people didn't live long enough to die from them.

As we cure "other" diseases, cancers become more likely causes of death. Cells fail to divide perfectly, some may go cancerous others simply don't produce as healthy a replacement specialized cell. Your arteries harden, muscles don't repair as well, other tissues don't work as well (you get weaker, more wrinkled, easier to fall ill). Eventually either something fails that can't be repaired or enough cells go cancerous. Until we either figure out how to replace the body (seems unlikely as the brain and body are more tied together than sf movies like to present) or we figure out how to make cells repair/refresh themselves without shortening their telomeres -- I have no idea how likely that actually is.

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln

Re:Data analysis a rapidly growing problem in Biol by SlashBugs · 2009-07-14 01:43 · Score: 1

Very good points. I think I've been using a very sloppy definition of gene, just a vague idea that it's only DNA>RNA>protein>action or DNA>RNA>action. I've never really got deeply into thinking about regulatory elements, etc. It's compounded by the fact that, while I'm interested in cancer, most of my actual work is with a DNA-based virus that only produces a very few non-translated RNAs that we're aware of. I have a tough time convincing some people that even those are biologically relevant.

I sometimes think that RNAs and various epigenetic factors (I'm including DNA secondary and tertiary structures here) fall into the same trap as a lot of post-translational protein modifications: They're hard to study so not much is written or understood about them, so most non-specialists basically ignore them and decide they can't be too important. It's changing now as techniques evolve to do the experiments, but I'm still shocked how often I see someone basically say "well we don't understand this so we'll assume it's not affecting our system".

Re:Passing this data back to the scientist by goombah99 · 2009-07-14 02:49 · Score: 1

I'm curious how you figure 200GB of data. A Solexa 1G only produces tens of millions of reads per run, each read being about 36 bases.

--
Some drink at the fountain of knowledge. Others just gargle.

Re:Passing this data back to the scientist by Neil+Blender · 2009-07-14 04:13 · Score: 1

There are raw intensity files which are used for base calling. The output of the base calling is used to generate alignments, quality, etc. You don't just get the short reads at the end. Most people are just going to use the short reads but that still can be 30G of data for a run.

Re:DNA GATC ... G-GNO-ME by davidsyes · 2009-07-14 06:46 · Score: 1

Splice too much of that bad, useless, convoluted code into a "new" human and we might end up with a G-Gnome or GNOME (Gratuitous, Nascent, Ogreous, Mechanised Entity). Call it... "G-UNIT", and give it a uniform and a mission. Or, give it a script and a part and call it Smeegul/Smigel...

--
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"

Diff? by bogado · 2009-07-14 06:59 · Score: 1

While a single human genome is a lot of information, storing thousands shouldn't add much to the requirements: one can simply store a diff from the first.
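
A toy version of the diff idea, as a Python sketch: store only the positions where a sample differs from a reference, and rebuild the sample on demand. It handles single-base substitutions only, which is a simplifying assumption; real genomes also differ by insertions, deletions and larger rearrangements, so an actual scheme would need more than this. The function names and sequences are invented for illustration.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles substitutions (equal lengths)")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the sample sequence from the reference plus its diff."""
    bases = list(reference)
    for position, base in diff:
        bases[position] = base
    return "".join(bases)


reference = "GATTACAGATTACA"
sample = "GATTACAGCTTACA"

diff = diff_against_reference(reference, sample)
assert apply_diff(reference, diff) == sample
print(diff)  # one substitution, stored instead of the whole sequence
```

The saving depends on how few positions differ between individuals, so the diff for each additional genome should be small next to the full sequence.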

--
[]'s Victor Bogado da Silva Lins

^[:wq

Re:Passing this data back to the scientist by ioshhdflwuegfh · 2009-07-14 07:17 · Score: 1

Illumina will sequence your genome for $48,000.

http://scienceblogs.com/geneticfuture/2009/06/illumina_launches_personal_gen.php

Details.

Helluva lots of details. After wasting some perfectly useful clicks all I could come up with was:

Illumina's technology is extremely well-established, and serves as the backbone for most large-scale genome sequencing projects currently underway (including the majority of the samples sequenced as part of the 1000 Genomes Project); that gives it an edge over the more experimental technology employed by competing sequence provider Complete Genomics.

That's so much details that I really want my clicks back. You should be ashamed of yourself, Sir.

Re:Passing this data back to the scientist by quenda · 2009-07-14 13:14 · Score: 1

This actually suggests that perhaps we should start transmitting into space

In an infinite universe, our DNA already exists out there. It's just a question of how far. Philosophically, there is nothing to be gained by pumping it out to our possible neighbours. It could only be used against us.

Re:Passing this data back to the scientist by Adam+Hazzlebank · 2009-07-17 21:10 · Score: 1

The raw images from the device alone can take up this much space. 8 lanes, 300 imaging regions (tiles) per lane. Each imaged 4 times (one for each base/channel). A typical run is 37 cycles (base pairs), paired end runs (now typical) double this so:

8 * 300 * 4 * 37 * 2 = 710400

On a GA1 those files are 2 MB each, giving you around a terabyte and a half of primary data to process. Image analysis takes place, processing those files into "intensity files". Those are further processed into corrected intensities, then basecalls. Each of these steps produces a similar number of files. Some details of the process here: http://sgenomics.org/mediawiki/upload/8/80/Pipeline.pdf
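
The file count and total size above can be checked with a short Python sketch. The function and parameter names are invented for illustration; the numbers are the GA1 figures given in this comment, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
def image_file_count(lanes: int, tiles_per_lane: int, channels: int,
                     cycles: int, paired_end: bool = True) -> int:
    """Number of raw image files: one per lane, tile, channel and cycle."""
    reads_per_cluster = 2 if paired_end else 1
    return lanes * tiles_per_lane * channels * cycles * reads_per_cluster


# GA1 figures from the comment: 8 lanes, 300 tiles, 4 channels, 37 cycles, paired end.
files = image_file_count(lanes=8, tiles_per_lane=300, channels=4, cycles=37)
mb_per_file = 2
total_tb = files * mb_per_file / 1_000_000

print(files, "image files")
print(f"about {total_tb:.2f} TB of raw images")
```

Changing the tile and cycle arguments gives the file count for other instrument configurations, though the per-file size would have to be measured rather than assumed.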

Those numbers are for a GA1; the current version of the instrument has fewer imaging regions (100). However, cycle length has increased (typically now 75+ bp).
As a side note all the tools used are "shared source" and not available under an open source license. There is a project called Swift which is an open source tool to do this: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp383
