Genetic Stone Soup
It's the scientific achievement of our generation; what can you say about the mapping of the human genome? But here's a story behind the story. parvati turned us on to this NYT article
about James Kent, who wrote the gene assembly program
GigAssembler
last June. It turns out that, thanks to his code, the public
Human Genome Project
had actually finished its work three days before the private effort by
Celera Genomics
-- a feather in their cap and a boon to public science. The head of Celera was "astonished" to learn of this grad student's genius -- ten thousand lines of C in a month, and why? -- "because of his concern that the genome would be locked up by commercial patents if an assembled sequence was not made publicly available for all scientists to work on." (The debate over
public vs. private science
continues to rage; see
this Seattle P-I article,
which discusses among other things the ethics of NDA'ing scientific data produced for profit.)
Update: 02/13 02:26 PM by J : Thanks to tlunde for finding the link to GigAssembler and thus clarifying which language it was written in.
They measure things such as the number of fragments (fewer=better) and their lengths (longer=better) and estimated coverage of the genome.
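To make those measures concrete, here is a minimal sketch in C of how one might summarize an assembly from a list of fragment lengths. It is not taken from either project: the struct, the function names, and the use of N50 as the "longer=better" summary are my own assumptions, and the genome size is whatever estimate you supply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical summary record; not from either assembly project. */
struct assemblyStats
{
    size_t fragmentCount;  /* number of fragments (fewer is better) */
    long totalBases;       /* sum of all fragment lengths */
    long n50;              /* fragments at least this long hold half the bases */
    double coverage;       /* totalBases / estimated genome size */
};

static int cmpLongDesc(const void *a, const void *b)
/* qsort comparator that puts the longest fragment first. */
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x < y) - (x > y);
}

struct assemblyStats summarizeAssembly(long *lengths, size_t count, long genomeSize)
/* Sorts lengths in place (longest first) and returns summary statistics. */
{
    struct assemblyStats stats = {0, 0, 0, 0.0};
    long running = 0;
    size_t i;

    assert(lengths != NULL && count > 0 && genomeSize > 0);
    stats.fragmentCount = count;
    for (i = 0; i < count; ++i)
        stats.totalBases += lengths[i];
    qsort(lengths, count, sizeof(lengths[0]), cmpLongDesc);
    /* Walk from the longest fragment down until half the bases are covered. */
    for (i = 0; i < count; ++i)
    {
        running += lengths[i];
        if (2 * running >= stats.totalBases)
        {
            stats.n50 = lengths[i];
            break;
        }
    }
    stats.coverage = (double)stats.totalBases / genomeSize;
    return stats;
}
```

Given two assemblies of the same genome, the one with the lower fragmentCount and the higher n50 and coverage would look better by these measures, though none of them says anything about whether the fragments were joined correctly.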
There is also a less biased comparison over at the Nature website. I don't know if you can get to read it without a paid subscription though. Their findings are less controversial, saying that the statistics are similar for the two assemblies, but that the annotations (i.e. descriptions of what is actually there; compare a group photo with notes on people's names and their relationships) are better in the public version.
Lars
__
Reality or nothing.
I left PDI six years ago to start my own company, and I've lost touch with Jim; I had heard vaguely that he'd 'gone back to school', but I had no idea that he was up to something this big. It's great to see an old friend make such a great contribution in a new field. Way to go, Jim!
Another old friend of mine, Carter Burwell, went the other way, from doing genetics with Watson at his Cold Spring Harbor laboratory, to working on early computer graphics at NYIT in the early 80s, to becoming one of the pioneers in digital music, and is finally now a leading composer for films such as the recent O Brother, Where Art Thou (somehow passed over by the Academy).
And I'm still just making pictures :) Oh well.
thad
I love Mondays. On a Monday, anything is possible.
On another, slightly more disturbing note, I am somewhat concerned about the use of academic funding to compete with commercial enterprises. Just because RMS does it doesn't make it right.
Celera have released their sequence under a license that restricts commercial usage (something vaguely like the Sun open source license thing, whatever it's called), whereas the public effort has released their work into the public domain (pretty BSDish, really). If Celera were the only group releasing this data, academic research into the human genome would not be able to attract the same sort of investment and would proceed significantly more slowly than it otherwise would. Using academic funding in this case secures a future for academic research in a very important field.
Agreed, he likely brought a huge amount of pre-existing skill in matrix math. But 10k lines of assembly language hacking to beat richly funded capitalists with super-computers in four weeks is a truly amazing hack, no matter what his skill level.
BTW, his home page doesn't say: anyone know what graphics software he worked on before? The name seems familiar - I think he used to hack math for a package called Digital Arts, but I could be wrong.
What's disturbing is that academic institutions are being forced to compete with commercial enterprises that, frankly, should not exist. The idea of a commercial enterprise doing something as important to the entire human race as the sequencing of the genome with the intention to control distribution of the resulting science is deeply offensive. Just because you can make money doing something doesn't make it right.
"How perfectly Goddamn delightful it all is, to be sure" Charles Crumb
From http://genome.ucsc.edu/goldenPath/algo.html:
Assembly Process Overview
The assembly proceeds according to the following major steps:
Decontaminating and repeat masking the sequence.
Alignment of mRNA, EST, BAC end, and paired plasmid reads against genomic fragments. On a cluster of one hundred 800 MHz Pentium III CPUs running Linux this takes about three days.
Creating an input directory structure using Washington University map and other data. This step takes about an hour on a single computer.
For each fingerprint clone contig, aligning the fragments within that contig against each other. This takes about three hours on the cluster.
Using the GigAssembler program within each fingerprint clone contig to merge overlapping fragments and to order and orient the resulting sequence contigs into scaffolds. This takes about two hours on the cluster.
Combining the contig assemblies into full chromosome assemblies. This takes about twenty minutes on one computer.
The steps will be described in more detail below.
[snip]
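For anyone wondering what "merging overlapping fragments" means at its simplest, here is a toy sketch in C. It assumes exact matches and a known orientation, and the function names are made up for illustration. The real pipeline works from alignments and has to cope with sequencing errors, repeats, and fragments in either orientation, so do not read this as GigAssembler's method.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

size_t suffixPrefixOverlap(const char *left, const char *right, size_t minOverlap)
/* Length of the longest suffix of left that exactly equals a prefix of right,
 * or 0 if no such overlap is at least minOverlap bases long. */
{
    size_t leftLen = strlen(left), rightLen = strlen(right);
    size_t len = leftLen < rightLen ? leftLen : rightLen;

    assert(minOverlap > 0);
    for (; len >= minOverlap; --len)
        if (memcmp(left + leftLen - len, right, len) == 0)
            return len;
    return 0;
}

char *mergeFragments(const char *left, const char *right, size_t minOverlap)
/* Returns a newly allocated merged sequence, or NULL if the fragments do not
 * overlap enough or allocation fails.  The caller frees the result. */
{
    size_t overlap = suffixPrefixOverlap(left, right, minOverlap);
    size_t leftLen = strlen(left), rightLen = strlen(right);
    char *merged;

    if (overlap == 0)
        return NULL;
    merged = malloc(leftLen + rightLen - overlap + 1);
    if (merged == NULL)
        return NULL;
    memcpy(merged, left, leftLen);
    /* Append only the part of right that extends past the overlap. */
    strcpy(merged + leftLen, right + overlap);
    return merged;
}
```

Calling mergeFragments("ACGTTGCA", "TGCAGGAT", 3) finds the shared "TGCA" and joins the two into one longer fragment. The minOverlap threshold is there because very short matches happen by chance all the time.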
The program was NOT written in "assembler". From Appendix B:
mRNA Scoring Function
int scoreMrnaPsl(struct psl *ali, boolean isEst)
/* Return score for one mRNA oriented psl. */
{
    int milliBad;
    int score;
    milliBad = calcMilliBad(ali, TRUE);
    score = 25*log(1+ali->match) + log(1+ali->repMatch) - 10*milliBad + 10;
    if (ali->match <= 10)
        score -= (10-ali->match)*25;
    if (isEst)
        score -= 25;
    else
        score += 25;
    return score;
}
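To get a feel for the numbers, here is the same arithmetic worked through by hand. The inputs are made up for illustration (match = 1000, repMatch = 0, milliBad = 10 in the first case; match = 5, repMatch = 0, milliBad = 0 in the second), calcMilliBad is not shown in the excerpt so its value is simply assumed, and log is C's natural logarithm. Assigning to the int score truncates the fractional part.

```latex
\begin{aligned}
\text{long alignment:}\quad
  & 25\ln(1001) + \ln(1) - 10\cdot 10 + 10 \approx 172.72 + 0 - 100 + 10 = 82.72 \;\to\; 82 \\
  & \text{mRNA: } 82 + 25 = 107, \qquad \text{EST: } 82 - 25 = 57 \\[4pt]
\text{short alignment:}\quad
  & 25\ln(6) + \ln(1) - 10\cdot 0 + 10 \approx 44.79 + 0 - 0 + 10 = 54.79 \;\to\; 54 \\
  & 54 - (10 - 5)\cdot 25 = -71, \qquad \text{mRNA: } -71 + 25 = -46
\end{aligned}
```

So matched bases only help logarithmically, mismatches (via milliBad) hurt linearly, alignments of ten bases or fewer are penalized heavily, and an EST always scores 50 points below an otherwise identical mRNA.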
The example I like to use is that the Genome project is like the Periodic Table: it just gives you a framework to hang knowledge on.
Just as the Periodic table helped scientists to deduce the structure of electron orbitals (by observing the sequence of how chemical similarities went with atomic number) and find new elements ("There's a hole here, and what should fill that hole will have these properties. Now we know what we are looking for and where to look..."), the Genome project will allow us to better determine how genes are controlled, and look for new proteins.
www.eFax.com are spammers
While some people are discussing Folding@Home as a response to your question of a "seti-like" processing system, there is actually a much more relevant project, also hosted at Stanford. The Genome@Home Project is attempting "to design new genes that can form working proteins in the cell" from the DNA sequence of non-human organisms. It is a new project, but gaining speed quickly. It is worth taking a look at if you have spare cycles you can give to a good cause.
-OctaneZ
First of all, for those who aren't in the biotech industry, it should be mentioned that the NIH has an agenda to push just as much as private for-profit industry. Never believe that 'the good of the public' is the only thing driving non-profits, especially when a government (ANY government!) is involved.
Still, this issue isn't quite as cut and dried as many would like to believe. If it was, then everyone would gang up on one side, and the other side would wither and die. Consider some of the following points:
1) Celera's efforts most likely DID force the HGP to speed up.
2) Celera's "whole genome" approach appears to be a bust. Before they did it, we could only guess at how well (or poorly) it might work. In other words, we learned something valuable for future research from Celera!
3) There is a lot of grumbling about Science imposing a restrictive agreement on access to the Celera data. I agree--this isn't how science works! However, it's like book publishing. They "borrowed" publicly available information (preliminary work from the HGP), added their own stuff, and can impose whatever restrictions they want. Don't like it? Go to the HGP. They (Celera) are entirely within their rights, but I don't think that Science should have agreed to publish with those restrictions.
4) Here's a biggie. Science costs a LOT of money--the only two groups that can afford it are governments, and expensive biotech companies. The former can't afford to fund all science research, and the latter can't afford to not make a profit. Incidentally, biotech is an area where on the whole, the patent system works quite well.
At any rate, it's an impasse. Either you cut research by about 60%, or you deal with companies that need to make a profit on their research. Flip a coin and make your choice.
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
Here's the story, along with (sigh!) a (pretty cool) BEOWULF CLUSTER!
I hate to do it, but it's actually on topic. :-)
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
ten thousand lines of assembly code in a month, and why?
Just for clarity: it doesn't say the language is assembler, just that what the program does is assemble genome fragments...
from the unsung-hero dept.
Not really...
In fact, the Celera assembly is proof positive of the value of the HGP. When HGP started a decade ago, a dedicated scientist with years and years of training might be able to sequence a few tens of base pairs in a day, if he or she did nothing else. Five years later, after the public funded a huge improvement in the basic technology of sequencing, a barely competent technician can be expected to sequence thousands of bases a day without breaking a sweat.
Two years after that, private industry realized that it could make money exploiting that technology. All hail to Celera for doing a good job -- but if they had seen further, it would only be because they stood on the shoulders of public money.
"because of his concern that the genome would be locked up by commercial patents if an assembled sequence was not made publicly available for all scientists to work on."
So should genes be patented?
I believe this question has been at least partially answered by the Patent Office. You can patent a gene-based medicine or treatment if it is applicable to a particular illness, or disease, or gene-based disability. You cannot just patent genes willy nilly because you know they exist. The Patent Office and people in gene research from the NIH and Celera, the two main players in gene research, pretty much agree that it is beneficial to the public if gene-based medicines can be patented for specific treatments. A more detailed discussion on patenting is at:
http://www.ornl.gov/hgmis/elsi/patents.html
Cui peccare licet peccat minus. -- Ovid, Amores.
They wanted to make seeds that couldn't reproduce, ostensibly to control genetically modified plants and keep them from taking over.
Well you can't have it both ways can you? Either you want seeds that reproduce, in which case you'd be whining about cross-contamination with other crops, or you have seeds that don't reproduce, in which case you whine about "holding nations' food supplies hostage". Come on, which way do you want it?
Quite frankly there hasn't been a single conclusive study showing that there is any risk from GM crops. It's all just scare stories and pseudo-science.
Now that the entire genome is sequenced and work is underway on finding the individual genes and their functions, what advances are we going to see? Well plenty really, from screening and treatments for genetic illnesses, to modified organisms that are better and can survive in more extreme conditions. There's the potential to change almost everything as we begin to work out the sequences of more and more living beings.
But what concerns me is that the whole backlash against anything with the word "genetic" in it will slow or even stop the flow of scientific advancement. We've already seen how companies like Monsanto can have their research attacked, spoiled and subjected to the worst kind of slanderous publicity, and as we get the capacity to do more, these attacks will likely get worse, fueled by an ever more virulent group of protesters and environmentalists.
These people are true zealots who make RMS look like an apologist. They think nothing of resorting to intimidation, violence and criminal damage, whilst at the same time engaging in a war of words which admits no logic and no compromise. In some cases, the very lives of researchers who labour to increase our knowledge are at risk, and we cannot afford to let this happen, not with the problems of population growth looming large over humanity.
These people are dangerous, and their actions need to be curbed. No longer should they be able to get away with their lies and violent behaviour, no more than any common thug. They can claim moral superiority, but in truth it seems as though these people are as bigoted as any racist, and just as determined to further their cause.
We can't allow research to be thwarted because of the voices of a small bunch of extremists. That's not democracy at all.
How about trying to get an interview with this guy? Could be very interesting.
Co-founder and designer at Music Nearby: http://musicnearby.com
Jim Kent was once known in the mid-80s for writing Zoetrope, a 2D path-based animation system for the Atari ST, not unlike today's Flash technology. Zoetrope also became Aegis Animator on the Amiga, and Autodesk's Animator Pro for the PC, which begat the .FLI/.FLC animation format. I believe Kent also worked on the first DOS generations of Autodesk's 3D Studio, too.
Curator of the Jefferson Computer Museum http://www.threedee.com/jcm
I am more concerned with this kind of project being run by commercial enterprises. Just because you can make money out of it doesn't make it right.
What's going to happen is we have to go into the protein world to really understand where the genome is taking the next level of biology. That's ten times as complex at least.
What is also noted is that the combination of these protein interactions is staggeringly more complex. I can imagine that the system interactions may be a million times or more complex.
So in my mind, patenting a gene might wind up being similar to patenting the management system of a nuclear power plant, and thinking that therefore you understand nuclear physics.
"It is a greater offense to steal men's labor, than their clothes"
a) sequencing, that is -- getting the actual sequence. This is almost purely technical work, and definitely not very interesting for a scientist, although you can get a lot of credits for it.
b) annotating the sequence: finding out where the genes are, what the similarities are between them and between the genes known from other organisms, and what can be suggested about their function based on those similarities. This is pure bioinformatics stuff: first finding the "open reading frames" (ORFs), that is -- anything that can be a gene at all: it has to start with an "ATG" (codon for methionine) and stop with a so-called stop codon. This is only the most basic criterion.
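That most basic criterion can be sketched in a few lines of C. This is only an illustration under loud assumptions: it scans the forward strand of an uppercase sequence, uses the three stop codons of the standard genetic code (TAA, TAG, TGA), applies no minimum length, and the function names are hypothetical. Real gene finders do far more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int isStopCodon(const char *codon)
/* TAA, TAG and TGA are the stop codons of the standard genetic code. */
{
    return strncmp(codon, "TAA", 3) == 0 || strncmp(codon, "TAG", 3) == 0
        || strncmp(codon, "TGA", 3) == 0;
}

size_t findOrf(const char *dna, size_t from, size_t *retEnd)
/* Scan the forward strand from offset 'from' for the first ATG that is
 * followed, in the same reading frame, by a stop codon.  Returns the offset
 * of the ATG and stores the offset just past the stop codon in *retEnd.
 * Returns (size_t)-1 if there is no such open reading frame. */
{
    size_t len = strlen(dna), start, pos;

    assert(retEnd != NULL && from <= len);
    for (start = from; start + 3 <= len; ++start)
    {
        if (strncmp(dna + start, "ATG", 3) != 0)
            continue;
        /* Step codon by codon so that only in-frame stops count. */
        for (pos = start + 3; pos + 3 <= len; pos += 3)
        {
            if (isStopCodon(dna + pos))
            {
                *retEnd = pos + 3;
                return start;
            }
        }
    }
    return (size_t)-1;
}
```

On "CCATGAAATTTTAGGG" this reports an ORF starting at offset 2 and ending just past the TAG. Note that the "TGA" hidden inside "ATGA" is ignored because it is out of frame, which is the whole point of reading in steps of three.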
Whatever comes later is called "postgenomics", and it is probably the most exciting stuff in this whole area of research.
1) in most of the genome projects done so far, as much as half of the proposed genes had not even a rough function assigned to them (the group I'm working in sequenced a bacterial genome back in 1996, and since then the situation hasn't changed much). Experimental work and more biocomputing is needed to find out what those genes do. The problem with biocomputing isn't the lack of CPU, but the lack of good strategies / models / theory (or, not lack of "good", but lack of "better" strategies etc.).
2) knowing what a gene does is, contrary to common belief, only very little information. You need to know how it is regulated, and this means a lot of tedious and complicated experimental work: two whole areas of postgenomic science deal with that -- transcriptomics (regulation on the RNA level) and proteomics (on the protein level). You have to understand that each gene is regulated on many levels -- transcription of the gene from DNA to RNA, turnover (that is, the speed of degradation) of the mRNA, speed of translation, amino acid composition of the protein, protein turnover. Moreover, the genes are interconnected into networks rather than pathways. Creating a functioning model of a eukaryotic cell will probably be impossible during the next twenty or so years. That is -- among other things -- why my group works with a little bacterium, which has only about 700 genes. And even though it is a couple of orders of magnitude simpler than the simplest eukaryotic cell, it is very, very, very complicated.
Take-home lesson: don't be too enthusiastic. This is not the flight to the moon. This is only the first Sputnik.
Best regards,
January Weiner