Human Genome Sequencing Completed
Arthur Dent '99 writes "According to this article at Reuters, the last chromosome in the human genome has finally been sequenced, taking 150 British and American scientists 10 years to complete. The sequenced chromosome, Chromosome 1, is the largest chromosome, with nearly twice as many genes as the average chromosome, making up eight percent of the human genetic code. The Human Genome Project has published the sequence online in the journal Nature, according to the article. It contains 3,141 genes (over 1,000 of them newly discovered), and 4,500 new SNPs -- single nucleotide polymorphisms -- which are the variations in human DNA that make people unique."
They are all different sizes. Chromosomes are numbered from largest to smallest, 1 to 22, plus X and/or Y. (The exception is 21 and 22: 21 is actually the shortest and 22 is slightly bigger; early cytogenetics couldn't distinguish the sizes well enough, so those two were numbered in the wrong order.) So chr 1, being very large, has a very large number of genes just because it's huge. It isn't the most gene-dense, however; that is chromosome 19, which has more genes per Mb than anywhere else in the genome.
> Why does one chromosome have more genes than others?
Same reason some source code files contain more lines of code than others. They do different things.
From the fine article:
"The scientists also identified 4,500 new SNPs -- single nucleotide polymorphisms -- which are the variations in human DNA that make people unique."
There are other variations which make us unique.
Alternate alleles*
Indels (insertions/deletions)
Variable numbers of repeats.*
The genetic code uses 4 letters, but I'll use English for explanation.
A SNP is a single letter which has different values in different individuals: "The cat and the dog" vs "the hat and the dog".
An indel is where letters have been inserted into one sequence or deleted from another (without additional data, we can't distinguish these possibilities.)
"The cat and the dog" vs "the cat and the big dog".
In alternate alleles there are a bunch of changes which always stick together, e.g. we observe "the cat and the big dog" and "the cat and the small mouse", but never (or exceedingly rarely) "the cat and the big mouse" or "the cat and the small dog."
Variable repeats are a special case of indels, but common enough to warrant a category of their own. "The cat and and and the dog" vs "the cat and and and and and the dog".
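The English-sentence analogy above can be turned into a toy classifier. This is only a sketch of the idea: `classify_variant` and its labels are hypothetical names, it compares exactly two short strings, and it trims the shared prefix and suffix to see what was inserted or deleted. Real variant callers work on alignments of many reads, and alternate alleles (linked changes) need population data, so they are not covered here.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label the difference between two short sequences (toy model)."""
    if ref == alt:
        return "identical"
    if len(ref) == len(alt):
        # Same length: count positions where the letters differ.
        mismatches = sum(a != b for a, b in zip(ref, alt))
        return "SNP" if mismatches == 1 else "other"

    shorter, longer = sorted((ref, alt), key=len)
    # Length of the shared prefix.
    start = 0
    while start < len(shorter) and shorter[start] == longer[start]:
        start += 1
    # Length of the shared suffix (not overlapping the prefix).
    end = 0
    while end < len(shorter) - start and shorter[-1 - end] == longer[-1 - end]:
        end += 1
    if start + end < len(shorter):
        return "other"  # a substitution mixed with a length change

    # Text present only in the longer sequence.
    extra = longer[start:len(longer) - end]
    # If the same text sits immediately before it, call it a repeat change.
    if start >= len(extra) and longer[start - len(extra):start] == extra:
        return "repeat-number change"
    return "indel"


print(classify_variant("the cat and the dog", "the hat and the dog"))
print(classify_variant("the cat and the dog", "the cat and the big dog"))
print(classify_variant("the cat and and and the dog",
                       "the cat and and and and and the dog"))
```

As in the comment, the function cannot say whether an indel was an insertion in one sequence or a deletion in the other; it only sees that the lengths differ.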
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
Just to add on to this
20/20 vision means that when you stand away from something at 20ft, what you see is what the normal person would see at 20ft.
20/40 is, well, if you stand 20ft away, you see what a normal person would see at 40ft
Same goes for 20/10.
You seem to be under the impression that the number 1000 has some special meaning. Let's try your comment again, in octal:
pi * 1750 genes. Got to love those fun coincidences
Not so exciting now, is it? Nature is not decimal-based. The only reason we tend to be is because of the number of fingers we have.
Half answer:
The beginning is very likely a non-coding region, since stuff near the ends can get damaged more readily. The chromosome itself probably does not exactly start with GAT; it probably has a few thousand bases' worth of telomere, and this just happens to be the chunk that starts once they get past all that.
Everybody has different genes, but the difference between two individuals over the total range is measured in fractions of a percent. Big chunks of it are exactly the same from person to person.
The basic idea is this. Our cells need a program that tells them what to do. That's the genome. There are a total of 46 chromosomes, consisting of two sets of 23 independent chromosomes (1 - 22 and X or Y). DNA makes up the chromosomes. It's just a chemical structure that stores information; the four chemicals that make up DNA are Adenine (A), Thymine (T), Cytosine (C) and Guanine (G). Every DNA molecule is actually two strands of DNA that pair together, with A binding to T and C binding to G. Sequencing is a chemical reaction that will tell you what the sequence of these four nitrogenous bases is. For example, you may end up getting a read of AGTATTACGTATGCATAGGTCCGATG from a sequencing reaction (usually you'll get about 500 - 700 bases in one reaction). This tells you the sequence of ONE of the TWO strands of the DNA molecule. BUT since they pair in a predictable way, you know the sequence of the opposite strand (A-T and C-G). Our genomes are composed of approximately 3.2 billion total As, Cs, Ts and Gs. The goal of the genome project was just to tell us what the sequence of those bases is. That's it. Finding genes and things of that nature are really things that come about from having the primary sequence to reference. If you want to find a mutation, you have to know what the sequence is SUPPOSED to be and WHERE IT IS before you can say it is different. That's your quick answer: the genome project sought to determine (1) what the sequence of bases in human chromosomes was and (2) the physical position of these sequences within the chromosomes. They did some other interesting things to prepare for it along the way, but that is a separate matter.
That doesn't mean that the traits will always stay linked. They probably result from residing on the same chromosome. Such things often can be separated over time with a lucky chromosome aberration.
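The pairing rule described above (A with T, C with G) is enough to derive the opposite strand from a single read. A minimal sketch, assuming the read contains only the letters A, C, G and T; the function names are made up for illustration. Because the two strands run in opposite directions, the partner strand is conventionally written as the reverse complement.

```python
# Translation table implementing the pairing rule A<->T, C<->G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(read: str) -> str:
    """Base-by-base partner of each letter, in the same left-to-right order."""
    return read.upper().translate(PAIRING)


def reverse_complement(read: str) -> str:
    """Opposite strand written in its own reading direction."""
    return complement(read)[::-1]


read = "AGTATTACGTATGCATAGGTCCGATG"  # the example read from the comment
print(complement(read))
print(reverse_complement(read))
```

Applying `reverse_complement` twice gives back the original read, which is a quick sanity check on the pairing table.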
I think what you're referring to is Serpentor, The Emperor, who was made *by* Cobra Commander from the DNA of the world's most evil people.
http://en.wikipedia.org/wiki/Serpentor
The answer, for bacteria and yeast: genes are used to make protein. They start with a 3-base sequence that signals "start making protein," have some sequences that tell the cell which amino acids to put together to make the protein, and end with a 3-base sequence that signals "stop making protein." 3-base "stop" sequences occur pretty frequently in the genome (just by random chance), so if you find a long sequence that doesn't have a "stop" sequence, you can be pretty sure it's a gene. For more complicated organisms (like humans), it's much more difficult to tell what's a gene and what isn't without figuring out what a gene does (because lots of human genes have what look like "stop" sequences in the middle of the gene), but there are computer models of how genes appear different from "junk" DNA that can be used to predict what is and isn't a gene. These aren't perfect, but they're fairly accurate.
Completing the sequence and actually putting it together are two entirely different affairs. Small sequences called ESTs (Expressed Sequence Tags) were obtained during this effort. The big task after that was to put everything together AND in order. Think of it as a massive puzzle. Even the genome has different "builds" depending on the level of completeness of this work.
When we say that "the gene for xxxx is located at yyyy":
This means that we *do* know where the particular controlling sequence is located?
Viral gene therapy is a process that can locate the target gene somehow and replace the sequence there with a new sequence?
Does the sequence have to be broken, segmented, and re-built for viral gene therapy? Or is there a "merge" type of operation that "overlays" the new information?
I have read a great deal that, in a hand-waving manner, describes viral gene therapy as the next great thing, either directly or by implication. Is that so? Is there anything else like that, in terms of technology, that is currently looking promising?
I've fallen off your lawn, and I can't get up.
In addition, we have other effects. For example, there is a varying stability between GC and AT pairs, which gives a tendency to a biased ratio in "junk". This stability issue will naturally also possibly give a contribution to the coding sequence, but there, the selection towards specific function will often dominate. This means that you'll, generally, see a difference in GC/AT ratio between coding and non-coding.
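The GC/AT ratio mentioned above is easy to measure. The sketch below computes the GC fraction of a sequence and of fixed-size windows along it; the function names, window size and step are illustrative assumptions, and a shift in GC fraction is at best a weak hint of coding sequence, not a gene finder.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC fraction of each full window, as (start position, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


for start, gc in windowed_gc("GGGCGCATATATTAGCGCGC", window=5, step=5):
    print(start, gc)
```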
(Pseudo-genes, that is, simplified, genes that won't be transcribed anymore because they're slightly broken, are of course often quite hard to discriminate from real genes, how hard depends on the mechanism by which they were created.)
In addition to what the other posters said, the chromosomes are numbered by their size up to chromosome 22, then the 23rd pair is the X and/or Y chromosomes. Since this is chromosome 1 we're talking about, it's the largest one.
Determining what is (and is not) a gene is hard work. We know a number of rules (such as the aforementioned "it must start with ATG"), but these rules are largely of the form "if X is a gene, then X has the following properties." These implications cannot be simply reversed; i.e., not all instances of ATG mark the start of a gene.
In simpler organisms, you can simply scan for open reading frames (i.e., start at an instance of ATG and keep reading until you hit a stop codon), because there is no post-processing of the transcribed RNA. If the result has a reasonable length, you've probably found a gene.
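The scan described above can be sketched in a few lines. This is a simplified illustration, not a real gene finder: it reads only the given strand (not its reverse complement), uses the three standard stop codons, and the `min_codons` cutoff standing in for "a reasonable length" is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (begin, end, sequence) for each ATG...stop stretch found.

    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # three possible reading frames on this strand
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None  # index of the ATG that opened the current frame
        for idx, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = idx
            elif start is not None and codon in STOP_CODONS:
                if idx - start >= min_codons:
                    begin = frame + 3 * start
                    end = frame + 3 * (idx + 1)  # include the stop codon
                    orfs.append((begin, end, seq[begin:end]))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)


for begin, end, orf in find_orfs("CCATGAAATTTGGGTAACC"):
    print(begin, end, orf)
```

In an organism with splicing this approach breaks down for the reason the next comment gives: a stop codon inside an intron would end the scan early even though it is removed before translation.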
In complex organisms, once the RNA is transcribed, portions of the RNA are removed (spliced out). Thus, there could be a stop codon in the middle of the gene that is removed prior to translation. The splicing process is why certain repeated sequences are needed as filler material. During splicing the RNA strand has to be bent to bring the two ends together. In, for example, the cystic fibrosis gene (CFTR) if you don't have enough repeats, it is less likely that splicing will occur properly.
This is by my count the fourth time that the human genome has been announced "finished" - any more times and they will all be invited to become slashdot editors.
One point is that there's very little variation between individuals in terms of coding sequence: in this chromosome from the article there is, on average, only just over one known single-base change per gene. The most common type of variation is in the number of times repeated stretches of DNA are repeated; this generally (though not always) has no effect on an individual. The repeat counts that appear in the draft or published sequence are not in themselves meaningful.
Databases of variation in the human genome are maintained. The paper accompanying the release of the finished sequence does discuss variation, and notes that in some areas of the chromosome people have different numbers of copies of a large region which includes genes.
Nature has made the full text of the article announcing the completion of the chromosome 1 finished sequence available online. While this is good, it's still not the open publishing which ought to be demanded by those spending public money on scientific endeavours such as this.
First of all, you can't at the moment crack the protein folding problem by throwing more computational power at it. We still lack a lot of insight into many of the fundamental forces governing protein folding. Electrostatics at that level are a nasty thing, for example. The scale of the system would require a quantum mechanical treatment, but then again the total systems are too large, not to speak of problems with the basis sets and parametrization for a QM treatment.
Oh, and the protein folding and design problem has been shown to be NP-complete.
Secondly, not all proteins even have defined structures. The class of so-called natively disordered proteins is large and might even comprise about 30% of the whole proteome. Those proteins only adopt structure in interaction with other proteins or other factors.
Third, in many cases the structure doesn't help you very much. True, a cancer causing mutation might have a clear structural effect. In other cases, it could perhaps just subtly alter electrostatics on a protein surface, causing a slight difference in its interaction with another protein, which finally gets amplified way downstream a regulatory cascade where it causes the final problem. Knowledge of protein structures is useful to clarify that, but you need to know the whole interaction network to fully understand it.
Fourth, the cell is crowded. Knowing the structure of an isolated protein in solution does not tell you all about its function in the cell, where it is in contact with lots of other proteins.
Fifth, not only structure determines the function of a protein, but also its dynamics. Proteins move, and these movements are intricately linked to their function.
Sixth, and finally, if you want to use the structure of a protein as a basis for rational drug design, you also have to solve the design problem. How exactly do you build a compound with the desired properties? There is no completely rational approach implemented at the moment; much is just done by large-scale trial-and-error approaches.
This comment does not exist.
Personally, I'd have found it more interesting if there had been 1618 genes, with phi turning up all over the place in nature and all...
"Each draft sequence has been checked at least four to five times to increase 'depth of coverage' or accuracy. About 47% of the draft were high-quality sequences. The final version will have been checked eight to nine times giving an error rate of 1 in 10,000 bases."
Which means that there will be an estimated 300,000 errors in the project.
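The estimate is plain division. A quick check, assuming a genome of roughly 3 billion bases (another comment in this thread puts it at about 3.2 billion) and the quoted error rate of 1 in 10,000; the function name is made up for illustration.

```python
def expected_errors(genome_bases: int, bases_per_error: int) -> int:
    """Expected number of wrong bases at a rate of 1 error per bases_per_error."""
    return genome_bases // bases_per_error


print(expected_errors(3_000_000_000, 10_000))  # round 3 billion bases
print(expected_errors(3_200_000_000, 10_000))  # the ~3.2 billion figure
```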