Gene Mappers May Have Missed Half The Genes
Nepre writes: "Forbes.com is running a story about new research that suggests that the Human Genome Project may have missed tens of thousands of genes in the race to map the human genome. This is interesting given the intense competition between commercial and academic research. As my grandmother used to say, "The faster you go, the behinder you get!""
IAB (I'm a bioinformaticist). You're partly correct. Introns (the 'junk' in between the exonic regions in DNA and freshly transcribed mRNA) do tend towards non-random sequence. You can use a variety of metrics to make guesses as to where introns and exons begin and end within a gene's coding region, based on sequence entropy, on GC/AT frequency, on neural nets or hidden Markov models trained on known examples, etc.
These metrics, however, are only useful once one knows something about where a 'gene' starts and ends. The real problem here is that some of the assumptions we've made historically about gene structure have potentially led us astray. Yes, the chromosomes are full of junk DNA, but no, it's nowhere near random for the most part, and it is full of 'repetitive' elements (short segments that repeat endlessly; query GenBank for 'ALU Repeat' and see how many sequences you find) that make any sort of pattern matching a tough sell genome-wide. There are also plenty of 'pseudogenes' interspersed throughout the genome, leftovers from a bygone era. Which of these pseudogenes might actually still BE transcribed is a question that only mRNA expression analysis can answer. Hopkins is definitely on the right track w/ something like SAGE (though it's not exactly high-throughput, hence our man's need for extrapolation to genome-wide numbers).
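To make two of the metrics mentioned above concrete, here is a minimal Python sketch of GC fraction and per-base Shannon entropy computed over sliding windows. The function names, the window sizes and the toy sequence are all made up for illustration; real gene finders combine many more signals than this.

```python
import math
from collections import Counter

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)

def shannon_entropy(seq):
    """Shannon entropy of the base composition, in bits per base."""
    seq = seq.upper()
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def sliding_windows(seq, size, step):
    """Yield (start, gc_fraction, entropy) for each full window of the sequence."""
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        yield start, gc_fraction(window), shannon_entropy(window)

if __name__ == "__main__":
    # Hypothetical toy sequence: an AT-rich repeat followed by a GC-rich stretch.
    toy = "ATATATATATATGCGCGGCCGCGCATGCATGACTGA"
    for start, gc, entropy in sliding_windows(toy, size=12, step=6):
        print(start, round(gc, 2), round(entropy, 2))
```

A low-entropy window (such as a simple repeat) or a sharp shift in GC fraction is the kind of signal these metrics pick up. As noted in this comment, they only become useful once you already have some idea where a gene starts and ends.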
The paper should be an interesting read to say the least.
-j
I get the impression that this guy predicted there would be around 80,000 genes, and after the mappers showed there were many fewer, he decided to say that the mappers had their definition of gene wrong, or that they missed the genes. He's not exactly clear on that, except to reiterate that based on indirect evidence, he believes there are 80,000 genes, if you define them properly.
The article says the 30,000 figure is close to a worm or a fruit fly. There's a better list here, which lists the gene counts for humans, mice, worms, etc. With their disclaimer that these counts are not yet complete, it seems that humans have 46,000 genes in this database, compared with 22,300 for a worm, 24,900 for a fruit fly, and 39,156 for a mouse. Exactly why this should be so unreasonable is beyond me. Maybe if we define the gene so that humans have 80,000, then fruit flies will have 60,000! After all, how can you draw a comparison if you pretend not to know the definition, eh?
Of course they did.
Gene prediction applications use many statistical and computational methods to predict where a gene might hypothetically be hiding.
Information content and entropy measurements are widely used in many of them.
For example:
The basic tool for comparing how much one sequence resembles another (by doing sequence alignment) uses substitution matrices: matrices that give a score to every possible substitution of one unit (a nucleotide for DNA/RNA, or an amino acid in the case of proteins) according to the likelihood that the substitution is meaningful rather than pure chance. The most commonly used matrix (actually it's a whole family or series of them) is called BLOSUM, and information content is used in one of the stages of its construction.
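As a rough illustration of the idea, the sketch below shows a log-odds score (how much more often a pair is observed than chance would predict) and how a substitution matrix scores an ungapped alignment. The frequencies, the +1/-1 toy DNA matrix and the function names are hypothetical, not real BLOSUM values. The scale factor of 2 (half-bit units) is an assumption modelled on how BLOSUM62 is usually described.

```python
import math

def log_odds_score(p_pair, q_a, q_b, scale=2.0):
    """Rounded, scaled log-odds: scale * log2(observed pair frequency / chance frequency)."""
    return round(scale * math.log2(p_pair / (q_a * q_b)))

# Hypothetical toy matrix for DNA: +1 for a match, -1 for any mismatch.
TOY_MATRIX = {(x, y): (1 if x == y else -1) for x in "ACGT" for y in "ACGT"}

def score_alignment(a, b, matrix):
    """Sum the matrix scores over the aligned positions of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))

if __name__ == "__main__":
    # A pair seen four times more often than chance scores 2 * log2(4) = 4.
    print(log_odds_score(p_pair=0.25, q_a=0.25, q_b=0.25))
    print(score_alignment("ACGT", "ACGA", TOY_MATRIX))
```

A pair observed exactly as often as chance predicts scores zero. Positive scores mark substitutions that are seen more often than chance in trusted alignments, and negative scores mark those seen less often.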
However, this is not the news here.
What the researchers did was combine the predictions from the DNA sequence itself with results from expression data, meaning the results of experiments that measure whether a subsequence of DNA is transcribed to RNA.
An earlier comment hit the nail on the head. I'm quite sure Mr. Shoemaker sold 80,000 genes to a biotech/pharmaceutical company, and now has to explain why he doesn't owe them half their money back (what a funny conversation that would be to listen to).
What many people who are trying to sell genomics don't want to address is that the differences between a mouse and a human are likely not the result of there being more genes in humans, but rather a difference in regulation of (approximately) the same number of genes. That is to say, there are likely differences in the promoter (on switch) and repressor (off switch) portions of these genes that cause one to be active in a certain situation in the human, but not in the mouse. A simple analogy demonstrates the difference: you can have two similar cars with similar horsepower, number of tires, gears, etc., but if you put an old grandmother in one and a Formula 1 driver in the other, and watched them drive on the highway, you might make the mistake of thinking one car had more power, a larger engine (genes), than the other, when in fact the difference between the two is due to control of the same equipment (gas = promoter and brake = repressor elements of genes). Further analysis of the control regions of genes, as well as differences in protein-protein interactions (proteomics), will likely explain the differences between a human and a mouse, not 50,000 as yet undiscovered genes.
"If we knew what we were doing, it wouldn't be called research, now would it?" -Albert Einstein-
The question of how to define a gene is not that trivial.
For example, in the definition you quote:
" No definition for a gene? "A unit of heredity. The unit of genetic function which carries the information for a single polypeptide." "
The two sentences are not equivalent definitions.
There are RNA sequences that are units of heredity and play an important role in defining our genetic makeup, but are not translated into a polypeptide. Does the DNA which serves as their template count as a gene? It depends on the exact definition.
It is now generally accepted that those sequences, including transcription factors, do count, but there are other difficulties in counting.
For example the issue of alternative splicing.
When you look at many of the human proteins, you will find that the DNA that codes for the protein is divided into more than one segment. It can be divided into many smaller segments (exons), separated by untranslated segments (introns).
Those segments are cut and edited to result in a protein, a process called splicing.
Now, to complicate things, different proteins often result from the same or almost the same set of segments through different editing. There is a question of how to count those variants.
One of the reasons that the estimated number of genes was probably lower than the actual number is this phenomenon of alternative splicing, which is much more frequent than first assumed.
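To see why alternative splicing makes counting hard, here is a small Python sketch that enumerates the transcripts one gene could produce if some of its exons may be skipped. The exon names and sequences are invented. The model covers only whole-exon skipping, which is a simplifying assumption: real splicing also includes other mechanisms, such as alternative splice sites and intron retention.

```python
from itertools import product

def splice_variants(exons, optional):
    """Return every transcript obtained by keeping or skipping each optional exon.

    exons: list of (name, sequence) tuples in genomic order.
    optional: set of exon names that may be skipped; all other exons are always kept.
    """
    optional_names = [name for name, _ in exons if name in optional]
    variants = []
    for choices in product([True, False], repeat=len(optional_names)):
        keep = dict(zip(optional_names, choices))
        kept = [(name, seq) for name, seq in exons if keep.get(name, True)]
        label = "-".join(name for name, _ in kept)
        transcript = "".join(seq for _, seq in kept)
        variants.append((label, transcript))
    return variants

if __name__ == "__main__":
    # Hypothetical three-exon gene in which the middle exon can be skipped.
    gene = [("E1", "ATG"), ("E2", "GGA"), ("E3", "TAA")]
    for label, transcript in splice_variants(gene, optional={"E2"}):
        print(label, transcript)
```

With k skippable exons, this one stretch of DNA yields 2^k transcripts in the toy model. Whether that counts as one gene or many is exactly the counting question raised above.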