International Challenge To Computationally Interpret Protein Function
Shipud writes "We live in the post-genomic era, when DNA sequence data is growing exponentially. However, for most of the genes that we identify, we have no idea of their biological functions. They are like words in a foreign language, waiting to be deciphered. The Critical Assessment of Function Annotation, or CAFA, is a new experiment to assess the performance of the multitude of computational methods developed by research groups worldwide and help channel the flood of data from genome research to deduce the function of proteins. Thirty research groups participated in the first CAFA, presenting a total of 54 algorithms. The researchers participated in blind-test experiments in which they predicted the function of protein sequences for which the functions are already known but haven't yet been made publicly available. Independent assessors then judged their performance. The challenge organizers explain that: 'The accurate annotation of protein function is key to understanding life at the molecular level and has great biochemical and pharmaceutical implications; however, with its inherent difficulty and expense, experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available. The computational annotation of protein function has therefore emerged as a problem at the forefront of computational and molecular biology.'"
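The summary does not say how the assessors scored the predictions. As a rough illustration only, one common way to judge a blind test like this is to compare the set of function labels a method predicts for a protein against the set that was later confirmed experimentally, using precision, recall, and their harmonic mean. The function name and the example labels below are hypothetical and are not taken from CAFA itself.

```python
def precision_recall_f1(predicted, actual):
    """Score one protein's predicted function labels against the known ones.

    Both arguments are iterables of label strings. This is a generic
    set-overlap score, assumed here for illustration only.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels for a single protein.
    guessed = {"kinase activity", "ATP binding", "DNA repair"}
    confirmed = {"ATP binding", "DNA repair", "nucleus", "zinc binding"}
    print(precision_recall_f1(guessed, confirmed))
```

A method that predicts every label for every protein gets perfect recall and terrible precision, which is why a combined score is usually reported rather than either number alone.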
Actually, I don't think the parent topic is off topic. When we do in fact decipher a gene's function, it doesn't necessarily mean we will get the more subtle nuances of how it functions as part of the whole organism. In other words, we could read specific functionality literally but misinterpret the functionality of the whole.
Without a good plan, we'll be at it for decades. Here's what I think genomic researchers should do.
Genes (and proteins) are obviously organized hierarchically. Which means there must be a control hierarchy in there somewhere. To unravel and properly classify the genome, researchers must first identify and understand the hierarchical control system. Only then can they begin to populate the branches with the correct genes.
After the tree is completely built and all the genes have found their correct locations on the tree, then it's a matter of going through the tree from the top down and switching the branches of the tree off/on one at a time to see what happens. It's hard but it can be done.
Unfortunately there doesn't have to be "a" control hierarchy: each subsystem can have its own hierarchy (or none) with its own unique control mechanisms. The subsystems don't have to operate by the same rules, and they can mess with each other by lots of different ad hoc means. And that's just the genes: the proteins are much harder to model, at least as far as useful predictions go.
It's been ad hoc with no code review for over 3 billion years.
Stunning. Absolutely astounding. Yet another AC has taken Science by the balls and shaken the Universe to its core. Dizzying intellect, artistic prose. He's probably six feet tall, blonde and with the chiseled features of a Grecian statue.
Oh. Wait.
Faster! Faster! Faster would be better!
This is the dumbest thing, not related to football, that I have read all day. "Obviously" hierarchical? That's utterly idiotic. And I mean utterly, betraying a complete lack of any experience with metabolic processes. Many, perhaps even most, proteins do many things in many circumstances, and have dynamic equilibria within more than one metabolic chain, as do many of the small molecules which are produced.
Fugue for Aaron Swartz
"It's been ad hoc with no code review for over 3 billion years." This again, is immensely stupid. First, natural selection is constantly weeding out undesirable variations, and second the genome is highly tectonic, constantly removing or altering pathways. It's not teleological, but DNA is the coding mechanism precisely because it is not a passive storage medium.
Don't discount that as stupid. Most of what he said is true. Evolution makes you write code that works, not good or clean code, just code that works. The only time evolution comes into play is when the code can't even compile.
http://tinyurl.com/42geekcode
Don't discount that as stupid. Most of what he said is true. Evolution makes you write code that works, not good or clean code, just code that works. The only time evolution comes into play is when the code can't even compile.
Indeed there's even some selective pressure for code obfuscation. Viruses take advantage of compression, for example. New functions usually evolve from faulty events in old genes. There's no pressure to remove accidental calls to the wrong subroutine if they don't matter, hence a lot of messages go to the wrong place as well as the right place. You see this even in higher animals: a dog's leg that scratches when you scratch the dog's ribs is probably some back propagation on the nerve network that did not need to be removed for the dog to operate properly.
Some drink at the fountain of knowledge. Others just gargle.
I am not a biologist so forgive me my ignorance but when people say that DNA is the blueprint for an organism I never understand how a bunch of proteins can determine an organism's shape and behavior. Aren't there more factors that determine those things, like the surroundings in which the DNA is used, like chemicals that the growing organism is surrounded with, temperature, etc?
-- Cheers!
I wouldn't rule it out. My understanding was that Folding@home used brute force: taking these sequences, testing all possible conformations, and seeing which one had the lowest energy. Settling into the lowest-energy conformation is still what actual proteins do when they fold up, so it's not like the approach doesn't make sense.
It's possible that some protein out there will cure a lot of cancers. It could be in a platypus, or in some fungus in a desert, some coral, or some other exotic species. We're never going to test all proteins in very many species to see if they're useful. However, we've already sequenced a lot of genomes, and will sequence a lot more. We thus have a lot of protein sequences. We're never going to purify most of them and determine their structures that way. Computing the structures and using them to identify proteins that may be useful, on the other hand, is within reason. It will take a lot of computing power though. So there will probably be a use for something like Folding@home.
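To make the "test all possible conformations and keep the lowest energy" idea concrete, here is a minimal sketch using the toy 2D "HP" lattice model, where each residue is either hydrophobic (H) or polar (P) and the energy drops by one for every pair of H residues that sit next to each other on the grid without being neighbours in the chain. This is a teaching model assumed for illustration, not how any real folding project computes structures; as far as I know, real projects sample conformations with molecular dynamics rather than enumerating them. All names below are hypothetical.

```python
from itertools import product

# Unit steps on a square lattice.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coords(moves):
    """Turn a string of moves into lattice coordinates.

    Returns None if the chain would occupy the same site twice.
    """
    pos = (0, 0)
    coords = [pos]
    for move in moves:
        dx, dy = MOVES[move]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in coords:
            return None
        coords.append(pos)
    return coords


def energy(sequence, coords):
    """Count -1 for each H-H lattice contact between non-consecutive residues."""
    site_to_index = {site: i for i, site in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = site_to_index.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                total -= 1
    return total


def brute_force_fold(sequence):
    """Try every move string and return (lowest energy, one move string achieving it)."""
    best_energy, best_moves = None, None
    for moves in product("UDLR", repeat=len(sequence) - 1):
        coords = fold_coords(moves)
        if coords is None:
            continue  # chain overlaps itself, not a valid conformation
        e = energy(sequence, coords)
        if best_energy is None or e < best_energy:
            best_energy, best_moves = e, "".join(moves)
    return best_energy, best_moves


if __name__ == "__main__":
    print(brute_force_fold("HPPHPPH"))
```

The loop visits 4^(n-1) move strings for a chain of n residues, so even this toy model becomes impractical after a few dozen residues. That is the "lot of computing power" problem in miniature.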
1. We have accurate mapping of the genes.
We have a pretty good idea on this one. Specific polymerases have specific sequences which they respond to, defining the start sequences of genes. It is possible we have missed some polymerase, but the likelihood is low given the extensive searches which have been done for them. As well, regions which are genes have a distinctively different character than regions which are not genes (at least in the general sense).
2. We can predict the protein sequence from the sequence of the gene.
We also have a pretty good idea about this, due to decades and decades of biologists trying to figure out the answer to this problem. The genetic code turns out to differ in some organisms from what we think of as the default. Sometimes multiple amino acids are coded for by the same sequence of bases, and so multiple proteins are produced from the identical coding region of DNA. Sometimes proteins are produced with modified amino acids, which are not explicitly coded for in the DNA of the gene, but rather by the activity of other proteins defined elsewhere by DNA. (This is a stochastic process and interference in the distribution of outcomes can sometimes result in pathological consequences.) In some organisms, the DNA is decompressed into RNA which is then translated into protein in a more typical way. (Extra bases are incorporated into the RNA in a repeatable way, so the protein gains amino acids that were not defined in the DNA sequence of the gene.) There's a whole bunch of stuff on alternative splicing, which we explicitly know that we don't know how to predict, that produces variations in protein sequence from a single gene sequence.
3. One protein can not be the product of two genes.
There are plenty of ways in which two separate genes can produce an identical protein. This actually happens ALL THE TIME in mammals, since we have two copies of every gene and most of these pairs have identical sequence. Even if the genes produce the identical protein through different mechanisms, if the protein is identical... then the protein is identical.
4. We have a good understanding of what the functions of the proteins in the training set are.
We do have a good idea of what the functions of the proteins in the training set are. See all of molecular biology for your citations.
5. If two proteins have similar sequence, they must have similar functions.
This is explicitly known to be false and is not expected under the evolutionary model. Look up the category of proteins known as 'crystallins' for a specific case counter to your assumption.
6. One protein has one function.
It is generally thought that there is a primary function for every protein. All things in biology are fuzzy, such that every protein probably has secondary side reactions or functions which may or may not be biologically relevant. (Arsenic is poisonous to us because our enzymes have a hard time distinguishing it from phosphorus, so the enzymes which incorporate phosphorus also 'function' to incorporate arsenic.)
7. A protein has a function.
Any protein synthesized by a cell costs energy. Under the evolutionary model of biology, proteins which don't have a function should have been discarded because their synthesis was wasting energy. That said, lots and lots of proteins are continuously created and then rapidly degraded because they were improperly folded or had other problems which brought them to the attention of intracellular systems with the 'function' of degrading such errant protein and returning their components to the cell for more productive use. Some genetic diseases are the consequence of the buildup of proteins which are otherwise non-symptomatic, but don't get degraded properly by the degradation systems.
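The remark under point 2, that the genetic code differs in some organisms from the default, can be shown in a few lines. The sketch below builds the standard codon table and a variant with the vertebrate mitochondrial reassignments (TGA reads as tryptophan, ATA as methionine, AGA and AGG as stops), then translates the same DNA under both. It deliberately ignores splicing, RNA editing, and modified amino acids, which is exactly the part the comment says we cannot predict well. The function and variable names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), STANDARD_AMINO_ACIDS)
}

# Vertebrate mitochondrial code: four codons are read differently.
VERTEBRATE_MITO_CODE = dict(STANDARD_CODE)
VERTEBRATE_MITO_CODE.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, code=STANDARD_CODE):
    """Translate a coding DNA sequence codon by codon until the first stop.

    Assumes the sequence starts in frame; trailing bases that do not
    fill a whole codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = code[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGTGATAA"
    print(translate(gene))                        # standard code
    print(translate(gene, VERTEBRATE_MITO_CODE))  # mitochondrial code
```

The same nine bases give a one-residue product under the standard table and a two-residue product under the mitochondrial one, so "predict the protein from the gene" already depends on knowing which table the organism or organelle uses.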
You justify an assumption with an assumption. The core promoter sequences are so degenerate that they can be found pretty much anywhere. This has led to misannotation of long genes as multiple single genes. There are a number of other causes of annotation errors.
There are also numerous examples of manually curated entries that are wrong because people studied non-existent proteins as a result of cloning artifacts or ignoring nonsense-mediated decay (NMD). Here is one example where transcripts containing unspliced introns, which are eliminated by NMD, have been studied and ascribed a function: Zhu J, Chen X. MCG10, a novel p53 target gene that encodes a KH domain RNA-binding protein, is capable of inducing apoptosis and cell cycle arrest in G(2)-M. Mol Cell Biol. 2000 Aug;20(15):5602-18. (accessions AF257770, AF257771)
Those long single genes which are sometimes misannotated as a series of smaller genes... are sometimes transcribed as a long single gene and sometimes as a series of smaller genes. You've primarily pointed out that biology is hard and that most published papers are full of crap.
Your pretty good idea is applicable to about 60% of the long reading frames and even less applicable to short ORFs: Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011 Nov 11;147(4):789-802. Mind you, this does not include processes like RNA editing that can further complicate how we predict protein sequence based on gene sequence.
This counter-argument doesn't counter my argument.
I wasn't commenting on ploidy. I had in mind things like trans-splicing, where you assemble mature RNA from transcripts that belong to different genes sometimes located on different chromosomes, or the way protozoan genomes are rearranged prior to expression in the macronucleus.
I wasn't commenting on ploidy either. Protozoans do things in all sorts of ways, most of which we have no idea about... and don't care about for the most part. The knowledge we have about the systems we have applies best to the systems we have studied.
See the MCG10 example above. Even for well-studied proteins like p53 (there are over 65,000 publications out there on p53) we keep finding new functions (p53 controlling energy metabolism, for example).
You're referring to the network of downstream effects which are influenced by p53. That is an order of magnitude or more beyond the difficulty of identifying protein function.
Yet sequence homology is at the base of all algorithms for predicting protein function. I know it is the best tool we have (I use it on a daily basis), but still this limits its applicability to generating a testable hypothesis. Which brings us back to my point that we have to experimentally validate all these computational predictions.
...which isn't a point I ever disagreed with. We can most easily predict functions which are the result of sequences we have seen before... but we don't assume that similar sequences will result in similar functions.
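For readers wondering what "predicting function from sequences we have seen before" looks like mechanically, here is a deliberately crude sketch of annotation transfer. It scores similarity by the overlap of short subsequences (k-mers) instead of a real alignment, which is an assumption made to keep the example short; practical pipelines use alignment tools such as BLAST. The sequences, labels, threshold, and function names are all hypothetical.

```python
def kmers(sequence, k=3):
    """Return the set of all length-k substrings of a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_annotation(query, annotated, k=3, threshold=0.3):
    """Copy the label of the most similar annotated sequence.

    `annotated` maps sequence -> function label. Returns (label, score),
    with label None when nothing is similar enough to justify a guess.
    """
    query_kmers = kmers(query, k)
    best_label, best_score = None, 0.0
    for sequence, label in annotated.items():
        score = jaccard(query_kmers, kmers(sequence, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    # Made-up sequences and labels, for illustration only.
    known = {
        "MKTAYIAKQRQISFVKSHFSRQ": "hypothetical kinase",
        "GSSGSSGSSGSSGSSG": "hypothetical linker",
    }
    print(transfer_annotation("MKTAYIAKQRQISFVKSHFSRA", known))
    print(transfer_annotation("WWWWWWWW", known))
```

Whatever label comes back is only a hypothesis. As both posters agree, similar sequence does not guarantee similar function, so the output still has to be checked at the bench.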
Again, you are supporting one assumption with another. Here are a couple of examples that fly in the face of it: VPS39/Vam6/TLP is involved in lysosome fusion, but it also regulates TGF-beta signaling. Disruption of either of these functions is lethal for the organism. So which one is the primary? FAM48A in the nucleus is part of a chromatin remodeling complex that controls transcription. In the cytopla