A Genome Mark-up Language
There's an interesting story running about the need/development of genetic mark-up language. It's called GEML - Gene Expression Mark-up Language and is basically a DTD [?] . Obviously, with working with things like genes, GEML is useful - and a good example of why DTD is muy bein.
"there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - its the same in this case."
Apples and Oranges.
HTML is controlled by the w3c--a standards body more or less independent of any particular company. Sure, M$ and Netscape had a lot of pull on HTML, but they *should* have, given that they *were* the browser market for a long time.
In this case, we have a particular bioinformatics company graciously offering up their own "public domain" DTD as a standard for the rest of the industry (how generous). And a major scientific journal latching on to it. The only problem is, that same bioinformatics company must approve any and all changes to the "standard"! It would be the same if HTML were a copyrighted property of Netscape, Inc.
It would be nice if the bioinformatics community could organize and form it's own XML standards body, a la the w3c. An agreed-upon standard is almost always better than a legislated standard.
Let's try not to let fact interfere with our speculation here, OK?
The bioxml project has been trying to do this very thing for quite a while now. Previous to that, there was the biomolecular sequence markup language (BSML), and I don't think it ever came close to becoming a standard. The problem that these efforts always run into is the sheer diversity of opinion on how biological data should be represented. Molecular biologists and computational biologists can't even agree on the basic things, like how to represent sequence regions, let alone more complex issues, like annotation syntax.
Why Nature chose GEML as a standard is unclear--the article doesn't present a compelling argument for it over the alternatives, and the choice seems a little arbitrary. It'll be interesting to see what impact this has on the other projects, and how open the standard will be to extension and modification.
Let's try not to let fact interfere with our speculation here, OK?
From the GEML terms of use:
...
The GEML Format is a free, public-domain, open standard created and licensed by Rosetta Inpharmatics, Inc. ("Rosetta") in order to define a single, distinct format for handling gene expression data and avoid proliferation of incompatible variations.
You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML Format or documentation without written permission from Rosetta.
So nobody can fork the standard without first consulting with Rosetta Inpharmatics. Wonderful. I just love their definition of "open standard."
This looks like another corporate-buddy move by a major scientific journal, much like the Science/Celera deal a few weeks back...
Go see bioxml for a truly open alternative.
Let's try not to let fact interfere with our speculation here, OK?
While all of this is fairly unreadable -- even by geneticists -- it is easily read by a computer
GEML? Hard to read? Bah! What we should *REALLY* do is figure out a quadrary (you know, after binary and trinary) encoding scheme for all the other info and just pre-pend it to the beginning of the amino acid sequence. Maybe even insert it in some points, with some sort of delimiting sequcne, of course. None of this wimpy markup language stuff.
--
Tweet, tweet.
Sorry, I'm too lasy to annotate this myself :-):
Link to NCBI
FASTA looks remarkably like the example given in the article.
Quicky description of FASTA (just one of many schemes but one of the most popular and oldest.
Perhaps rather than writing a trendy article trying to get buzzwords like genomics and bioinformatics together with geek speak, he should have done a tad more research.
Not to say there can't be huge improvements and trying to show the interplay (temporally AND physically) between genes. But don't do a half-assed job by ignoring what has already been used for decades.
Constrast this with a relatively more recent model genetic organism, the roundworm Caenorhabditis elegans. Standards were set early whereby all gene names were standardized by basis of their phenotype (eat-4 is a worm with a mutant feeding behavior, unc-6 describes a worm with uncoordinated movement, lin-41 describes a mutant with mutant cell development lineage, etc etc), and is ascii-friendly. As a result, C. elegans people enjoyed standardized and searchable computerized gene databases for much longer than other geneticists in other fields.
I hope that a standard becomes set and rapidly adapted; lab chiefs (to us grad student peons anyway) can often seem like PHB's in IT when it comes to adapting new methods and paradigms.
NO CARRIER
Insurance provider: Well Mr. Johnson, I'm afraid you have the tag.
Mr. Johnson: No!
Insurance provider: Yup. It's right between the <bald ugly-looking guy> tag and the <most likely to drink beer after finding out his wife gets fatter with age> tag.
Mr. Johnson: Oh God.
Insurance provider: I'm sorry.
Mr. Johnson: Is this hereditary? What can be done about my kids?
Insurance provider: Well, we can comment out the little buggers if we try. Some GScript may work to prevent them from passing the traits onto their children. Hell, we may even be able to use some Gava to touch up their faces so they won't be as ugly as you.
Mr. Johnson: And as for me?
Insurance provider: Your body is 2.0, Mr. Johnson. As far as we're concerned, noone supports you anymore.
- I don't care if they globalize against free speech. All my best free thoughts are done in my head.