Slashdot Mirror


A Genome Mark-up Language

There's an interesting story running about the need for, and development of, a genetic mark-up language. It's called GEML (Gene Expression Mark-up Language) and is basically a DTD. Obviously, when working with things like genes, GEML is useful, and it's a good example of why DTD is muy bien.

5 of 84 comments

  1. Not the first... by Tim · · Score: 5

    The bioxml project has been trying to do this very thing for quite a while now. Previous to that, there was the biomolecular sequence markup language (BSML), and I don't think it ever came close to becoming a standard. The problem that these efforts always run into is the sheer diversity of opinion on how biological data should be represented. Molecular biologists and computational biologists can't even agree on the basic things, like how to represent sequence regions, let alone more complex issues, like annotation syntax.

    Why Nature chose GEML as a standard is unclear--the article doesn't present a compelling argument for it over the alternatives, and the choice seems a little arbitrary. It'll be interesting to see what impact this has on the other projects, and how open the standard will be to extension and modification.

    --
    Let's try not to let fact interfere with our speculation here, OK?
  2. It's a closed standard. by Tim · · Score: 5

    From the GEML terms of use:

    The GEML Format is a free, public-domain, open standard created and licensed by Rosetta Inpharmatics, Inc. ("Rosetta") in order to define a single, distinct format for handling gene expression data and avoid proliferation of incompatible variations.
    ...
    You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML Format or documentation without written permission from Rosetta.


    So nobody can fork the standard without first consulting with Rosetta Inpharmatics. Wonderful. I just love their definition of "open standard."

    This looks like another corporate-buddy move by a major scientific journal, much like the Science/Celera deal a few weeks back...

    Go see bioxml for a truly open alternative.

    --
    Let's try not to let fact interfere with our speculation here, OK?
  3. Article ignored what is already used! by upstateguy · · Score: 5
    As a molecular geneticist, I am familiar with the markup languages that *already* exist for annotating genome sequences. Free software from NCBI even helps you format your sequences for submission to databases.

    Sorry, I'm too lazy to annotate this myself :-):

    Link to NCBI

    FASTA looks remarkably like the example given in the article.

    Quick description of FASTA (just one of many schemes, but one of the oldest and most popular).
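    For anyone who hasn't seen FASTA, here is a minimal Python sketch of the basic layout as it is commonly described: a header line starting with ">" followed by one or more lines of sequence. The function name `parse_fasta` and the example record are made up for illustration; real files have more wrinkles than this handles.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's header (the text after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record begins; everything after '>' is the header.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
    return records


# Made-up record for illustration only; not real sequence data.
EXAMPLE = """>example_seq hypothetical record
ATGGCGTACG
TTAGC
"""

records = parse_fasta(EXAMPLE.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

    The point is only that the format is plain text and trivially machine-readable, which is why it has lasted.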

    Perhaps rather than writing a trendy article trying to get buzzwords like genomics and bioinformatics together with geek speak, he should have done a tad more research.

    That's not to say there can't be huge improvements, such as trying to show the interplay (temporal AND physical) between genes. But don't do a half-assed job by ignoring what has already been used for decades.

  4. standards are important esp. for biologists by myc · · Score: 5
    Since classical genetics has been around for a lot longer than computers and ASCII, many classical genetic nomenclatures use notoriously ASCII-unfriendly symbols. For instance, as many of you know, Drosophila (fruit fly) geneticists can basically name genes anything they want to, and the nomenclature for specific mutant alleles of genes uses all sorts of evil things like subscripts, superscripts, Greek letters, etc. In short, it's just a total mess. Similarly, although yeast geneticists do have a standardized nomenclature, it's very ASCII-unfriendly for the same reasons. Nomenclature for mammalian systems such as mouse and human is even worse: there is basically no standard. For instance, some gene names use all caps while others only capitalize the first letter, and some use the common three-letter convention plus a number (BMP1, BMP2, BMP3, etc.), while others use a Drosophila-type naming scheme (e.g., agouti and shaker are mouse mutant names). There is some uniformity in the gene assignments from large sequencing projects, but those are just alphanumeric sequences and not very descriptive.

    Contrast this with a relatively more recent model genetic organism, the roundworm Caenorhabditis elegans. Standards were set early whereby all gene names were standardized on the basis of phenotype (eat-4 is a worm with a mutant feeding behavior, unc-6 describes a worm with uncoordinated movement, lin-41 describes a mutant with abnormal cell-lineage development, etc.), and the scheme is ASCII-friendly. As a result, C. elegans people have enjoyed standardized, searchable computerized gene databases for much longer than geneticists in other fields.
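    To make the "ASCII-friendly and searchable" point concrete, here is a rough Python sketch. The pattern (three or four lowercase letters, a hyphen, then a number) is an assumption inferred from the examples above, not an official rule, and the function names and the Greek-letter name are invented for illustration.

```python
import re

# Assumed pattern, inferred from eat-4, unc-6 and lin-41 above:
# three or four lowercase letters, a hyphen, then a number.
WORM_STYLE = re.compile(r"[a-z]{3,4}-\d+")


def is_worm_style(name: str) -> bool:
    """True if the whole name fits the assumed worm-style pattern."""
    return WORM_STYLE.fullmatch(name) is not None


def is_ascii(name: str) -> bool:
    """True if every character is plain 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in name)


# "abc\u03b1" is a made-up name with a Greek alpha, standing in for the
# ASCII-unfriendly symbols described above.
for name in ["eat-4", "unc-6", "lin-41", "BMP1", "agouti", "abc\u03b1"]:
    print(name, is_worm_style(name), is_ascii(name))
```

    A single pattern covers every worm gene name mentioned here, while the other schemes each need their own special cases.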

    I hope that a standard becomes set and is rapidly adopted; lab chiefs (to us grad student peons anyway) can often seem like PHBs in IT when it comes to adopting new methods and paradigms.

    --
    NO CARRIER
  5. HTML-like tags by Fervent · · Score: 5

    Insurance provider: Well Mr. Johnson, I'm afraid you have the tag.
    Mr. Johnson: No!
    Insurance provider: Yup. It's right between the <bald ugly-looking guy> tag and the <most likely to drink beer after finding out his wife gets fatter with age> tag.
    Mr. Johnson: Oh God.
    Insurance provider: I'm sorry.
    Mr. Johnson: Is this hereditary? What can be done about my kids?
    Insurance provider: Well, we can comment out the little buggers if we try. Some GScript may work to prevent them from passing the traits onto their children. Hell, we may even be able to use some Gava to touch up their faces so they won't be as ugly as you.
    Mr. Johnson: And as for me?
    Insurance provider: Your body is 2.0, Mr. Johnson. As far as we're concerned, no one supports you anymore.

    --

    - I don't care if they globalize against free speech. All my best free thoughts are done in my head.