Slashdot Mirror


A Genome Mark-up Language

There's an interesting story running about the need for, and development of, a genetic mark-up language. It's called GEML (Gene Expression Mark-up Language) and is basically a DTD. Obviously, when working with things like genes, GEML is useful - and a good example of why DTD is muy bein.

28 of 84 comments

  1. Pathetic research by the author. by Anonymous Coward · · Score: 2

    Mark Pesce ought to spend more time researching what he's writing about rather than plugging VRML. From the article:

    The "reporter" tag defines a sequence of codons (the four amino acids that comprise DNA) -- TACAGTGTCAGAATTAACTGTAGTC --




    Elementary Grade 9 biology here, Mark. A codon is a sequence of three nucleotides (e.g. GCC); codons are in turn translated into the 20 amino acids that constitute the building blocks of all our proteins. Don't just regurgitate what was in the press release!
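
    As a minimal sketch of the distinction (not from the article; the function name `split_codons` is made up for illustration), the quoted reporter string can be read off in triplets like this:

```python
def split_codons(seq, frame=0):
    """Split a nucleotide string into complete three-letter codons.

    Any trailing one or two bases that do not fill a codon are dropped.
    """
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


reporter = "TACAGTGTCAGAATTAACTGTAGTC"  # the string quoted from the article
print(split_codons(reporter))
```

    In this reading frame the 25-letter string gives eight complete codons with one base left over, so the individual letters are nucleotides, not codons.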

    Anyway, GEML is useless for real exchange and analysis of genetic information. For that purpose, I agree with a previous poster about packing 2 nucleotides per byte. It's an optimization that must be accepted as a standard before we can start doing on-demand heavy processing of genetic results.
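
    A rough sketch of what packing two nucleotides per byte could look like, assuming a made-up 4-bit code table (the `CODES` values and the `pack`/`unpack` names are illustrative, not any existing standard):

```python
# Hypothetical 4-bit codes: one bit per base, so "N" (any base) fits as well.
CODES = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "N": 0b1111}
BASES = {code: base for base, code in CODES.items()}


def pack(seq):
    """Pack two nucleotides into each byte, first base in the high nibble."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        hi = CODES[seq[i]]
        lo = CODES[seq[i + 1]] if i + 1 < len(seq) else 0  # 0 = padding
        out.append((hi << 4) | lo)
    return bytes(out)


def unpack(data, length):
    """Reverse pack(); `length` is the original number of nucleotides."""
    codes = []
    for byte in data:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return "".join(BASES[code] for code in codes[:length])
```

    Four bits per base is more than the two bits strictly needed for A, C, G and T, but the spare bits leave room for ambiguity codes such as N. The original length has to be stored separately so the padding nibble can be dropped.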

  2. Re:It's a closed standard. by Tim · · Score: 2

    "I would agree that bioxml serves as a much better licensing model for the community than GEML, but it's worth mentioning that at the current time they do not compete. GEML appears to be about gene expression, and bioxml has no DTDs addressing this."

    True. I do think that bioxml's goal is the same as GEML, but they're just not as far along as GEML (yet). It's just bothersome to me that a company-owned and controlled format like GEML could become very prevalent. I would still much rather see something like bioxml succeed instead. I hope they don't give up because of this...

    --
    Let's try not to let fact interfere with our speculation here, OK?
  3. Re:DTDs shouldn't be forked - that's the point by Tim · · Score: 3

    "there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - it's the same in this case."

    Apples and Oranges.

    HTML is controlled by the w3c--a standards body more or less independent of any particular company. Sure, M$ and Netscape had a lot of pull on HTML, but they *should* have, given that they *were* the browser market for a long time.

    In this case, we have a particular bioinformatics company graciously offering up their own "public domain" DTD as a standard for the rest of the industry (how generous). And a major scientific journal latching on to it. The only problem is, that same bioinformatics company must approve any and all changes to the "standard"! It would be the same if HTML were a copyrighted property of Netscape, Inc.

    It would be nice if the bioinformatics community could organize and form its own XML standards body, a la the w3c. An agreed-upon standard is almost always better than a legislated standard.

    --
    Let's try not to let fact interfere with our speculation here, OK?
  4. Not the first... by Tim · · Score: 5

    The bioxml project has been trying to do this very thing for quite a while now. Previous to that, there was the biomolecular sequence markup language (BSML), and I don't think it ever came close to becoming a standard. The problem that these efforts always run into is the sheer diversity of opinion on how biological data should be represented. Molecular biologists and computational biologists can't even agree on the basic things, like how to represent sequence regions, let alone more complex issues, like annotation syntax.

    Why Nature chose GEML as a standard is unclear--the article doesn't present a compelling argument for it over the alternatives, and the choice seems a little arbitrary. It'll be interesting to see what impact this has on the other projects, and how open the standard will be to extension and modification.

    --
    Let's try not to let fact interfere with our speculation here, OK?
  5. It's a closed standard. by Tim · · Score: 5

    From the GEML terms of use:

    The GEML Format is a free, public-domain, open standard created and licensed by Rosetta Inpharmatics, Inc. ("Rosetta") in order to define a single, distinct format for handling gene expression data and avoid proliferation of incompatible variations.
    ...
    You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML Format or documentation without written permission from Rosetta.


    So nobody can fork the standard without first consulting with Rosetta Inpharmatics. Wonderful. I just love their definition of "open standard."

    This looks like another corporate-buddy move by a major scientific journal, much like the Science/Celera deal a few weeks back...

    Go see bioxml for a truly open alternative.

    --
    Let's try not to let fact interfere with our speculation here, OK?
    1. Re:It's a closed standard. by wowbagger · · Score: 2
      The GEML Format is a free, public-domain[...]

      You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML[....]

      It seems somebody doesn't understand the legal meaning of "public domain": that anybody can modify what is in the public domain, without restriction. That is why free software and Open Source Software AREN'T "public domain"!
    2. Re:It's a closed standard. by Phillip2 · · Score: 2
      "Go see bioxml for a truly open alternative."

      I would agree that bioxml serves as a much better licensing model for the community than GEML, but it's worth mentioning that at the current time they do not compete. GEML appears to be about gene expression, and bioxml has no DTDs addressing this.

      As for Nature, well, I expect that their publishers are worried. Sooner or later paper journals are going to disappear. Perhaps they are diversifying, and have a stake in the company. This is not necessarily a problem. Even Nature does not have the power to make a standard.

      Phil

  6. Mejor by Pseudonymus+Bosch · · Score: 2

    You could also point out that the proper Spanish phrase is "muy bien".

    --
    Men with no respect for life must never be allowed to control the ultimate instruments of death.
    GW Bu
  7. lots of "exceptions" to the coding rules by peter303 · · Score: 2

    The genome is much like human language: a fair amount of regularity plus a lot of special cases. In fact the latter throw off decoding robots, and you see statistics like 98% decoded, etc. The scientific papers are full of nifty exceptions to what was believed before.

    The markup language would have to be flexible enough to encode all the exceptions, perhaps as a procedural attachment.

  8. GEML? Bah! Quadrary Encoding! by weston · · Score: 3

    While all of this is fairly unreadable -- even by geneticists -- it is easily read by a computer

    GEML? Hard to read? Bah! What we should *REALLY* do is figure out a quadrary (you know, after binary and trinary) encoding scheme for all the other info and just prepend it to the beginning of the amino acid sequence. Maybe even insert it in some points, with some sort of delimiting sequence, of course. None of this wimpy markup language stuff.




  9. Re:standards are important esp. for biologists by Star_Gazer · · Score: 2

    Unfortunately, they often tend not to do that :(
    At least life scientists do not.

    Instead, they use (the much dreaded) Word and wonder why all their betas, gammas, indices etc. tend to always disappear at the wrong moment...

    I once wrote a web application where people could submit an abstract for a congress on developmental neurobiology. I allowed for subsets of HTML or simplified LaTeX for text formatting. It was hell - even the brightest people in their field failed to understand the concepts. I believe I spent more time searching texts for missing tags or closing braces than for anything else...

  10. Article ignored what is already used! by upstateguy · · Score: 5
    As a molecular geneticist, I am familiar with the markup languages that *already* exist for annotating genome sequences. Free software from NCBI even helps you format your sequences for submission to databases.

    Sorry, I'm too lazy to annotate this myself :-):

    Link to NCBI

    FASTA looks remarkably like the example given in the article.

    Quick description of FASTA (just one of many schemes, but one of the oldest and most popular).
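
    For anyone who hasn't seen it: a FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal reader might look like the sketch below (the record names and sequences in `SAMPLE` are invented for illustration, and `parse_fasta` is not an NCBI function):

```python
SAMPLE = """\
>example1 hypothetical reporter sequence
TACAGTGTCAGAA
TTAACTGTAGTC
>example2 another made-up record
GCCGCCGCC
"""


def parse_fasta(text):
    """Return {record_id: sequence} for FASTA-formatted text."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""  # id = first word of header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


print(parse_fasta(SAMPLE))
```

    Everything after the first word of the header is free-text description, which this sketch simply throws away.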

    Perhaps rather than writing a trendy article trying to get buzzwords like genomics and bioinformatics together with geek speak, he should have done a tad more research.

    That's not to say there can't be huge improvements, such as trying to show the interplay (temporal AND physical) between genes. But don't do a half-assed job by ignoring what has already been used for decades.

    1. Re:Article ignored what is already used! by Phillip2 · · Score: 2
      The problem with most of the markup languages used in biology is that they are simple schemes built on a two-letter code at the beginning of each line. They tend to be very unexpressive, as there are no relations between the tags (a line is one thing or another, and each line is independent of the last). The main problem with this unexpressivity is that it means "all the biology is in the comment field", or in other words unstructured free text. To extract this information in a machine-readable way, you get straight into natural (or, as this is biology, fairly unnatural) language parsing, and hit the same brick wall that AI has for the last 30 years.
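
      To make that concrete, here is a small sketch of reading such a line-code format. The record in `SAMPLE` and the meaning of its codes are invented for illustration and are not taken from any particular database:

```python
from collections import defaultdict

# Hypothetical flat-file record: a two-letter code, then free text.
SAMPLE = """\
ID   EXAMPLE1
DE   hypothetical example gene
CC   expressed in liver; interacts with protein X
CC   see also EXAMPLE2
"""


def parse_flat(text):
    """Group line contents by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


record = parse_flat(SAMPLE)
print(record["CC"])
```

      The parser is trivial, but all it hands back for the CC lines is plain strings. Getting "expressed in liver" or "interacts with protein X" out as structured facts still means parsing prose.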

      I agree that the article linked to is half-assed and badly researched. But the sad fact is that most of the database formats in existence also seem to be fairly half-assed. I think that XML might help us to get around some of these problems.

      Phil

  11. XML considered harmful by dingbat_hp · · Score: 2

    This is another example of What's Wrong With XML (and particularly, what's wrong with proliferating schemas all over the place).

    A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.

    Look at real users of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.

    A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.

  12. Re:Human Markup Language by Shimbo · · Score: 2
    It will have the additional benefit that you could do gene therapy by applying an XSL stylesheet in the transporter.

    Oh dear, this is beginning to sound like a Voyager plot.

  13. standards are important esp. for biologists by myc · · Score: 5
    Since classical genetics has been around for a lot longer than computers and ASCII, many classical genetic nomenclatures use notoriously ASCII-unfriendly symbols. For instance, as many of you know, Drosophila (fruit fly) geneticists can basically name genes anything they want to, and the nomenclature for specific mutant alleles of genes uses all sorts of evil things like subscripts, superscripts, Greek letters, etc. In short, it's just a total mess. Similarly, although yeast geneticists do have a standardized nomenclature, it's very ASCII-unfriendly, due to the same Greek letters, superscripts, and subscripts. Nomenclature for mammalian systems such as mouse and human is even worse: there is basically no standard. For instance, some gene names use all CAPS while others only capitalize the first letter, and some use the common three-letter convention plus a number (BMP1, BMP2, BMP3, etc.), while others use a Drosophila-type naming scheme (e.g., agouti and shaker are mouse mutant names). There is some uniformity in the gene assignments from large sequencing projects, but those are just alphanumeric sequences and not very descriptive.

    Contrast this with a relatively more recent model genetic organism, the roundworm Caenorhabditis elegans. Standards were set early whereby all gene names were standardized on the basis of their phenotype (eat-4 is a worm with a mutant feeding behavior, unc-6 describes a worm with uncoordinated movement, lin-41 describes a mutant with an abnormal cell development lineage, etc.), and the names are ASCII-friendly. As a result, C. elegans people have enjoyed standardized and searchable computerized gene databases for much longer than geneticists in other fields.
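
    To show why that kind of naming is easy on software, here is a small sketch. The pattern (three or four lowercase letters, a hyphen, a number) is an assumption generalized from the examples above, not an official rule, and the function name is made up:

```python
import re

# Assumed pattern, based on names like eat-4, unc-6 and lin-41.
WORM_GENE = re.compile(r"[a-z]{3,4}-\d+")


def looks_like_worm_gene(name):
    """True if `name` fits the assumed C. elegans-style pattern."""
    return WORM_GENE.fullmatch(name) is not None


for name in ["eat-4", "unc-6", "lin-41", "BMP1", "agouti"]:
    print(name, looks_like_worm_gene(name))
```

    One plain-ASCII regular expression covers the worm names, while the mouse and human examples would each need their own special cases.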

    I hope that a standard gets set and rapidly adopted; lab chiefs (to us grad student peons anyway) can often seem like PHBs in IT when it comes to adopting new methods and paradigms.

    --
    NO CARRIER
    1. Re:standards are important esp. for biologists by fortunetroll · · Score: 2

      This is why scientists write documents in LaTeX, not ASCII.

      On Monday mornings I am dedicated to the proposition that all men are created jerks. -- H. Allen Smith, "Let the Crabgrass Grow"

  14. CellML by alexburke · · Score: 2

    From the Feed article:

    GEML ISN'T alone. It has a competitor, another DTD known as CellML, used to define the complex interactions that take place within cells. CellML takes an integrated approach to describing all of the processes within a living cell -- its genes, proteins, enzymes, and chemical reactions, the pathways and connections between each part of the whole. CellML seems well suited to the kinds of work that supercomputers do -- creating simulations of incredibly complex systems -- while GEML only defines the genetics that create the cell.

    Doesn't this seem a more apt way of describing a living organism? Sure, it's undoubtedly more complex and expensive (financially and computationally), but if you were to set an E10000 or Cray (or maybe a high-end Sun farm) to work on CellML, wouldn't it do more in less time than having to work everything out manually with GEML?


  15. And a closed standard ain't a bad thing... by mpesce · · Score: 2

    That's not a bad thing. Standards should not be arbitrarily pulled apart, particularly by competing commercial organizations (see my XML article on FEED from a few years ago for points on this matter). The VRML97 ISO spec is "owned" by the Web3D Consortium, in effect to make spec changes basically "illegal". Whatever that means.

  16. DTDs shouldn't be forked - that's the point by Ars-Fartsica · · Score: 2

    Open standard or not - there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - it's the same in this case.

  17. No tool support, yet by Ars-Fartsica · · Score: 2
    For the time being, DTDs are going to be required for defining new XML grammars - Schemas are still brand new, and tool support is weak to nonexistent.

    DTDs will probably stick around in one form or another for the next few years. It's a pity that Schemas couldn't have been part of XML 1.0; unfortunately, the coexistence of DTDs and Schemas will cause code bloat, as tools will basically need to support both.

  18. Wake up, RDF is dead by Ars-Fartsica · · Score: 2
    A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.

    Are you telling me that someone who doesn't have my data doesn't have it? Your astounding conclusion seems to be some sort of convoluted identity function.

    Look at real users of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.

    No one is doubting that poorly implemented schemas will degrade productivity, but I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help. Added to which you can employ namespaces to form compound documents from many schemas, so your limitation doesn't exist in any case.
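
    A minimal sketch of the namespace point, using Python's standard `xml.etree.ElementTree`. The two namespace URIs and the element names are invented for illustration and are not taken from GEML or any other real schema:

```python
import xml.etree.ElementTree as ET

# One document mixing elements from two hypothetical vocabularies.
DOC = """\
<experiment xmlns:expr="http://example.org/expression"
            xmlns:seq="http://example.org/sequence">
  <seq:reporter>TACAGTGTC</seq:reporter>
  <expr:level unit="ratio">2.5</expr:level>
</experiment>
"""

NS = {
    "expr": "http://example.org/expression",
    "seq": "http://example.org/sequence",
}

root = ET.fromstring(DOC)
reporter = root.find("seq:reporter", NS).text
level = float(root.find("expr:level", NS).text)
print(reporter, level)
```

    Each team's tool can pick out the elements in the namespace it understands and skip the rest, so the two vocabularies can be revised separately.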

    A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.

    You're vastly overestimating the dynamic nature of these schemas - this isn't the HTML DTD we're talking about. Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics. Once again, proof that a properly designed format weakens your counterarguments, and in any case, RDF isn't going to ever, EVER take off, so it's probably time to quit flogging it.

  19. Re:RDF hasn't woken up yet. by Ars-Fartsica · · Score: 2
    I see your point, but semantics are never enforceable anyway. At the end of the day, if people want to take your document and completely invert your semantics, they are going to do it.

    Added to which, you haven't told me how RDF gets around this, or are you saying that the issue should be avoided altogether?

  20. HTML-like tags by Fervent · · Score: 5

    Insurance provider: Well Mr. Johnson, I'm afraid you have the tag.
    Mr. Johnson: No!
    Insurance provider: Yup. It's right between the <bald ugly-looking guy> tag and the <most likely to drink beer after finding out his wife gets fatter with age> tag.
    Mr. Johnson: Oh God.
    Insurance provider: I'm sorry.
    Mr. Johnson: Is this hereditary? What can be done about my kids?
    Insurance provider: Well, we can comment out the little buggers if we try. Some GScript may work to prevent them from passing the traits onto their children. Hell, we may even be able to use some Gava to touch up their faces so they won't be as ugly as you.
    Mr. Johnson: And as for me?
    Insurance provider: Your body is 2.0, Mr. Johnson. As far as we're concerned, no one supports you anymore.

    --

    - I don't care if they globalize against free speech. All my best free thoughts are done in my head.

    1. Re:HTML-like tags by fortunetroll · · Score: 2

      So long as my child doesn't turn into a Javascript popup window.

      And then there's the parallel between reproduction of the species and that damn close-browser-window-makes-more-windows-popup trick that some sites pull on you. And I don't mean the fact that it's usually a porn site that does it.

      On Monday mornings I am dedicated to the proposition that all men are created jerks. -- H. Allen Smith, "Let the Crabgrass Grow"

  21. We are very close to this. by mentin · · Score: 2
    We are really close to being able to modify the human genome.

    From CNN: Genetically modified monkey - named ANDi carries in him an extra bit of DNA from a jellyfish. ANDi is the first primate to be similarly modified.

    See CNN story for full details.

    --
    MSDOS: 20+ years without remote hole in the default install
  22. Hmmm.... by Calle+Ballz · · Score: 2

    <GEML>
    <body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties="endowed" />
    </GEML>



  23. Muy Bein... wow by tlipcon · · Score: 2

    Yet another slashdot spelling mistake... If you're going to try to be witty and use other languages to try to increase people's perception of your intelligence or chic-ness, at least do it right. And this is a first post- MY first post, not the story's first post...


    --
    - It ain't easy, being green.