A Genome Mark-up Language
There's an interesting story running about the need for, and development of, a genetic mark-up language. It's called GEML - Gene Expression Mark-up Language - and is basically a DTD. Obviously, when working with things like genes, GEML is useful - and a good example of why DTD is muy bein.
They are not for text formatting. HTML was _misused_ for text formatting - the original idea was that you tell the computer what the thing _is_ and the _computer_ formats it. Like a simplified LaTeX (if you didn't know, in LaTeX, you basically say "I'm writing a book now", "This is the first chapter", "This is a footnote", etc., and the computer decides the formatting. That is to say, where to put everything. Or are you confused as to what formatting means?). Then XML came along. Its main use is telling the computer, in an extensible manner, the meaning behind a piece of data - e.g. "This is a spec for a fireplace", "This is an address", "This is how you make a fruit fly". The computer can then take appropriate action based on that (currently via XSLT and CSS, to format and present the data appropriately) - but it can do other things too. It's a way of processing semantic content. It's a step towards AI.
Mark Pesce ought to spend more time researching what he's writing about rather than plugging VRML. From the article:
The "reporter" tag defines a sequence of codons (the four amino acids that comprise DNA) -- TACAGTGTCAGAATTAACTGTAGTC --
Elementary Grade 9 biology here, Mark. A codon is a sequence of three nucleotides (e.g. GCC); codons are in turn translated into the 20 amino acids that constitute the building blocks of all our proteins. Don't just regurgitate what was in the press release!
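To make the point concrete, here is a minimal Python sketch that cuts the sequence quoted from the article into three-base codons. The function name `split_codons` is made up for illustration, and reading from the first base is an assumption, since the article does not state a reading frame.

```python
def split_codons(sequence):
    """Split a nucleotide string into consecutive three-letter codons.

    Returns (codons, leftover), where leftover holds any trailing bases
    that do not fill a complete codon.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return codons, seq[usable:]


# The sequence quoted from the article (assumed reading frame: first base).
reporter = "TACAGTGTCAGAATTAACTGTAGTC"
codons, leftover = split_codons(reporter)
print(codons)
print(leftover)
```

The quoted string is 25 bases long, so it yields eight complete codons plus one stray base; either way, it is a string of nucleotides, not of amino acids.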
Anyway, GEML is useless for real exchange and analysis of genetic information. For that purpose, I agree with a previous poster about packing 2 nucleotides per byte. It's an optimization that must be accepted as a standard before we can start doing on-demand heavy processing of genetic results.
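A rough Python sketch of what packing two nucleotides per byte could look like. The 4-bit codes below are hypothetical, not taken from any standard or from the earlier post; four bits per base leaves spare values for ambiguity symbols, whereas a plain A/C/G/T alphabet would fit in two bits (four bases per byte).

```python
# Hypothetical 4-bit codes, chosen for illustration only.
CODES = {"A": 0x1, "C": 0x2, "G": 0x4, "T": 0x8}
BASES = {value: base for base, value in CODES.items()}


def pack(sequence):
    """Pack a string of A/C/G/T into bytes, two bases per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 2):
        high = CODES[sequence[i]]
        # An odd-length sequence leaves the final low nibble as 0 (padding).
        low = CODES[sequence[i + 1]] if i + 1 < len(sequence) else 0
        out.append((high << 4) | low)
    return bytes(out)


def unpack(data, length):
    """Recover `length` bases from bytes produced by pack()."""
    bases = []
    for byte in data:
        bases.append(BASES.get(byte >> 4, ""))
        bases.append(BASES.get(byte & 0x0F, ""))
    return "".join(bases)[:length]


packed = pack("TACAGTGTCAGAATTAACTGTAGTC")
print(len(packed))
print(unpack(packed, 25))
```

The original length has to travel with the packed bytes, because the padding nibble of an odd-length sequence carries no base.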
"I would agree that bioxml servers as a much better licensing model for the community than GEML, its worth mentioning that at the current time they do not compete. GEML appears to be about gene expression, and bioxml has no DTD's addressing this."
True. I do think that bioxml's goal is the same as GEML, but they're just not as far along as GEML (yet). It's just bothersome to me that a company-owned and controlled format like GEML could become very prevalent. I would still much rather see something like bioxml succeed instead. I hope they don't give up because of this...
Let's try not to let fact interfere with our speculation here, OK?
"there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - it's the same in this case."
Apples and Oranges.
HTML is controlled by the W3C--a standards body more or less independent of any particular company. Sure, M$ and Netscape had a lot of pull on HTML, but they *should* have, given that they *were* the browser market for a long time.
In this case, we have a particular bioinformatics company graciously offering up their own "public domain" DTD as a standard for the rest of the industry (how generous). And a major scientific journal latching on to it. The only problem is, that same bioinformatics company must approve any and all changes to the "standard"! It would be the same if HTML were a copyrighted property of Netscape, Inc.
It would be nice if the bioinformatics community could organize and form its own XML standards body, a la the W3C. An agreed-upon standard is almost always better than a legislated standard.
The bioxml project has been trying to do this very thing for quite a while now. Before that, there was the biomolecular sequence markup language (BSML), and I don't think it ever came close to becoming a standard. The problem that these efforts always run into is the sheer diversity of opinion on how biological data should be represented. Molecular biologists and computational biologists can't even agree on the basic things, like how to represent sequence regions, let alone more complex issues, like annotation syntax.
Why Nature chose GEML as a standard is unclear--the article doesn't present a compelling argument for it over the alternatives, and the choice seems a little arbitrary. It'll be interesting to see what impact this has on the other projects, and how open the standard will be to extension and modification.
From the GEML terms of use:
...
The GEML Format is a free, public-domain, open standard created and licensed by Rosetta Inpharmatics, Inc. ("Rosetta") in order to define a single, distinct format for handling gene expression data and avoid proliferation of incompatible variations.
You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML Format or documentation without written permission from Rosetta.
So nobody can fork the standard without first consulting with Rosetta Inpharmatics. Wonderful. I just love their definition of "open standard."
This looks like another corporate-buddy move by a major scientific journal, much like the Science/Celera deal a few weeks back...
Go see bioxml for a truly open alternative.
we can all talk to Miguel de Icaza in his own language.
You mean object-oriented C?
__
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu
the United States of America, whose official language is English
Really? Does it appear in the US Constitution?
And aren't Linux and Perl Slashdot-official? Should we limit ourselves to discussing these?
You could also point out that the proper Spanish phrase is "muy bien".
WTF is "muy bein"? Haha.
--
Hemos, when comparing things, use than, not then. For instance, this article should've been from the "it's-better-than-the-web!" dept. The word then is used to describe a time sequence or other ordering, as in "first this, then that." The word than is used to compare things, as in "this is better than that." Got it?
*sigh*
--Joe--
Program Intellivision!
That reminds me of this grammar puzzler. Add punctuation to the following to make it grammatically correct:
The genome is much like human language - a fair amount of regularity plus a lot of special cases. In fact the latter throws off decoding robots and you see statistics like 98% decoded, etc. The scientific papers are full of nifty exceptions to what was believed before. The markup language would have to be flexible enough to encode all the exceptions - perhaps as a procedural attachment.
While all of this is fairly unreadable -- even by geneticists -- it is easily read by a computer
GEML? Hard to read? Bah! What we should *REALLY* do is figure out a quadrary (you know, after binary and trinary) encoding scheme for all the other info and just prepend it to the beginning of the amino acid sequence. Maybe even insert it at some points, with some sort of delimiting sequence, of course. None of this wimpy markup language stuff.
--
Tweet, tweet.
Unfortunately, they often tend not to do that :(
At least life scientists do not.
Instead, they use (the much dreaded) Word and wonder why all their betas, gammas, indices etc. tend to always disappear at the wrong moment...
I once wrote a web application where people could submit an abstract for a congress on developmental neurobiology. I allowed for subsets of HTML or simplified LaTeX for text formatting. It was hell - even the brightest people in their field failed to understand the concepts. I believe I spent more time searching texts for missing tags or closing braces than on anything else...
Hehheheh =:-) I prefer the _time honored_ method of exchanging genetic material =:-) [sorry, couldn't resist...]
---
Play Six Pack Man. I
Yes. That's true. But I'm studying Spanish, and I practice whenever I can.
"I'll take the red pill. No! Blue! AAAaaaahhhhhhhhh"
- Monty Python meets the Matrix
The previous poster suggests that the incorrect muy bein should be spelled muy bien. This is a correct spelling, but misses the grammatical error (hey, this is Slashdot!). bien is (generally) adverbial (meaning "well"), and since we're talking about a DTD, we want to use an adjective ("good"). In other words, the sentence should read "DTD is muy bueno."
ttaacattgagctaacgataggatacgattacattgagctaacgata
tacgattacattgagctaacgataggatacgattacattgagctaac
</genes>
Sorry, I'm too lazy to annotate this myself :-):
Link to NCBI
FASTA looks remarkably like the example given in the article.
Quick description of FASTA (just one of many schemes, but one of the oldest and most popular).
Perhaps rather than writing a trendy article trying to get buzzwords like genomics and bioinformatics together with geek speak, he should have done a tad more research.
Not to say there can't be huge improvements and trying to show the interplay (temporally AND physically) between genes. But don't do a half-assed job by ignoring what has already been used for decades.
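For anyone who hasn't met FASTA: a record is a header line starting with ">" followed by one or more lines of sequence. A minimal Python reader might look like the sketch below; `parse_fasta` and the record name are invented for illustration, and real tools handle far more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the header text is made up for this example.
example = """>example_1 hypothetical record for illustration
ttaacattgagctaacgataggatacgattacattgagctaacgata
tacgattacattgagctaacgataggatacgattacattgagctaac
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

One free-text header line is all the markup there is; anything richer (annotation, expression levels) has to live in that line or in a separate file.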
GEML just sounds better to the kind of people who would be in charge of this kind of thing. bioxml has no capital letters, is half-pronounceable and half-gotta-be-spelled-out, etc. GEML is all capital letters, can be spelled out or pronounced as a whole, etc. I think that why they chose GEML as a standard is far from unclear; whether the choice is rational is another matter.
I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help.
Admittedly RDF hasn't been used much YET. After all - it's only a year since bog-standard XML took off. I'm a contractor; Dec '99 I couldn't sell XML skills to anyone, Jan 2000 my phone melted. By Easter 2000 everyone else was an XML "guru".
Wrox don't shift their first RDF book until October. You can't store production-grade quantities of RDF in a database yet. How can you say it's "past", when we haven't even finished building the infrastructure tools yet ?
OTOH, the one widely distributed RDF app that is out there (RSS) is even part of Slash. Take a look at those Slashboxes - they aren't running DocBook.
Added to which you can employ namespaces to form compound documents from many schemas,
That's just a quicker recipe for tag soup. The ability to have five different ways to express an author's address doesn't make it any easier to move data between applications or avoid "Dear Mr. Occupier" errors.
"It's the Semantics, Stupid"
Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics.
What document semantics ? DocBook doesn't do semantics, and it has a structure that thinks everything is a computer manual. A schema that has a <GUIMenuItem> element, but doesn't have a means of expressing a target readership age ? Rights management that's a bare copyright element with an implied recommendation to attach generated text of "All Rights Reserved" when you render it ? (What if the rights _aren't_ all being reserved ?)
DocBook is a pile of bodges and hacks, and I only use it because I don't know anything else that's out there, and I'm reluctant to roll my own and add another one to the pile.
DocBook is Perl for text documents; lots of "There's More Than One Way To Do It", and not a lot of "Done. Sorted.".
My current project (the next version of ARKive) is a huge graph of linked nodes, most of which are either text or rich-media. The directed nature of the graph blows plain XML out of the water - there's just no way to handle the referencing problem in XML; you're either fooling around with the inadequate ID & IDREF, or you do it through either XLink, or your own href attributes and lose support for any notion of document structure based on these links, unless you code it yourself at the application level. With RDF, I just talk to an API like Jena and when I make things related, they stay related (and the underlying engine will hand them back to me on demand, as whatever relevant fragment of the document I might need).
I am using DocBook to represent the text content nodes. It's not much more advanced than HTML though - I need a huge amount of markup on each node to select the appropriate set (what it refers to, what it says about it, whether it's written for 7 or 17 year olds) and I hold this trivially in RDF, with DocBook under a content property.
There's simply no way I could express this in DocBook alone. I could express it in DocBook with embedded LOM markup, and I could do that very easily just by namespacing two schemas as you suggest. The trouble with that approach though is that the only code that could ever make sense of it would be my own. With RDF, any RDF app (like the Redland app framework) can wander through it and make pretty good use of it, even if it hasn't seen the documents before.
XML has no mechanism for a semantic schema. Attempting to use the structural schema it does have, as one, doesn't work well and it certainly doesn't travel.
DTDs are going to be required for defining new XML grammars
Rubbish. I haven't written a DTD in over 18 months. Tool support for Schemas is better than for DTDs, mainly because Schemas also use XML as their expression syntax and so it's trivial to build tools (often with XSLT) for them.
Schemas are still brand new, and tool support is weak to nonexistent.
Schema has been a Candidate Recommendation since October. Maybe it's not signed off yet, but it's pretty stable and usable out in the "real world".
I thank M$oft for this one. Dropping early versions of XSL and Schema onto developers a long time ago put a rocket under the W3C. This might have ended badly, except M$oft then did something unusual for them and fell back into line with a developing standard. Credit where credit's due...
This is another example of What's Wrong With XML (and particularly, what's wrong with proliferating schemas all over the place).
A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.
Look at real users of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.
A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.
"The 'reporter' tag defines a sequence of codons (the four amino acids that comprise DNA)"
sheeesh! can't they even get the basics right? a codon is a unit of three nucleotides that encodes a single amino acid (there are three out of the 64 that do not code for an amino acid; rather, they code for the translation stop signals).
four nucleotides comprise DNA. there are 20 amino acids.
this type of error is shameful.
james
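The numbers above are easy to check. A short Python sketch, using the three stop codons of the standard genetic code written in DNA letters (the variable names are just for this example):

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Stop codons of the standard genetic code, written in DNA letters.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Every possible triplet over a four-letter alphabet: 4 ** 3 combinations.
all_codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]
coding_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), len(coding_codons))  # prints: 64 61
```

So 61 codons are shared among 20 amino acids, which is why most amino acids are encoded by more than one codon.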
because it's XML.
james
Oh dear, this is beginning to sound like a Voyager plot.
There are lots of ways to extend and modify the behavior of an XML dialect and an associated DTD/schema without touching the core standard. That's the Xtensible part. They are merely holding veto power over back-propagation of enhancements into the original work.
The point of XML is to standardize the manner of extension. Even SGML allowed for internal subsets of markup declaration to extend the core DTD. The goal of such a standard is not to eliminate incompatibility but to minimize the pain of dealing with it.
Forking a DTD is like forking pudding: it doesn't do anything.
illegitimii non ingravare
Their license would appear to prohibit that which their chosen technology is intended to facilitate.
[Assume mandatory smiley here]
Ciao
----
FB
Contrast this with a relatively more recent model genetic organism, the roundworm Caenorhabditis elegans. Standards were set early whereby all gene names were standardized on the basis of their phenotype (eat-4 is a worm with a mutant feeding behavior, unc-6 describes a worm with uncoordinated movement, lin-41 describes a mutant with an abnormal cell development lineage, etc.) and are ASCII-friendly. As a result, C. elegans people enjoyed standardized and searchable computerized gene databases for much longer than geneticists in other fields.
I hope that a standard becomes set and rapidly adopted; lab chiefs (to us grad student peons anyway) can often seem like PHBs in IT when it comes to adopting new methods and paradigms.
NO CARRIER
It's probably not useful to express hair color as full RGB values, though.
You were being serious, right? Oh.
If a corporation is a personhood, is owning stock slavery?
XML Schemas have the benefit of being written in XML. That should make XML Schema support fairly easy to manage. Of course, the parser has never been the hard part with XML.
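As a small illustration of that point, an ordinary XML parser can read a schema with no schema-specific tooling at all. The sketch below uses Python's standard `xml.etree.ElementTree`; the schema is hypothetical (the element names `gene` and `sequence` are invented, not taken from GEML), and the namespace is the one from the final XML Schema Recommendation.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema for illustration; element names are not from GEML.
schema_text = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="gene" type="xs:string"/>
  <xs:element name="sequence" type="xs:string"/>
</xs:schema>"""

# The schema is itself an XML document, so a generic parser reads it.
root = ET.fromstring(schema_text)
declared = [el.get("name") for el in root.findall(f"{{{XSD_NS}}}element")]
print(declared)
```

This only lists the top-level element declarations; actually validating documents against a schema is the part that needs a dedicated library.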
Can't wait for Lincoln Stein's GEML.pm module, with handy shortcuts and image creating functions :)
I bet I can make a script that then creates a life form. aww yeah.
Dada Mail - Program, Art Project or Absurdity?
From the Feed article:
GEML ISN'T alone. It has a competitor, another DTD known as CellML, used to define the complex interactions that take place within cells. CellML takes an integrated approach to describing all of the processes within a living cell -- its genes, proteins, enzymes, and chemical reactions, the pathways and connections between each part of the whole. CellML seems well suited to the kinds of work that supercomputers do -- creating simulations of incredibly complex systems -- while GEML only defines the genetics that create the cell.
Doesn't this seem a more apt way of describing a living organism? Sure, it's undoubtedly more complex and expensive (financially and computationally), but if you were to set an E10000 or Cray (or maybe a high-end Sun farm) to work on CellML, wouldn't it do more in less time than having to work everything out manually with GEML?
--
I'm not fully up on the XML scene, but aren't DTDs being replaced in the very near future by XSDs (XML Schema Definitions)? They at least are a dialect of XML, so to use XML you only have to learn one (easy) language.
Over a year ago I described a human with XML tags for fun, something like this:
<XML>
<HUMAN GENDER="m/f">
<HEAD>
<BRAIN></BRAIN>
</HEAD>
<BODY></BODY>
<LEGS></LEGS>
</HUMAN>
</XML>
etc etc etc, maybe at some point null transportation technology will describe a human completely with his genetics, memory and personality with XML, and transport the person as energy over wireless media to put it all together at the other end.
Hopefully fast XSLT engines will exist by then and hopefully the whole thing will not be based on MS implementation of XML document.
You can't handle the truth.
Answer: Just in case we ever need to view our genome sequence on IE
And if the human genome is about 3 gig, wouldn't wrapping it in quaint bits of information blow it up by quite a bit? Sorry, but the idea seems to rank on the same idiocy level as XML.
Mea culpa.
That's not a bad thing. Standards should not be arbitrarily pulled apart - particularly by competing commercial organizations (see my XML article on FEED from a few years ago for points on this matter). The VRML97 ISO spec is "owned" by the Web3D Consortium, precisely so that unsanctioned spec changes are basically "illegal". Whatever that means.
Open standard or not - there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - it's the same in this case.
DTDs will probably stick around in one form or another for the next few years - it's unfortunate that Schemas couldn't have been part of XML 1.0, because the coexistence of DTDs and Schemas will cause code bloat as tools will basically need to support both.
Are you telling me that someone who doesn't have my data doesn't have it? Your astounding conclusion seems to be some sort of convoluted identity function.
Look at real users of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.
No one is doubting that poorly implemented schemas will degrade productivity, but I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help. Added to which you can employ namespaces to form compound documents from many schemas, so your limitation doesn't exist in any case.
A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.
You're vastly overestimating the dynamic nature of these schemas - this isn't the HTML DTD we're talking about. Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics. Once again proof that a properly designed format weakens your counterarguments, and in any case, RDF isn't going to ever, EVER take off, so it's probably time to quit flogging it.
It's nice that the genome has been "sequenced in its entirety" and is presently undergoing "error checking" which should "continue for the next year".
Last time I checked at NCBI the genome was 30.4% finished, and the rough draft assembly is in 148307 pieces according to the Golden Path.
And of course the finished target for the human genome is three years from now!
So, how do you write Hello World (or its equivalent) in GEML?
Insurance provider: Well Mr. Johnson, I'm afraid you have the tag.
Mr. Johnson: No!
Insurance provider: Yup. It's right between the <bald ugly-looking guy> tag and the <most likely to drink beer after finding out his wife gets fatter with age> tag.
Mr. Johnson: Oh God.
Insurance provider: I'm sorry.
Mr. Johnson: Is this hereditary? What can be done about my kids?
Insurance provider: Well, we can comment out the little buggers if we try. Some GScript may work to prevent them from passing the traits onto their children. Hell, we may even be able to use some Gava to touch up their faces so they won't be as ugly as you.
Mr. Johnson: And as for me?
Insurance provider: Your body is 2.0, Mr. Johnson. As far as we're concerned, no one supports you anymore.
- I don't care if they globalize against free speech. All my best free thoughts are done in my head.
The article mentions that the GEML fragment on display may be incomprehensible even to geneticists, but is readable by computers. It goes on to say that the value of GEML is in allowing computers to share data. I am confused, since XML is either offering a verbose definition of computer data that even humans can understand, or allowing human data (David is an Employee of IBM) to be expressed in self-describing, computer-accessible form.
Since the genetic code is already digital, transforming it into something that computers can process seems rather pointless. What is wrong with AGTCTTCGADC? Making it verbose for humans is also not very useful, because what GEML seems to offer is very raw data, essentially a wrapper around raw sequences.
Maybe the issue is really hype, i.e. a clever gimmick to drive companies to share information by offering them a bandwagon they can't refuse to climb onto?
-- look, cheese ahoy!
<GEML>
<body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties=endowed>
</GEML>
You have a syntax error in 'crotchproperties', 'crotchproperties' set to "0"
Coming soon, MS GenomePage 2006, so you can really start screwing things up.
"Who knew something as harmless as willful ignorance could end up having real consequences?"
From CNN: Genetically modified monkey - named ANDi carries in him an extra bit of DNA from a jellyfish. ANDi is the first primate to be similarly modified.
See CNN story for full details.
MSDOS: 20+ years without remote hole in the default install
pure luck my friend :P
id like to give props to pieceofshit and anyone else who knows me
that is pathetic
<GEML>
<body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties=endowed>
</GEML>
I have my Gene Expression Mark-up Language, my HTML, and XML. I can express any form of text in the world. I can now die happy.
You stupid bastard, you don't have no arms left. It's just a flesh wound.
GeNeTeX
On Monday mornings I am dedicated to the proposition that all men are created jerks. -- H. Allen Smith, "Let the Crabgrass Grow"
Yeah and CaML. Wonder what sort of genes that has...
What kind of nonsense is this? Everyone knows that we will use the safe, reliable Microsoft standard GEML to encode our genes that safely and reliably allow us to live. We wouldn't have it any other way!
While I won't use this directly, it's interesting in that it proves XML is here to stay... SOAP, BizTalk, MathML...
Yet another slashdot spelling mistake... If you're going to try to be witty and use other languages to try to increase people's perception of your intelligence or chic-ness, at least do it right. And this is a first post- MY first post, not the story's first post...
--
--
- It ain't easy, being green.