An Overview of Modern XML Processing Techniques and APIs
Dare Obasanjo writes with a link to his article "A Survey of APIs and Techniques for Processing XML" on xml.net. It starts off "In recent times the landscape of APIs and techniques for processing XML has been in the process of reinventing itself as developers and API designers learn from their experiences and some past mistakes. APIs such as DOM and SAX which used to be the bread and butter of XML APIs are giving way to new models of examining and processing XML. However although some of these techniques have become widespread amongst developers who primarily work with XML they are still unknown to the general body of developers. Nothing highlights this better than a recent article by Tim Bray one of the co-inventors of XML entitled XML is too Hard for Programmers and the subsequent responses on Slashdot." Read the entire article to learn more about the state of the XML art. Added in the missing link.
With the new technology known as a hyperlink, we can simply click a location on the screen and be taken to the article, instead of having to go to xml.net and find it ourselves.
Argh...
This space intentionally left blank.
Um....link?
:wq
The article is actually on xml.com, not xml.net. Here is the url: http://www.xml.com/pub/a/2003/07/09/xmlapis.html
This is a horrible post!
There is no link to the article, and the one link that comes close (to xml.net) points to a site that says:
xml.net will be online soon. Sign up now and we'll keep you posted on our progress.
Timothy, how did you read this as the editor?
I am interested in the topic: please fix the post so that we can read the article.
XML sucks because it's being used wrongly. It is being used by people who view it as being an encapsulation of semantics and data, and it's not. XML is purely a way of structuring files, and as such, really doesn't add much to the overall picture. XML came from a document preparation tradition. First there was GML, a document preparation system, then SGML, a document preparation system, then HTML, a document preparation system, and now XML. All were designed as ways humans could structure documents. Now we've gotten to the point where XML has become so obscure and so complex to write, that it can no longer be written by people. If you talk to people in Sun about their libraries that generate XML, they say humans cannot read this. It's not designed for human consumption. Yet we're carrying around all the baggage that's in there, because it's designed for humans to read. So XML is a remarkably inefficient encoding system. It's a remarkably difficult to use encoding system, considering what it does. And yet it's become the lingua franca for talking between applications, and that strikes me as crazy.
People think, "Once I've got my data in XML that's all I've got to do. I've now got self-describing data," but the reality is they don't. They're just assuming that the tags that are in there somehow give people all the information they need to be able to deal with the data. Now, for some things there are standards. For example, there are some standards like RSS and RDF, which give you very simple ways of describing web page content. But a random XML file, especially machine generated XML files, can be as obscure as binary data.
Ant is a really good example, because in that case you're using XML as a user-specified input language, which is really inappropriate in that context. I'd much rather have a genuine grammar. I want to be able to type something simple and easy for me. I don't care if it's easy for the tool to parse, that's the tool's problem. I want it to be easy for me to write. And in cases like that, it's really the case of the programmer saying, "Oh look, here's an XML parser. I can just take XML files. That's easier." So one programmer in one context puts a burden on the other 100,000 programmers trying to use it.
cpeterso
It's also "too hard" in a variety of circumstances where the reason it's too hard is that it's the wrong thing to use.
Good programmers can cope with XML just fine when it's just what they need to get the job done, and are smart enough to avoid it when it isn't.
Experience is a hard school, but fools will learn no other.
So if you're unhappy with working directly with XML -- lord knows I am: it obscures the content far too much -- use a formal structured human readable markup system like DocUtils or ASCIIDoc.
They're both quite robust, well-suited to documenting APIs, writing technical manuals, etcetera. They can both pump out DocBook-XML from the plaintext, lightly-formatted input.
The beauty of these formats is that they are simple and often intuitive.
You emphasize text by wrapping it in *asterisks*, just like you used to do in the old days. You create a title by
Underlining
===========
* bullets
1. numbered sequences
- and all that jazz.
I'm on the verge of completing a 50+ page hardware reference manual. The source files are plaintext, using ReST (DocUtils) markup. They are transformed straight into DocUtils XML, and my XSL/XSL-FO stylesheets digest that into PDF using FOP. It's a thing of beauty.
More difficult than laying out a publication using Ventura Publisher or FrameMaker? Yes: it's certainly not WYSIWYG. More flexible, in terms of allowing any git to subsequently update the manual? Yes, far and away easier.
Anyway, mark me down as happy with ReST structural markup, XSL/XSL-FO transformations, and FOP PDF creation!
--
Don't like it? Respond with words, not karma.
Yeah, I was surprised too.
I disagree about the human readable/writable bit. It is easily human readable/writable if it's properly structured (if it's complex because the information is complex, that's an inevitability. Make the data model simpler, if that's a problem to you). In terms of efficiency - sure, binary formats are more efficient, but they are much harder to debug when they go wrong.
I agree that XML documents are not necessarily self-documenting. That isn't surprising. XML is about syntax, not semantics. You can use XSD to provide basic (integer vs char) semantics, but anything more complicated comes back to human understanding and agreed specification. If you understand the objects in your schema, XML can provide a good presentation of those objects.
My team (myself and another guy) implemented a mapping framework in Java that I think is more useful than the other frameworks I've seen.
So when reading the comments about the weaknesses of object-mapping tools, keep in mind that some of us have overcome them. :)
Peace be with you,
-jimbo
XML Tools for Mac OS X
I know some of you don't care and/or are tired of hearing this, but XML data tends to violate relational rules. I would like to see a souped-up comma-delimited standard for data sharing. XML is perhaps suited okay for documents, but NOT structured data (except in relatively rare circumstances). Dr. Codd knew what he was doing. Relational has a more ordered, consistent structure than XML.
Table-ized A.I.
So you can make an application-specific grammar very easily. Not all data in the world is hierarchical or can be easily crammed into XML form. Maybe Larry is on to something here.
I have an application that does not have any access to a database or any database libraries on the server in which it will be run.
It needs to store a small amount of data in a text file. Initially I thought I would use XML, but figuring out how to parse the data after it was created proved very difficult. I had some small luck with DOM inside tcldom, but it seemed like a lot more effort than it was worth.
This file is a basic tree with branches of depth 2. All branches have the exact same structure.
Is XML the way to go for such a project? What other toolkits have you used to store program data in a text file (small strings less than 64 chars)? I definitely want it to be in some human readable form so it can be debuggable.
I really wanted to keep the data in a form that someone else could use for a completely unrelated program (if the need ever arose).
Thanks for any input. I was glad to see this topic come up because it got me thinking about the issue again.
-Jonathan
The Ro Factor - Jeep/Linux Weblog
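For the kind of file described above (a tree of depth 2 whose branches all have the same structure), a small XML file can be written and read back with very little code. The following is only a minimal sketch in Python using the standard library's xml.etree.ElementTree; the element names (data, item, name, value) and the sample records are hypothetical and not taken from the poster's application.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical records: every branch has the same short string fields.
records = [
    {"name": "alpha", "value": "first entry"},
    {"name": "beta", "value": "second entry"},
]

def save(records, path):
    """Write the records as a depth-2 tree: <data> / <item> / field elements."""
    root = ET.Element("data")
    for rec in records:
        item = ET.SubElement(root, "item")
        for key, text in rec.items():
            ET.SubElement(item, key).text = text
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

def load(path):
    """Read the file back into a list of dicts, one dict per <item>."""
    root = ET.parse(path).getroot()
    return [{field.tag: field.text for field in item}
            for item in root.findall("item")]

# Round trip through a temporary file to show that nothing is lost.
path = os.path.join(tempfile.mkdtemp(), "data.xml")
save(records, path)
restored = load(path)
print(restored == records)
```

The resulting file stays readable in a text editor, and any other program with an XML parser can consume it without knowing anything about the program that wrote it.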
you make good points.
I understand the idea that if you need an XML editor (or compiler!) to make it easy to write XML, what's the point of having a human readable format? Would you suggest standard binary formats?
I think, in fact, there are good binary encodings that also can encapsulate DOM shaped structures. It seems to me that the accessibility is still a good thing, I like XML as a universal interchange format that supports arbitrary nesting properties for the embedded blocks. All that could be done in a universal standard.
But then we'd have to resurrect the byte-ordering wars, settled by truce years ago but not forgotten.
-pyrrho
How cheap do you think I think it is? Actually I have no idea because I don't use XML. But it is obviously slower if you have to parse stuff, than if you have it in a nice binary format. So libxml2 is the fastest parser, big whoop, it's still too slow.
I don't agree that binary formats are inherently harder to debug. It depends on the structure of the file.
I don't get it. Can someone explain?
The fact that XML is a human-readable format and anyone can make one is, I think, the biggest problem of all in this ongoing xml-is-the-biggest-thing-since-sliced-bread saga.
I've been working with XML for 2 years now, and I am constantly reminded that, just as it happens with html, vb, or any other "simple to use" technology out there, anyone can use it, but few know how to use it well. I've seen xml structures that would have you rolling on the floor laughing, so inefficient and dumb were they.
It's just like databases, really. It's pretty easy to mess it up so badly it's unusable.
XML is really the most insignificant part of the whole. One has to know at least xpath and xslt to use xml properly. Because the great advantage I see about the format is the ease of transforming data from one format to another, and you do that by applying transformations to it.
So instead of hardwiring the structure to the data (and banging your head on the wall when suddenly you need a slightly different structure for the same data), you just store everything in a format that will be easy to transform to suit your purposes. 80% of my work in xml is actually applying filters, transformations and the like to my data.
An example: I work for an insurance company which is putting every kind of information it has on the web for the clients and the agents to consult. The data is mostly the same for everybody: contracts, receipts, etc.
Almost all applications deal with the same data, only in very different ways. Even in only one application, the same data can be manipulated in different ways. And of course, it is the web, so we need to cache stuff and minimize database access.
Some applications are on the intranet, others on the extranet, others on the internet. We're using webservices to simplify access to the data (since it is mostly the same for everyone), so the webservices work for everyone, handing out data that is then transformed for specific needs.
Since the basic format of the information is stable, we don't need to learn or implement new formats, different calls to webservices and the like. We just take the information and transform it. Caching is straightforward for everyone (just text, after all), validation is easy (xsd).
XML is not for everything, but if you're working with pure information and passing it around to a zillion different apps, it sure beats sliced-bread.
shana
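The idea in the comment above, one stable source format that each application transforms for its own needs, can be illustrated roughly as follows. Python's standard library has no XSLT engine, so this sketch uses plain xml.etree.ElementTree code as a stand-in for stylesheets; the contract data, element names and both "views" are invented for illustration and are not the poster's actual system.

```python
import xml.etree.ElementTree as ET

# Hypothetical stable source format shared by every application.
SOURCE = """
<contracts>
  <contract id="c1" client="Ana"><premium>120.50</premium><agent>Rui</agent></contract>
  <contract id="c2" client="Ben"><premium>80.00</premium><agent>Rui</agent></contract>
</contracts>
"""

def client_view(root, client):
    """One consumer: plain-text lines listing a single client's contracts."""
    return [
        f"{c.get('id')}: {c.findtext('premium')}"
        for c in root.findall("contract")
        if c.get("client") == client
    ]

def agent_totals(root):
    """Another consumer: total premium per agent, derived from the same data."""
    totals = {}
    for c in root.findall("contract"):
        agent = c.findtext("agent")
        totals[agent] = totals.get(agent, 0.0) + float(c.findtext("premium"))
    return totals

root = ET.fromstring(SOURCE.strip())
print(client_view(root, "Ana"))
print(agent_totals(root))
```

Neither function requires the source format to change; a new consumer is just one more transformation over the same document.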
XPath is awesome for getting at what you really want. SAX and DOM are too low level for implementing anything other than an XPath or XSLT engine :)
Even easier is putting System.Xml.Serialization attributes on your properties in C#. Blammo, instant configuration file for your classes.
And I hear XQuery shall revolutionize the world as we know it. There are some early implementations already.
r4lv3k
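To make the XPath point concrete, here is a small comparison in Python. It is a sketch only: xml.etree.ElementTree supports just a limited subset of XPath, and the library document below is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this comparison.
DOC = """
<library>
  <book year="2001"><title>First</title></book>
  <book year="2003"><title>Second</title></book>
  <magazine year="2003"><title>Third</title></magazine>
</library>
"""
root = ET.fromstring(DOC.strip())

# Path expression: titles of books from 2003, stated in a single line.
by_path = [t.text for t in root.findall("./book[@year='2003']/title")]

# The same query written by walking the tree node by node, DOM-style.
by_walk = []
for node in root:
    if node.tag == "book" and node.get("year") == "2003":
        for child in node:
            if child.tag == "title":
                by_walk.append(child.text)

print(by_path == by_walk)
```

Both approaches select the same nodes; the path expression simply says what is wanted instead of how to reach it.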
Yes, it is obviously slower. But that's a tradeoff that many people are willing to make. And I didn't say that libxml2 was the fastest parser - I'm not qualified to say that, as I've not benchmarked it properly.
Bear in mind that many uses of XML are for data interchange, where speed is less important than compatibility. XML gives you more potential to add extra data into the format and still use a mix of old and new tools. Binary formats generally require that all programs using the format be upgraded if the format changes.
The "structure of the file". Obviously, binary files have no evident structure to humans without special tools - so if you have a mixture of content (e.g. floating point, variable length text) you have to rely on remembering the order of fields and be able to translate their values.
Honestly, I'm not saying there's not a place for binary formats (there is and I'd be happy to use them) but for many of the applications XML is used for, particularly data interchange, human readability is a big bonus. The XML I've been involved with should be pretty readable to anyone with a knowledge of the domain.
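The point above about mixing old and new tools can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any real format: the element names, the version attribute and the read_v1 function are all hypothetical.

```python
import xml.etree.ElementTree as ET

def read_v1(xml_text):
    """An 'old' reader: it only knows about <name> and <price> inside <item>."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("name"), float(item.findtext("price")))
        for item in root.findall("item")
    ]

# Document as the original producer wrote it.
OLD_DOC = "<items><item><name>widget</name><price>2.5</price></item></items>"

# A newer producer adds attributes and a <currency> element
# that read_v1 has never heard of.
NEW_DOC = (
    '<items version="2"><item sku="w-1"><name>widget</name>'
    "<currency>EUR</currency><price>2.5</price></item></items>"
)

# The old reader looks fields up by name, so the additions are simply ignored.
print(read_v1(OLD_DOC) == read_v1(NEW_DOC))
```

A reader of a fixed-offset binary layout would typically need to be updated when a field is inserted in the middle of a record, which is the tradeoff the comment describes.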
XML sucks because of attributes. I can have a and a thing and they are treated differently. How pointless that is.
Plus any drooling idiot can come up with a way to represent a tree in a file. They did that 100 years ago with Lisp.