XML Co-Creator says XML Is Too Hard For Programmers
orangerobot writes "Tim Bray, one of the co-authors of the original XML 1.0 specification, has a new entry on his website explaining why he's been feeling unsatisfied lately with XML, and says his last experience writing code for handling XML was 'irritating, time-consuming, and error-prone.' XML has always had a divided response among the technical community. The anti-XML community has several sites stating their positions."
XML isn't intended for web pages. That's what you missed:
Its biggest use right now is data interchange. Moving bits between one magic widget and another. And for that, HTML sucks. It just can't represent arbitrary data. Programming languages (C++, Java) are for instructions, not data.
XML fits in perfectly where it's at use-wise. Tim Bray is talking about programming for it: The available interfaces are very counter-intuitive, and that's what Bray's getting at.
Tim Bray thinks that callback-based XML APIs are a bit awkward to use. He would prefer something like a pull parser (see http://www.xmlpull.org for examples in Java) to the current Perl XML APIs.
And he would probably want to be able to parse parts of documents ("XML Fragments"), rather than whole documents.
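To make the callback vs. pull distinction concrete, here is a minimal sketch in Python (not Perl, and not the xmlpull.org API) using the standard library's `xml.etree.ElementTree.XMLPullParser`. The function name `item_titles` and the `title` element are made up for illustration. The caller feeds text in chunks and asks for events when it wants them, instead of registering callbacks:

```python
import xml.etree.ElementTree as ET

def item_titles(chunks):
    """Yield the text of every <title> element, pulling events as data arrives.

    `chunks` is any iterable of strings, e.g. pieces read from a long stream.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        # The caller decides when to ask for events; nothing calls back into us.
        for event, elem in parser.read_events():
            if elem.tag == "title":
                yield elem.text
            # Drop the element's contents once seen, so a long stream
            # does not pile up as one big tree in memory.
            elem.clear()
```

Something like `list(item_titles(open_stream_in_pieces()))` would then walk a long document serially. Note that this still expects one well-formed document overall; it does not address the "XML Fragments" wish.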
I agree with his views (though I don't use Perl much). But this is *not* the end of XML or anything. Tim just has some thoughts about how the XML APIs in Perl could be better. Not very exciting, perhaps...
As he says in the article, the reason he uses Perl regexps is that the tools/libraries have to read the entire file. If this is a long stream, it's grossly inefficient - you have to load the entire thing into a massive tree structure in memory. If the job can be done serially with regexps without using a noticeable amount of memory or time, then it is often better. This is the point of the article - there is a choice between a method which is often grossly inefficient for real-world problems (XML libs) and a fast but messy method (Perl regexps). Neither of these is really satisfactory, hence the complaint.
If I seem short sighted, it is because I stand on the shoulders of midgets
WTF? Perhaps you could explain more about these two cases. As far as I know, general XML parsers such as Expat do not require unlimited memory to parse any finite input document, nor do they require infinite time.
The Document Type Definition (DTD) system is equivalent to a BNF grammar for XML documents. It's not quite as flexible as a full BNF because it enforces that elements are correctly nested, but I don't see this as a bad thing.
And yes, DTDs are machine readable. Other grammars for XML documents such as DSD, XML Schema or Relax-NG are also machine readable.
Just as with BNF grammars and flex(1), you can take a DTD and generate an efficient parser from it using FleXML.
Comparisons with TeX aren't really appropriate because TeX is a Turing-complete language, and so impossible to parse automatically in 100% of cases (unless you want to allow that your program will sometimes fail to terminate, i.e. hang, on particular input files). I don't know what you mean by your subject line 'Maybe he should have read Knuth'...
-- Ed Avis ed@membled.com
XML isn't a replacement for Java or C++. Neither is HTML. You're looking at three separate areas there.
HTML is a page description language.
C++ and Java are data processing languages.
XML is a data description language.
You can certainly describe a page using XML, and I see no reason why you couldn't construct a programming language using XML syntax, but how on earth are you going to store data in C++ or Java?
XML is just one of the tools in our collective toolbox. Use it where it helps you solve a problem. Don't bother if it doesn't.
I agree that it's about tools and libraries. And this is what I think about them, too.
/m or /s on the end of the regexp.
At work, I brush up against XML occasionally, mostly for documentation or data-resultset purposes. In my own time, I use it in my photo gallery - result-sets from database queries get converted to XML and then spat out through XSLT in Sablotron, straight to web. For all the hoops it goes through, it's actually still quite nippy.
However, I also dislike it intensely.
I've written a blog-like system-news announcement board using a Ruby CGI against PostgreSQL as a backend. I can pull back a result-set - a simple table-thing with each row being a text announcement, half a dozen fields (when posted, by whom, etc). And I wanted to output this in HTML form for the web, in plain-text to send to a user who wanted it via email every day, and in s-exp form for my own gratification.
However, the first problem you run into is the formatting. A textarea in an HTML form gives no line-wrapping (wanted for plaintext output, but only in specific fields) and embeds ^M characters everywhere. When the output is HTML, those ^Ms want to become br tags. When the output is plaintext or sexp, they want to become \n. Simple, if ONLY there were a way of doing either elementary reformatting or search-n-replace in XSLT. There is, but s/// is about 10 lines' worth, if my googling is to be believed. That makes it non-optimal for one of its primary uses: making transformations on big blocks of text-based data, and it can't even edit within a node correctly? Pathetic.
Why shouldn't I just write 3 output methods in my Ruby CGI script that take the result-set directly to text, HTML or sexp formats, with the power of ruby to do a #gsub("^M", "\n") on just the fields I want, in a tiny few extra characters of code?
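For what it's worth, the per-format conversion described above really is only a few lines in a general-purpose language. A rough sketch, in Python rather than Ruby, assuming the ^M characters are carriage returns (`\r`) as submitted by the textarea; `render_field` and the format names are invented for the example:

```python
import html

def render_field(text, fmt):
    """Convert textarea line endings for one output format.

    fmt is "html" for the web page; anything else (plain text, s-exp)
    just gets newline characters.
    """
    # Normalise CRLF and bare CR (the ^M characters) to \n first.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    if fmt == "html":
        # Escape markup characters, then turn each line break into a br tag.
        return html.escape(normalised).replace("\n", "<br>\n")
    return normalised
```

It can be applied to only the fields that need it, which is the part that is awkward to express in XSLT 1.0.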
Now to tackle what you've said:
"Using Perl regexps to parse XML is silly"
No, it's not. Perl regexps are highly featureful, pre-existing code. I'd be surprised if libxml *didn't* use regexps in its XML parsers, frankly.
"e.g. attributes in any order, elements covering multiple lines) that regexps aren't good at handling."
These things are not a problem. You can easily match an attribute occurring, as it does, within an opening tag, and pull out both the name and the contents. Using that to set a variable of a given name in your program - a highly important part, given that XML is a data-transfer format and it's the internal representation afterwards that is its whole raison d'etre - is trivial. Thus, Perl wins.
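As an illustration of the kind of matching meant here, a small sketch using Python's `re` module (the Perl version would look much the same). The helper name `start_tag_attributes` is hypothetical. It copes with attributes in any order and with a tag spread over several lines, but it is only a sketch: it assumes no `>` inside attribute values and ignores comments, CDATA sections and entity references.

```python
import re

# name = "value" or name = 'value'
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*("[^"]*"|'[^']*')""")

def start_tag_attributes(text, name):
    """Return the attributes of the first <name ...> tag as a dict, or None."""
    # [^>]* also matches newlines, so the tag may span several lines.
    match = re.search(r"<%s\b([^>]*)>" % re.escape(name), text)
    if match is None:
        return None
    # Strip the surrounding quotes from each value.
    return {key: value[1:-1] for key, value in ATTR.findall(match.group(1))}
```

Because the attributes end up in a dict, their order in the source does not matter.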
Multi-line matching is explicitly catered for in Perl, with /m or /s on the end of the regexp.
"There's a number of tools and libraries "...
Indeed there are. And you know what? When I've got a small paragraph (under 10 lines) of data that I want to transfer from A to B, the last thing I'm going to do is invoke a 600Kb library so I can use a pompous and fashionable set of functions to produce "XML", when perl/ruby/sh have all had perfectly valid "print" or "echo" commands for the past decade or more. If the output is valid XML, you've no reason to diss the method used to produce it.
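The one catch with print-style output is escaping: the result is only well-formed if `<`, `&` and quotes in the data are handled. A minimal sketch in Python, using the standard `xml.sax.saxutils` helpers; the `announcement` element and `author` attribute are invented for the example:

```python
from xml.sax.saxutils import escape, quoteattr

def announcement_xml(author, body):
    """Build a one-element XML string by plain string formatting."""
    # quoteattr adds the surrounding quotes and escapes the attribute value;
    # escape handles &, < and > in the element text.
    return "<announcement author=%s>%s</announcement>" % (
        quoteattr(author),
        escape(body),
    )
```

For example, `print(announcement_xml("Tim", "a < b & c"))` emits a single well-formed element without building any document tree.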
As a final example, I've also had a few documents of my own to write at work. I've had two options: either sit down and set up emacs to handle XML sources smoothly, so I can open and close tags at the push of a key-chord the way I *want* to create the stuff, or program a small sub-language. Lisp, in the form of _librep_, won the day, with a few small functions to produce strings based on the input. And guess what? Because this is a programming language rather than a mere text-transforming language, I made a CGI out of it, and can embed programs within my "data", too, without feeling the urge to write to the W3C about it.
Editing it is an absolute dream - opening and closing paragraphs of text is a piece of cake and fits the way I want to work. (Maybe you like looking at spiky angle-bracket characters, I dunno.)
In short, "programmed text" won the day for me.
~Tim
--
Rushing on down to the circle of the turn
It is customary to attribute quotations when you publish them. Otherwise it's called plagiarism. Credit where credit is due and all that.
Unless, of course, this particular AC is Rick Jelliffe, in which case I apologize.
Four fifths of all our troubles in this life would disappear if we would just sit down and keep still. -C. Coolidge
Well, I've written my own XML parser, as well as a compiler for a simplified version of C, so I think I'm somewhat qualified to talk on this. A generalized XML parser is not memory intensive, unless you are a very bad programmer. All you need is a depth-first stack, which will be as high as your XML tree is deep. And given that a stack of size N is capable of handling a tree of size X^N (for a branching factor of X), you are definitely going to run out of disk space before you run out of RAM. In other words, the memory required for parsing an XML tree is trivial.
An XML parser is one of the simplest parsers imaginable. It's a sophomore task to create a state machine to process the generic L(1) (or is it L(0)?) XML grammar. And as you should know, a state machine for an L(1) grammar is as fast as you can get.
Anything you do with regular expressions will be much more complicated. As I'm sure you know, regular expressions are turned into state machines before being used to process the input. And almost all regular expression state machines are much more complicated than the state machine you need for an XML parser. In an XML parser, definite boundaries exist on elements such as:
Regular expressions are not this smart. For example, looking for the substring "abc" in the longer string "abababaaabbbabcabababac" is already generating a state machine that is more complicated than that needed for XML parsers.
Back to the "memory" intensive nature of XML parsers. If you parse your XML tree into a nested hashmap structure, then the memory needed will be proportional to the number of nodes in the XML tree. Maybe this is what you meant by "memory intensive". However, this is totally unnecessary. You can easily construct an XML parser to look for the specific elements you care about. Then you only get those elements, and you only need to allocate the memory for the elements required.
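A rough sketch of that idea in Python, using the standard library's SAX interface rather than a hand-written state machine (so this is an assumption about what such a parser might look like, not the parser described above). The class name `Collector` and the helper `collect` are made up. The only bookkeeping is a stack of open element names, which never grows past the depth of the tree, plus storage for the elements actually wanted:

```python
import xml.sax

class Collector(xml.sax.ContentHandler):
    """Keep the text of one kind of element; remember nothing else."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.stack = []       # open element names; as tall as the tree is deep
        self.max_depth = 0
        self.found = []
        self._buffer = None   # text pieces of the wanted element being read

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.max_depth = max(self.max_depth, len(self.stack))
        if name == self.wanted:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.wanted and self._buffer is not None:
            self.found.append("".join(self._buffer))
            self._buffer = None
        self.stack.pop()

def collect(xml_text, wanted):
    """Return (texts of all <wanted> elements, deepest nesting level seen)."""
    handler = Collector(wanted)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.found, handler.max_depth
```

For instance, `collect("<a><b>x</b><c><b>y</b></c></a>", "b")` keeps only the two `b` texts. The sketch does not handle a wanted element nested inside another of the same name.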