XML Co-Creator says XML Is Too Hard For Programmers
orangerobot writes "Tim Bray, one of the co-authors of the original XML 1.0 specification has a new entry on his website explaining why he's been feeling unsatisified lately with XML and says his last experience writing code for handling XML was 'irritating, time-consuming, and error-prone.' XML has always a divided response among the technical community. The anti-XML community has several sites stating their positions."
Well, programming *is* a hard task, and simplifying it is about building layers and layers of better abstractions to machine code and binary data.
Without XML, what would you normally do? Create a flat text file and read it using whatever syntax you'll like that day. I agree XML is ugly as hell to type in manually, but at least it's a standard, and every programming language in use today can handle it in a standard way - DOM, SAX, whatever.
Well, first he chose a bad tool (Perl regexp) for XML processing, and then complains about his tools being insufficient.
Using Perl regexps to parse XML is silly, because there's too much variability (e.g. attributes in any order, elements covering multiple lines) that regexps aren't good at handling. You can do it, of course, but it quickly gets messy.
There's a number of tools and libraries (with Perl or other languages) beyond plain DOM and SAX that use proper XML parsers and are reasonably easy to use. He should use one of those, and stop complaining.
Among other things ...
(1) They need to eliminate the doctype can of worms. Unfortunately, this cries out for an alternative solution for character entities.
(2) Namespaces need to be simplified and better integrated into the core of the language. Expanding on this, there need to be much better mechanisms for modularizing parts of the markup so that it isn't necessary to parse and hold everything in memory to make sense of it.
(3) There needs to be clean-up and standardization of element id's and references, integrating it with (1) and (2).
Do others have more? Should this be done compatibly with XML?
I believe that we really need a standard for arbitrary abstract data models, with XML as just one syntactic representation, but I would have to go into long details to justify this.
The documents are generally displayed as HTML on the web, but they're also read by a couple different programs for different purposes. When I first started here, it was mostly a mess of poorly hand-written HTML, but thankfully there were *only* about 20k documents at the time.
I was charged with the task of writing said programs to read these damn files. Unfortuneately, they weren't all marked up the same...
Now that we have XML and standard libraries for reading XML, it makes handling these documents a snap. Any program that needs to read them can simply have an XML parser plugged into it. The integrity of the documents themselves is maintained by the fact that they don't work if they're not properly marked up. So all these documents work, 100% all the time, and writing programs to read said documents is very simple and not prone to errors.
Yay for XML! :)
So, to sum up, XML is doing what it was meant to do, no less. Unfortuneately, it's also probably doing a bit more as well, XSL anyone? Yeck, why not just have a stand XML scripting language, why the need for the language to be valid XML itself?
Sticking feathers up your butt does not make you a chicken - Tyler Durden
Now, I have to say: a universal syntax for tree-structured data is very useful: experience since the 1970s with one such universal syntax, Lisp, has shown that. It is unfortunate that XML is about the worst imaginable implementation of that idea. XML combines being a nuisance to type with having comparatively complex semantics and lots of redundant features.
What is ironic is that the same "real world programmers" who wax ecstatic about XML also condemn Lisp as too complicated and too difficult to read. The universal syntax that XML aspires to, Lisp syntax delivered many decades ago. It's just that prejudice and ignorance caused people to re-invent the wheel (and in square form, too) in the form of XML.
I am pretty torn between whether XML is a blessing or a curse. We really need something like it, but XML is so bad that it may not even live up to the level of "poorly designed industry standard but better than nothing".
The criticism on XML is accurate, correct, valid, if only for the simple reason that the code needed to interface with the libraries is 90% plumbing-work and 10% business-solution. That 90% plumbing-work leaves oppertunity for _a lot of bugs_ to be created and for any solution using XML to become a resource-hog.
Having a standard interchange format like XML is a fun-thing, and "good", as it allows standardized processing of these formats. However, the article identifies a clear gap in the tooling and that gap needs to be addressed for XML to become a widespread success, instead of another buzzword hype.
We use XML heavily in a project I'm working on at my company. Some genius decided that everything should be in xml, and that we would use XSLT for a lot of the data manipulation. Naturally we also make heavy use of DTD and SAX. Lots of XML related technologies.
I can tell you now that XML is a Bad Thing. It strives to excel at too many things at once, and becomes inefficient and complex as a result.
XML tries to eliminate the step of writing parsers for data, although writing parsers has never been a significant part of application development to begin with. Its rigidity instead forces you to waste time taking the output of the parser (a complex tree) and putting it into meaningful form. XML document tree traversal = 10000x more complex than getting column data out of a ResultSet... Unfortunately it is also a billion times slower to parse XML than it is to perform a medium compexity database query.
The real problem is that XML only partly addresses the problems that relational database solved years ago (organizing and data accessable), but it does it without any of the efficiency benefits of a well designed database server. In my opinion, 90+ percent of the places where XML is being used today would be better served by using columns in a relational database table to store object fields. You get indexing, you get universal, simple and efficient searching, and you get speed.
XML has too many faults to really list in one short post. The truth of the matter is that it tries to do too many things and DOESNT DO ANYTHING WELL. Sort of like if someone tries to be skilled in all musical instruments but ends up being, at best, mediocre in a few of them.
Right, so instead of using one regexp for /etc/hosts and another regexp for /etc/passwd, I'd have to use ten pages of getTheGodDOMObjectFromTheGodDOMXMLFile crap for /etc/hosts.xml and another ten pages for /etc/passwd.xml.
How, exactly, has XML simplified *anything*?
So to be clear, XML is here to stay. (An example of XML penetration: there is a working schema for using XML in the farming industry!) Just imagine the chaos that will insue once MS Office saves all documents in true XML.
My take on the problem Tim's really talking about: inconsistency and the proliferation of people who want to be the next prodigy in their area of expertise. There are so many parsers and interfaces, even within a language domain, because vendors want to put their own spin on everything. The alphabet soup that results confuses the hell out of people. This has even happened in the open source world, where I can do a Google search on "php xml parsing" and read articles on no less than 10 different approaches. For the average guy who has been told by a project manager, "We need to take these XML files from our business partner, extract and store the data in our database," you need a standard approach. Not to stifle thought and innovation, yes, you should take the initiative to understand whether an event-driven approach (SAX parser) or an in-memory object model approach (DOM parser) is right for the job. After all, you do get paid to do this, so earn your keep! But the XML community hasn't done a good job of specifying best practices and leading people by the nose to a solution. Every XML book I've seen furthers the confusion, with each other offering his opinion with a slight variation of how to do things, leading programmers/scripters/whatevers to use the approach they most recently read about, and not necessarily the one that time has proven out to be the most efficient.
Part of this is the divide between the .Net guys, the Java camp, the Perl/PHP folks, etc., but in the spirit of interoperability, maybe the XML promoters just need to dumb things down a bit to get some simple concepts and best practices into the hands of Joe Sixpack Programmer. Maybe a central authority, a la java.sun.com or php.net?
Let's decompose the XML parsing "problem" (if one actually exists) into smaller components that we can reasonably discuss. XML parsing is too broad a topic to intelligently discuss, but if you limit it to XML parsing in Java you suddenly have a topic small enough to be manageable. So let's discuss Java parsing in XML.
When XML was first introduced, there were no standard libraries in the JDK to facilitate parsing. What's more, the few projects out there varied wildly in how you actually used their DOM tree or SAX callback mechanism. This isn't necessarily a Bad Thing (tm), it's the same problem every emerging technology faces: immature tools. This is basic biology - lots of competing implementations (life forms), each struggling for community (resources).
So, time goes by, and eventually a handful of implementations emerge dominant. Some dominate due to performance, and some dominate because of ease of use of the API. The victors in this game then sometimes go through a merging process of their own, where the performance victors lend technology to ease of use API victors. After a lot of merging (and flames usually), one or two projects emerge out of the XML kingdom as the dominant players. In my opinion, in the world of Java these are Xalan (Xerces) and Dom4J.
During the maturation process, Sun comes along and looks at the technology and says "Wow this XML stuff is really here to stay. What implementations are out there, and what similarities exist between them? How can we facilitate growth of these projects?" They realize that certain classes (like org.xml.sax.InputSource) are common entities in both projects (even if the class InputSource doesn't exist), and they standardize it. For a reference to all of the XML standards implemented in the JDK, do a search on java.sun.com for JAXP, JAXM, and JAXB (just to name a few).
At this point, the XML projects come back and work in support so that they can be "JAXP compatible" (again this is part of the biological process of evolution). This insures that the projects works well with whatever Sun ships in the JDK.
In the end (which is really where we are now) you end up with a pluggable architecture, where the JDK provides some common functionality or interfaces that are implemented by open source projects.
Java XML parsing was damn hard back in the day - you had to marry your code to a specific project. But these days with the standardization that has taken place (thanks Sun!), as long as you write code that makes use of the JAXP specification you can plug in any JAXP-compliant parser into your app and things *should* work.
The difficult problem is getting other entities (Application Servers for example) to get up-to-date with the standards. WebLogic 6.1 comes with a non-JAXP compliant parser, and thus doesn't work with the latest JDK, Xalan, etc.
Do it for da shorties