XML Co-Creator says XML Is Too Hard For Programmers
orangerobot writes "Tim Bray, one of the co-authors of the original XML 1.0 specification, has a new entry on his website explaining why he's been feeling unsatisfied lately with XML and says his last experience writing code for handling XML was 'irritating, time-consuming, and error-prone.' XML has always drawn a divided response from the technical community. The anti-XML community has several sites stating their positions."
Sounds like Visual Basic programmers are complaining or something.
This is my sig. The post is over.
They should just be glad not to be coding COBOL, INTERCAL or Befunge!
Note to self: get smarter troll to guard door.
Well, programming *is* a hard task, and simplifying it is about building layers and layers of better abstractions to machine code and binary data.
Without XML, what would you normally do? Create a flat text file and read it using whatever syntax you like that day. I agree XML is ugly as hell to type in manually, but at least it's a standard, and every programming language in use today can handle it in a standard way - DOM, SAX, whatever.
Sure it sucks, but it's a *standard* that everyone can use, and there are many libraries for it so you don't need to write your own parsing code.
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
Well, first he chose a bad tool (Perl regexp) for XML processing, and then complains about his tools being insufficient.
Using Perl regexps to parse XML is silly, because there's too much variability (e.g. attributes in any order, elements covering multiple lines) that regexps aren't good at handling. You can do it, of course, but it quickly gets messy.
There's a number of tools and libraries (with Perl or other languages) beyond plain DOM and SAX that use proper XML parsers and are reasonably easy to use. He should use one of those, and stop complaining.
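To make the attribute-order point concrete, here is a minimal Python sketch (not from the thread; the sample elements, the pattern, and the helper names src_by_regex and src_by_parser are made up for illustration). It compares a naive regexp with the standard library's xml.etree.ElementTree on two equivalent elements:

```python
import re
import xml.etree.ElementTree as ET

# Two equivalent ways of writing the same kind of element (illustrative data).
DOC_A = '<img src="a.jpg" alt="first"/>'
DOC_B = '<img alt="second"\n     src="b.jpg"/>'  # attributes reordered, split over two lines

# A naive pattern that assumes src is always the first attribute.
PATTERN = re.compile(r'<img src="([^"]+)"')

def src_by_regex(doc):
    """Return the src value if the naive pattern matches, else None."""
    match = PATTERN.search(doc)
    return match.group(1) if match else None

def src_by_parser(doc):
    """Return the src attribute using a real XML parser."""
    return ET.fromstring(doc).get("src")
```

With these inputs the regexp finds a.jpg in DOC_A but returns None for DOC_B, while the parser returns the src value for both. A smarter regexp could cope with this one case, but each new variation needs another patch.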
The last book on XML I read and understood was XML for Dummies.
First of all IDNRTA (I Did Not Read The Article)
:)
;)
OK... This is exactly why you SHOULD read the articles... I just posted blatantly off topic due to an annoying quick-read = misread mistake... yay me
Mod me down, I deserve it
.: Max Romantschuk
XML parsing (just like the TeX language) has a problem: when there are errors in the input files, the situation diverges into two different cases, one requiring infinite memory and the other infinite time to deal gracefully with the errors.
None of this would have ever been needed had CS been taught properly. There are other concepts to describe how files are to be organized. Some of the systems date from the 1950's. BNF (which seems to work very well for programmers to describe file formats to other programmers) dates from the early 1960's. What was needed is a BNF type grammar that is machine readable.
Would XML have ever taken off if the web used something sane and not a hacked version of a nasty text formatting system from decades ago?
XML isn't intended for web pages. That's what you missed:
Its biggest use right now is data interchange. Moving bits between one magic widget and another. And for that, HTML sucks. It just can't represent arbitrary data. Programming languages (C++, Java) are for instructions, not data.
XML fits in perfectly where it's at use-wise. Tim Bray is talking about programming for it: The available interfaces are very counter-intuitive, and that's what Bray's getting at.
You mean like most other non-xml config files in /etc, like say hosts, DNS zone files, named.conf, passwd/shadow, hosts.allow/deny, sendmail.mc or resolv.conf (etc. etc.)? These have standard layouts, text-based, can be edited by hand and can be easily parsed.
My point: XML is over-used for a lot of things. In some places it makes sense, but in many places it doesn't.
Did you actually read the article?
I can sum it up very easily:
He's looking for a nicer api for processing XML, he's not looking to replace XML entirely.
Tim Bray thinks that callback based XML apis are a bit awkward to use. He would prefer to use something like a pull parser (see for example http://www.xmlpull.org for examples in java) to the current perl xml apis.
And he would probably want to be able to parse parts of documents ("XML Fragments"), rather than whole documents.
I agree with his views (not using perl too much, though). But this is *not* the end of XML or anything. Tim just has some thoughts about how the xml api could be better in perl. Not very exciting, perhaps...
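For a rough idea of what pull-style code looks like, here is a sketch using Python's standard xml.dom.pulldom rather than the Java xmlpull API mentioned above. The sample document and the entry_titles helper are assumptions for illustration only; the point is that the caller drives an ordinary loop instead of registering callbacks.

```python
from xml.dom import pulldom

# Illustrative input, not taken from the article.
SAMPLE = """<feed>
  <entry id="1"><title>First</title></entry>
  <entry id="2"><title>Second</title></entry>
</feed>"""

def entry_titles(xml_text):
    """Pull events one at a time and expand only the <entry> elements."""
    events = pulldom.parseString(xml_text)
    titles = []
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "entry":
            events.expandNode(node)   # build a small DOM subtree for this entry only
            node.normalize()          # merge adjacent text nodes, just in case
            title = node.getElementsByTagName("title")[0]
            titles.append((node.getAttribute("id"), title.firstChild.data))
    return titles
```

Only the elements the loop asks for are expanded into a tree; everything else streams past as events.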
Among other things ...
(1) They need to eliminate the doctype can of worms. Unfortunately, this cries out for an alternative solution for character entities.
(2) Namespaces need to be simplified and better integrated into the core of the language. Expanding on this, there need to be much better mechanisms for modularizing parts of the markup so that it isn't necessary to parse and hold everything in memory to make sense of it.
(3) There needs to be clean-up and standardization of element id's and references, integrating it with (1) and (2).
Do others have more? Should this be done compatibly with XML?
I believe that we really need a standard for arbitrary abstract data models, with XML as just one syntactic representation, but I would have to go into long details to justify this.
XML is not a stream - it has a hierarchical tree structure, and IMHO is not useful for anything that (a) by its very nature is a continuous stream of data (say, a log file), or (b) wants to be processed as a stream (because it's big, and would require too much memory to be handled as a single data structure).
The problem seems to be that XML is good for portability and standardization, and therefore is abused for things it's not well suited for (the well-known 'if all you have is a hammer, every problem looks like a nail' syndrome).
He's stating that he'd basically like other coders to write more code the way he sees fit.
[quote]
while () {
next if (XX);
if (X|||X)
{ $divert = 'head'; }
elsif (XX)
{ &proc_jpeg($1); }
# and so on...
}
[/quote]
Repeat after me: I will never leave parsing XML up to a regexp especially if my xml may contain CDATA and Comment sections. I will never...
Unless you are 100% certain the file you are parsing is directly under your control, i.e. no comments, no CDATA sections, params always in the same order, same indentation, same bloody encoding [pardon my French], you will just have to access the data using some kind of DOM or abstract tree representation.
I don't think he thinks no one uses XML, he seems to deplore the fact that some people don't get it at all and resort to heavy duty tools for trivial tasks [thus justifying his example above].
Basically XML is quite simple, but that's not the matter, the problem is that XML bundles ACTUAL DATA, it's all about the complexity of those data, not the API used to access it [although writing a DOM implementation is a real pain]
The documents are generally displayed as HTML on the web, but they're also read by a couple different programs for different purposes. When I first started here, it was mostly a mess of poorly hand-written HTML, but thankfully there were *only* about 20k documents at the time.
I was charged with the task of writing said programs to read these damn files. Unfortunately, they weren't all marked up the same...
Now that we have XML and standard libraries for reading XML, it makes handling these documents a snap. Any program that needs to read them can simply have an XML parser plugged into it. The integrity of the documents themselves is maintained by the fact that they don't work if they're not properly marked up. So all these documents work, 100% all the time, and writing programs to read said documents is very simple and not prone to errors.
Yay for XML! :)
So, to sum up, XML is doing what it was meant to do, no less. Unfortunately, it's also probably doing a bit more as well. XSL, anyone? Yeck. Why not just have a standard XML scripting language? Why the need for the language to be valid XML itself?
Sticking feathers up your butt does not make you a chicken - Tyler Durden
arggh!!! fuck'in XML tags!! lol
<?xml version="1.0" encoding="bork">
<troll>
<sovietrussiathing>In SOVIET RUSSIA, XML standardizes YOU!!</sovietrussiathing>
<offtopic>Let's bomb the french!</offtopic>
<flamebait>Anyway, XML is for loosers!</flamebait>
</troll>
XML isn't a replacement for Java or C++. Neither is HTML. You're looking at three separate areas there.
HTML is a page description language.
C++ and Java are data processing languages.
XML is a data description language.
You can certainly describe a page using XML, and I see no reason why you couldn't construct a programming language using XML syntax, but how on earth are you going to store data in C++ or Java?
XML is just one of the tools in our collective toolbox. Use it where it helps you solve a problem. Don't bother if it doesn't.
Now, I have to say: a universal syntax for tree-structured data is very useful: experience since the 1970s with one such universal syntax, Lisp, has shown that. It is unfortunate that XML is about the worst imaginable implementation of that idea. XML combines being a nuisance to type with having comparatively complex semantics and lots of redundant features.
What is ironic is that the same "real world programmers" who wax ecstatic about XML also condemn Lisp as too complicated and too difficult to read. The universal syntax that XML aspires to, Lisp syntax delivered many decades ago. It's just that prejudice and ignorance caused people to re-invent the wheel (and in square form, too) in the form of XML.
I am pretty torn between whether XML is a blessing or a curse. We really need something like it, but XML is so bad that it may not even live up to the level of "poorly designed industry standard but better than nothing".
On the 1st of January, 2003, Bjarne Stroustrup gave an interview to the IEEE's 'Computer' magazine.
Naturally, the editors thought he would be giving a retrospective view of twelve years of object-oriented design, using the language he created.
By the end of the interview, the interviewer got more than he had bargained for and, subsequently, the editor decided to suppress its contents, 'for the good of the industry' but, as with many of these things, there was a leak.
Here is a complete transcript of what was said, unedited, and unrehearsed, so it isn't as neat as planned interviews.
Interviewer: Well, it's been a few years since you changed the world of software design, how does it feel, looking back?
Stroustrup: Actually, I was thinking about those days, just before you arrived. Do you remember? Everyone was writing 'C' and, the trouble was, they were pretty damn good at it. Universities got pretty good at teaching it, too. They were turning out competent - I stress the word 'competent' - graduates at a phenomenal rate. That's what caused the problem.
Interviewer: Problem?
Stroustrup: Yes, problem. Remember when everyone wrote Cobol?
Interviewer: Of course, I did too
Stroustrup: Well, in the beginning, these guys were like demi-gods. Their salaries were high, and they were treated like royalty.
Interviewer: Those were the days, eh?
Stroustrup: Right. So what happened? IBM got sick of it, and invested millions in training programmers, till they were a dime a dozen.
Interviewer: That's why I got out. Salaries dropped within a year, to the point where being a journalist actually paid better.
Stroustrup: Exactly. Well, the same happened with 'C' programmers.
Interviewer: I see, but what's the point?
Stroustrup: Well, one day, when I was sitting in my office, I thought of this little scheme, which would redress the balance a little. I thought 'I wonder what would happen, if there were a language so complicated, so difficult to learn, that nobody would ever be able to swamp the market with programmers?' Actually, I got some of the ideas from X10, you know, X windows. That was such a bitch of a graphics system, that it only just ran on those Sun 3/60 things. They had all the ingredients for what I wanted. A really ridiculously complex syntax, obscure functions, and pseudo-OO structure. Even now, nobody writes raw X-windows code. Motif is the only way to go if you want to retain your sanity.
Interviewer: You're kidding...?
Stroustrup: Not a bit of it. In fact, there was another problem. Unix was written in 'C', which meant that any 'C' programmer could very easily become a systems programmer. Remember what a mainframe systems programmer used to earn?
Interviewer: You bet I do, that's what I used to do.
Stroustrup: OK, so this new language had to divorce itself from Unix, by hiding all the system calls that bound the two together so nicely. This would enable guys who only knew about DOS to earn a decent living too.
Interviewer: I don't believe you said that...
Stroustrup: Well, it's been long enough, now, and I believe most people have figured out for themselves that C++ is a waste of time but, I must say, it's taken them a lot longer than I thought it would.
Interviewer: So how exactly did you do it?
Stroustrup: It was only supposed to be a joke, I never thought people would take the book seriously. Anyone with half a brain can see that object-oriented programming is counter-intuitive, illogical and inefficient.
Interviewer: What?
Stroustrup: And as for 're-useable code' - when did you ever hear of a company re-using its code?
Interviewer: Well, never, actually, but...
Stroustrup: There you are then. Mind you, a few tried, in the early days. There was this Oregon company - Mentor Graphi
You just gave the best argument for adopting XML as widely as possible. Yes, all these can be parsed (with the possible exception of sendmail's config files which may be Turing-complete) but they all require *different* code for each config file. If they were in XML you'd still need different semantic code, of course, but a whole wodge of syntax issues (how do I quote strings, how do I escape newlines, how do I mark nested scopes, what happens when the string delimiter character occurs inside a string, how do I deal with comments, what is the character set, is there a formal grammar for the document, etc etc) would be dealt with. Maybe not in the way that you or I think is perfect - IMHO XML is a little bit verbose compared to say Lisp- or Tcl-style encodings. But they would be dealt with *once*. No need to learn a new or almost-the-same-but-slightly-different set of syntactic conventions for every single config file.
Maybe XML is over-used for a lot of things, but making up your own file format is definitely over-used a lot more. Simple line-oriented files are reasonable to have as plain text, for everything else please avoid the temptation to reinvent the wheel by devising a new syntax and block structure.
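As a small illustration of the "dealt with once" argument, here is a Python sketch using the standard xml.etree.ElementTree. The element names (config, entry) and the roundtrip helper are hypothetical; the point is only that quoting and escaping are the library's job, not the application's.

```python
import xml.etree.ElementTree as ET

def roundtrip(value):
    """Serialize a value into a tiny XML document and read it back.

    Returns (serialized_text, value_read_back).
    """
    root = ET.Element("config")                       # hypothetical element names
    entry = ET.SubElement(root, "entry", {"name": "greeting"})
    entry.text = value                                # no manual escaping needed
    text = ET.tostring(root, encoding="unicode")
    return text, ET.fromstring(text).find("entry").text
```

Calling roundtrip('say "hi" & <bye>') gives back the same string, with the ampersand and angle brackets escaped in the serialized form. An ad hoc line-oriented format would need its own rules for each of those characters.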
-- Ed Avis ed@membled.com
The criticism of XML is accurate, correct, valid, if only for the simple reason that the code needed to interface with the libraries is 90% plumbing-work and 10% business-solution. That 90% plumbing-work leaves opportunity for _a lot of bugs_ to be created and for any solution using XML to become a resource-hog.
Having a standard interchange format like XML is a fun-thing, and "good", as it allows standardized processing of these formats. However, the article identifies a clear gap in the tooling and that gap needs to be addressed for XML to become a widespread success, instead of another buzzword hype.
Admitting something is too hard is too hard for programmers.
Now I'll go read the article.
We use XML heavily in a project I'm working on at my company. Some genius decided that everything should be in xml, and that we would use XSLT for a lot of the data manipulation. Naturally we also make heavy use of DTD and SAX. Lots of XML related technologies.
I can tell you now that XML is a Bad Thing. It strives to excel at too many things at once, and becomes inefficient and complex as a result.
XML tries to eliminate the step of writing parsers for data, although writing parsers has never been a significant part of application development to begin with. Its rigidity instead forces you to waste time taking the output of the parser (a complex tree) and putting it into meaningful form. XML document tree traversal = 10000x more complex than getting column data out of a ResultSet... Unfortunately it is also a billion times slower to parse XML than it is to perform a medium complexity database query.
The real problem is that XML only partly addresses the problems that relational databases solved years ago (organizing data and making it accessible), but it does so without any of the efficiency benefits of a well-designed database server. In my opinion, 90+ percent of the places where XML is being used today would be better served by using columns in a relational database table to store object fields. You get indexing, you get universal, simple and efficient searching, and you get speed.
XML has too many faults to really list in one short post. The truth of the matter is that it tries to do too many things and DOESN'T DO ANYTHING WELL. Sort of like if someone tries to be skilled in all musical instruments but ends up being, at best, mediocre in a few of them.
...and for doing generic markup in a relatively simple way, it's good.
For storing arbitrary data, and use as a message format (as in SOAP), it's not so good because it has markup-like features, such as the distinction between attributes and elements and the distinction between text and element nodes. (The latter in particular is a huge pain, I wish people would agree to only use text nodes in leaf elements.)
This is why XML parsers/generators, once they get into entities and DTDs and so on, become really a lot more complicated than they would need to be if XML just stored a tree of elements.
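The text-versus-element distinction is easy to see in code. Below is a small sketch with Python's standard xml.etree.ElementTree; the sample markup and the text_pieces helper are made up for illustration. In this particular API, text that follows a child element is stored on that child as its "tail", which is exactly the sort of thing that surprises people who just want a tree of data.

```python
import xml.etree.ElementTree as ET

# Mixed content: text and an element side by side (illustrative sample).
MIXED = "<p>Hello <b>big</b> world</p>"

def text_pieces(xml_text):
    """Return (owner, text) pairs showing where ElementTree stores each run of text."""
    root = ET.fromstring(xml_text)
    pieces = [(root.tag + ".text", root.text)]
    for child in root:
        pieces.append((child.tag + ".text", child.text))
        pieces.append((child.tag + ".tail", child.tail))
    return pieces
```

For MIXED this yields "Hello " on p.text, "big" on b.text, and " world" on b.tail. If text were only allowed in leaf elements, none of this bookkeeping would be needed.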
However, it's the standard, so we might as well just shut up and use it.
My opinions have no special importance but it *is* important to remember that XML is a markup format that is being used mostly for things other than markup.
Whence? Hence. Whither? Thither.
It might be too late to correct some things in XML.
The good thing about XML is that, whatever emerges in the future,
it will always be possible to convert old documents into any
new form, using simple tools.
The critics do have a point: unlike LaTeX or HTML, which
can be written easily by hand, XML can become too bloated to
be authored directly by humans.
MathML has a similar problem:
LaTeX: $x^5+3x-9=0$
MathML:
<mrow>
<mrow>
<msup>
<mi>x</mi>
<mn>5</mn>
</msup>
<mo>+</mo>
<mrow>
<mn>3</mn>
<mo>⁢</mo>
<mi>x</mi>
</mrow>
<mo>-</mo>
<mn>9</mn>
</mrow>
<mo>=</mo>
<mn>0</mn>
</mrow>
You can write complicated formulas in LaTeX directly, but it is
almost impossible to do so in MathML, where one has to rely
on tools to generate it (i.e. export it with Mathematica or
TeX -> MathML converters). Wouldn't it be nice if browsers
understood a basic version of LaTeX? (IBM's techexplorer
plugin has shown that this is possible.)
If I understand it correctly, the author is lamenting that neither of the standard ways of parsing XML in a scripting language fits the straightforward model of scanning for something relevant and then acting upon it, where the two models are: 1) read in the whole file and make a tree (takes up too much memory, is slow, etc.); or 2) use a callback interface.
The style of Perl script he was seeking was a simple loop model:
while () {
    next if /ignorable/;
    if (/thing-one/) { ... }
    elsif (/thing-two/) { ... }
    ...
}
To me the thing that distinguishes this the most from the provided XML parsing interfaces is that it has a minimal amount of state.
So isn't what is needed a corresponding structure to the while () above that iterates over the tree-nodes of the XML-encoded data structure, in a depth-first preorder traversal (to avoid having to build the whole tree first)? One could imagine a parser object that scans through the XML file returning nodes (and their parent history) while maintaining an absolute minimum of state. If one wanted to build an in-memory representation of a subtree given a node, then one can always do so when one finds the node one wants.
Such an interface wouldn't be good for integrity verification or the like, but for the sort of application the author was talking about, it would seem ideal. Much less flexible than the normal models, sure, but much easier to work with when the problem fits this sort of description. Perhaps I'm underestimating the difficulty of the task, but it doesn't sound too hard to write, given that it is doing so much less than the fully-featured XML parsing interfaces.
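Something close to this can be sketched with Python's standard xml.etree.ElementTree.iterparse. This is only an illustration under assumptions: the walk generator and its (path, element) output shape are my own invention, not an existing interface, and the only state it keeps is the stack of ancestor tag names.

```python
import io
import xml.etree.ElementTree as ET

def walk(source):
    """Yield (path, element) in document order as each start tag is seen.

    path is a tuple of tag names from the root down to the element.
    At this point the element's attributes are available, but its text
    and children may not have been read yet.
    """
    path = []  # the only state: tag names of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            yield tuple(path), elem
        else:
            path.pop()
            elem.clear()  # drop the finished element's content to limit memory use

# Example: print the path of every element in a small document.
if __name__ == "__main__":
    doc = io.BytesIO(b"<a><b x='1'/><c><d/></c></a>")
    for path, elem in walk(doc):
        print("/".join(path), elem.attrib)
```

A caller that wants a full subtree for one node would have to skip the clear() step for that node and wait for its end event, so this is less than the complete interface described above, but the loop the caller writes has the same shape as the Perl one.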
The other problem is the awkwardness of using XML in O-O languages, as addressed in the article Tim Bray links to in his own. Though I haven't used this particular program, this seems to be the problem that FleXML is trying to address. When you don't need all of the flexibility that XML can provide, but instead have a fixed schema that your XML representation follows, why not have your parser automatically built to read it? People have used lex/flex for scanning text files for decades --- in these days of XML Schema, it should be even easier. If FleXML lives up to its promise, it will be. Has anyone here used FleXML, and is anyone willing to comment on how well it addresses these sorts of problems?
Right, so instead of using one regexp for /etc/hosts and another regexp for /etc/passwd, I'd have to use ten pages of getTheGodDOMObjectFromTheGodDOMXMLFile crap for /etc/hosts.xml and another ten pages for /etc/passwd.xml.
How, exactly, has XML simplified *anything*?
XML got one thing right over unadorned S-expressions - document packaging, specifically versioning and character-set labeling. XML inherited this from SGML, and it's one of the few things it took from there that was actually worth keeping.
For a good laugh, read the Origin and Goals section of the XML spec. Of the ten goals for XML listed there:
XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
I'd say two of them were met, but were bad ideas (SGML compatibility, terseness unimportant), and five of them were completely missed (ease of use, human legibility, quickly designed, formal and concise, ease of creation).
Thirty per cent is a failing grade, folks...
To a Lisp hacker, XML is S-expressions in drag.
So to be clear, XML is here to stay. (An example of XML penetration: there is a working schema for using XML in the farming industry!) Just imagine the chaos that will ensue once MS Office saves all documents in true XML.
My take on the problem Tim's really talking about: inconsistency and the proliferation of people who want to be the next prodigy in their area of expertise. There are so many parsers and interfaces, even within a language domain, because vendors want to put their own spin on everything. The alphabet soup that results confuses the hell out of people. This has even happened in the open source world, where I can do a Google search on "php xml parsing" and read articles on no fewer than 10 different approaches. For the average guy who has been told by a project manager, "We need to take these XML files from our business partner, extract and store the data in our database," you need a standard approach. This is not to stifle thought and innovation; yes, you should take the initiative to understand whether an event-driven approach (SAX parser) or an in-memory object model approach (DOM parser) is right for the job. After all, you do get paid to do this, so earn your keep! But the XML community hasn't done a good job of specifying best practices and leading people by the nose to a solution. Every XML book I've seen furthers the confusion, with each author offering his opinion with a slight variation of how to do things, leading programmers/scripters/whatevers to use the approach they most recently read about, and not necessarily the one that time has proven out to be the most efficient.
Part of this is the divide between the .Net guys, the Java camp, the Perl/PHP folks, etc., but in the spirit of interoperability, maybe the XML promoters just need to dumb things down a bit to get some simple concepts and best practices into the hands of Joe Sixpack Programmer. Maybe a central authority, a la java.sun.com or php.net?
XML is bad like Democracy is bad. It's just better than the alternatives.
I had a problem at work when we switched from AutoCAD to Solidworks. Our manufacturing software couldn't read the new BOM files, which were Excel .xls files. Without ever having looked at our system's BOM files before, I wrote a program that read the .xls and built a proper XML BOM file our system could read. If our system wasn't using XML, who knows how long it would have taken me to figure out the intricacies of a proprietary file format.
OddManIn: A Game of guns and game theory.
This is bad XML design.
This would be better:
<date year="2003" month="3" day="18"/>
I used to think XML was just horribly bloaty and ugly, now I think it's more like VB in that it's easy to make something that's very poorly designed.
> However, the article identifies a clear gap in the tooling and that gap needs to be addressed for XML to become a widespread success, instead of another buzzword hype.
It takes more than a set of good tools for a technology to become 'a widespread success'. A clear justification why XML is better than existing standard marshalling techniques would be a good starting point. ASN.1 DER, simple container LSB serialization and others.
I'm probably beating the dead horse here but XML has at least two properties rendering it useless for any performance-aware application:
(a) Unlike, say, TLV, it does not allow efficiently skipping parts of the data you don't need or aren't aware of. I.e., in order to skip a section, you need to read and parse it first.
(b) XML is a lazy man's ASN.1 DER. It's all there in a much more compact and elegant form. The only 'drawback' in the eyes of the XML crowd is that it's binary. Sure, everyone knows that encoding numbers as strings is a definite way to improve upon the performance and scalability of everything from network protocols (SOAP, BXXP, UPNP) to basic document processing. Right on.
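Point (a) can be shown with a toy tag-length-value reader in Python. The layout here (a 1-byte tag followed by a 4-byte big-endian length) is an assumption picked for the sketch; it is not ASN.1 DER or any particular protocol, and encode_tlv and find_tag are hypothetical helpers.

```python
import struct

HEADER = ">BI"                       # assumed layout: 1-byte tag, 4-byte big-endian length
HEADER_SIZE = struct.calcsize(HEADER)

def encode_tlv(tag, value):
    """Encode one record as tag + length + raw value bytes."""
    return struct.pack(HEADER, tag, len(value)) + value

def find_tag(buf, wanted):
    """Return the value of the first record with the wanted tag, or None.

    Records with other tags are skipped by adding their length to the
    offset; their contents are never inspected.
    """
    offset = 0
    while offset + HEADER_SIZE <= len(buf):
        tag, length = struct.unpack_from(HEADER, buf, offset)
        offset += HEADER_SIZE
        if tag == wanted:
            return buf[offset:offset + length]
        offset += length             # jump over the unwanted value
    return None
```

Skipping a record costs the same whether its value is ten bytes or ten megabytes. With XML, finding the end of an element means scanning its entire content for the matching close tag.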
The bottom line is that XML has probably reached its acceptance limits. Whoever took XML for granted, or is stuck with it, or is not willing to learn about alternatives will keep on whining about the tools being sucky. That's life, but OTOH it's only a small part of it.
3.243F6A8885A308D313
Let's decompose the XML parsing "problem" (if one actually exists) into smaller components that we can reasonably discuss. XML parsing is too broad a topic to intelligently discuss, but if you limit it to XML parsing in Java you suddenly have a topic small enough to be manageable. So let's discuss XML parsing in Java.
When XML was first introduced, there were no standard libraries in the JDK to facilitate parsing. What's more, the few projects out there varied wildly in how you actually used their DOM tree or SAX callback mechanism. This isn't necessarily a Bad Thing (tm), it's the same problem every emerging technology faces: immature tools. This is basic biology - lots of competing implementations (life forms), each struggling for community (resources).
So, time goes by, and eventually a handful of implementations emerge dominant. Some dominate due to performance, and some dominate because of ease of use of the API. The victors in this game then sometimes go through a merging process of their own, where the performance victors lend technology to ease of use API victors. After a lot of merging (and flames usually), one or two projects emerge out of the XML kingdom as the dominant players. In my opinion, in the world of Java these are Xalan (Xerces) and Dom4J.
During the maturation process, Sun comes along and looks at the technology and says "Wow this XML stuff is really here to stay. What implementations are out there, and what similarities exist between them? How can we facilitate growth of these projects?" They realize that certain classes (like org.xml.sax.InputSource) are common entities in both projects (even if the class InputSource doesn't exist), and they standardize it. For a reference to all of the XML standards implemented in the JDK, do a search on java.sun.com for JAXP, JAXM, and JAXB (just to name a few).
At this point, the XML projects come back and work in support so that they can be "JAXP compatible" (again this is part of the biological process of evolution). This ensures that the projects work well with whatever Sun ships in the JDK.
In the end (which is really where we are now) you end up with a pluggable architecture, where the JDK provides some common functionality or interfaces that are implemented by open source projects.
Java XML parsing was damn hard back in the day - you had to marry your code to a specific project. But these days with the standardization that has taken place (thanks Sun!), as long as you write code that makes use of the JAXP specification you can plug in any JAXP-compliant parser into your app and things *should* work.
The difficult problem is getting other entities (Application Servers for example) to get up-to-date with the standards. WebLogic 6.1 comes with a non-JAXP compliant parser, and thus doesn't work with the latest JDK, Xalan, etc.
Do it for da shorties