Slashdot Mirror


XML Co-Creator says XML Is Too Hard For Programmers

orangerobot writes "Tim Bray, one of the co-authors of the original XML 1.0 specification, has a new entry on his website explaining why he's been feeling unsatisfied lately with XML, and says his last experience writing code for handling XML was 'irritating, time-consuming, and error-prone.' XML has always drawn a divided response from the technical community. The anti-XML community has several sites stating their positions."

22 of 562 comments

  1. Re:xml by Pyromage · · Score: 4, Informative

    XML isn't intended for web pages. That's what you missed:

    Its biggest use right now is data interchange. Moving bits between one magic widget and another. And for that, HTML sucks. It just can't represent arbitrary data. Programming languages (C++, Java) are for instructions, not data.

    XML fits in perfectly where it's at use-wise. Tim Bray is talking about programming for it: The available interfaces are very counter-intuitive, and that's what Bray's getting at.

  2. Re:xml by CynicTheHedgehog · · Score: 2, Informative

    When you're writing an application and you have to decide what format messages should be written in, or what type of file configuration data should be stored in, most people say, "Why, XML, of course. That way we're guaranteed that it is extensible, transformable, and readable by anyone who would ever need to read it." Granted, there are lots of other document formats in which that is the case, but they are not industry standard. As long as there is a schema, everyone will accept it. And if it's not in the format that they would like, they are free to run it through an XSL transformation. Easy as pie.

    XML is not hard, but it is a discipline. It requires a lot of reading and a fair amount of practice, but once you have it down, that's it. And from now on, your document storage design decisions (barring any space/memory constraints) are made for you.

  3. Re:xml by BFKrew · · Score: 2, Informative

    On the web, a big problem is that the content of the page is mixed in with the formatting. So, this content cannot be displayed easily on a PDA, phone or even across different browsers to an extent.

    Separating the content from how it is displayed makes it easier to display it in pretty much any format. By taking a single XML document you could create a page that looks great on Mozilla, great on IE, a WAP enabled phone, Opera, Microwave, Fridge - whatever!

    XML is NOT a programming language. It is more like a way of describing data and one MAJOR benefit in my opinion is that it is human as well as machine readable. I can ask my 'pointy haired boss' to make an amendment to an XML document and he will pretty much be able to read it quite easily.

    It has plenty of uses such as a way of sharing data. There is no reason, for example, why an XML source could not be used in other webpages, as an input source for a database, or even as a way of getting output from your C++ program into my Java app, my ASP.NET page or even another C++ program!

  4. Re:Too hard? by Omkar · · Score: 2, Informative

    blah blah blah...right tool for the right task...blah blah blah.
    Seriously, don't knock VB until you need to code a quick dbaccess (or other simple) app in a couple of days for internal use. Easy languages have their places!

  5. Short summary by Anonymous Coward · · Score: 5, Informative

    Tim Bray thinks that callback based XML apis are a bit awkward to use. He would prefer to use something like a pull parser (see for example http://www.xmlpull.org for examples in java) to the current perl xml apis.

    And he would probably want to be able to parse parts of documents ("XML Fragments"), rather than whole documents.

    I agree with his views (not using perl too much, though). But this is *not* the end of XML or anything. Tim just has some thoughts about how the xml api could be better in perl. Not very exciting, perhaps...
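
    A minimal sketch of the difference between the two styles, using Python's standard library rather than the Perl or Java APIs discussed above. The sample document and the function and class names are made up for illustration; xml.sax stands in for a callback API and xml.dom.pulldom for a pull parser.

```python
import xml.sax
from xml.dom import pulldom

# Hypothetical sample document, used only for this illustration.
DOC = ("<feed><entry id='1'><title>First</title></entry>"
       "<entry id='2'><title>Second</title></entry></feed>")


class TitleHandler(xml.sax.ContentHandler):
    """Callback style: the parser drives, so all state lives in the handler."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


def titles_with_callbacks(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles


def titles_with_pull(text):
    """Pull style: our own loop drives and asks the parser for the next event."""
    titles = []
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "title":
            events.expandNode(node)  # build a small DOM for this element only
            titles.append("".join(child.data for child in node.childNodes
                                  if child.nodeType == child.TEXT_NODE))
    return titles
```

    Both functions return the same list of titles. In the pull version the control flow stays in an ordinary loop, which seems to be the property being asked for.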

  6. Re:It's about tools, libraries by kinnell · · Score: 4, Informative

    As he says in the article, the reason he uses Perl regexp is that the tools/libraries have to read the entire file. If this is a long stream, it's grossly inefficient - you have to load the entire thing into a massive tree structure in memory. If the job can be done serially with regexps without using a noticeable amount of memory or time, then it is often better. This is the point of the article - there is a choice between using a method which is often grossly inefficient for real world problems (XML libs) and a fast but messy method (Perl regexp). Neither of these is really satisfactory, hence the complaint.

    --
    If I seem short sighted, it is because I stand on the shoulders of midgets
  7. Re:Maybe he should have read Knuth by Ed+Avis · · Score: 5, Informative
    XML parsing (just like the TeX language) has a problem that when there are problems in the input files, the situation diverges into two different cases: one requires infinite memory and the other infinite time to deal gracefully with errors.

    WTF? Perhaps you could explain more about these two cases. As far as I know, general XML parsers such as Expat do not require unlimited memory to parse any finite input document, nor do they require infinite time.

    The Document Type Description (DTD) system is equivalent to a BNF grammar for XML documents. It's not quite as flexible as a full BNF because it enforces that elements are correctly nested, but I don't see this as a bad thing.

    And yes, DTDs are machine readable. Other grammars for XML documents such as DSD, XML Schema or Relax-NG are also machine readable.

    Just as with BNF grammars and flex(1), you can take a DTD and generate an efficient parser from it using FleXML.

    Comparisons with TeX aren't really appropriate because TeX is a Turing-complete language, and so impossible to parse automatically in 100% of cases (unless you want to allow that your program will sometimes fail to terminate, ie hang, on particular input files). I don't know what you mean by your subject line 'Maybe he should have read Knuth'...

    --
    -- Ed Avis ed@membled.com
  8. WTF? by samael · · Score: 4, Informative

    XML isn't a replacement for Java or C++. Neither is HTML. You're looking at three separate areas there.
    HTML is a page description language.
    C++ and Java are data processing languages.
    XML is a data description language.

    You can certainly describe a page using XML, and I see no reason why you couldn't construct a programming language using XML syntax, but how on earth are you going to store data in C++ or Java?

  9. "Load into memory" vs. "Callbacks" by itsallinthemind · · Score: 4, Informative
    Say what you will about Microsoft - and many of you have - but they really got it right with their XmlReader class in .NET. It streams the document like SAX (the "callback" interface Tim mentions in his comments), but allows the programmer to cursor over the document manually rather than having to handle everything in thrown event handlers (which I agree can be a real headache, especially in highly variable or deeply nested documents.)

    XML is just one of the tools in our collective toolbox. Use it where it helps you solve a problem. Don't bother if it doesn't.

  10. Re:Don't Blink by samael · · Score: 2, Informative

    You can use XSL to translate any XML document into a different format. So your old documents should be convertible.

    If your subdialect keeps changing, that's down to the people defining the syntax, not the language itself.

  11. SSAX by Anonymous Coward · · Score: 1, Informative

    Try the SSAX XML parser: it has the streaminess of SAX and the objectiness of DOM.

    Also neatly illustrates the essential equivalence of XML to a small subset of Lisp.

  12. Re:It's about tools, libraries by PigleT · · Score: 4, Informative

    I agree that it's about tools and libraries. And this is what I think about them, too.

    At work, I brush up against XML occasionally, mostly for documentation or data-resultset purposes. In my own time, I use it in my photo gallery - result-sets from database queries get converted to XML and then spat out through XSLT in Sablotron, straight to web. For all the hoops it goes through, it's actually still quite nippy.

    However, I also dislike it intensely.

    I've written a blog-like system-news announcement board using a Ruby CGI against postgresql as a backend. I can pull back a result-set - a simple table-thing with each row being a text announcement, half a dozen fields (when posted, by whom, etc). And I wanted to output this in HTML form for the web, in plain-text to send to a user who wanted it via email every day, and in s-exp form for my own gratification.
    However, the first problem you run into is the formatting. A textarea in an HTML form gives no line-wrapping (wanted for plaintext output, but only in specific fields) and embeds ^M characters everywhere. When the output is HTML, those ^Ms want to become br tags. When the output is plaintext or sexp, they want to become \n. Simple, if ONLY there were a way of doing either elementary reformatting or search-n-replace in XSLT. There is, but s/// is about 10 lines' worth, if my googling is to be believed. That makes it non-optimal for one of its primary uses: making transformations on big blocks of text-based data, and it can't even edit within a node correctly? Pathetic.
    Why shouldn't I just write 3 output methods in my Ruby CGI script that take the result-set directly to text, HTML or sexp formats, with the power of ruby to do a #gsub("^M", "\n") on just the fields I want, in a tiny few extra characters of code?
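
    The three-output-methods idea can be sketched roughly as follows, in Python rather than the Ruby of the original script. The row layout (a dict with "title" and "body" keys) and all function names are assumptions for illustration; the real result-set has about half a dozen fields.

```python
import html
import textwrap


def normalize_newlines(text):
    """Turn the CR LF (or bare CR) that a textarea submits into plain LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_html(row):
    # Escape first, then turn each line break into a br tag.
    body = html.escape(normalize_newlines(row["body"])).replace("\n", "<br>\n")
    return f"<h2>{html.escape(row['title'])}</h2>\n<p>{body}</p>"


def to_text(row, width=72):
    # Wrap each original line separately so deliberate breaks survive.
    lines = normalize_newlines(row["body"]).split("\n")
    wrapped = "\n".join(textwrap.fill(line, width) for line in lines)
    return f"{row['title']}\n{wrapped}"


def to_sexp(row):
    def quote(s):
        return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

    body = normalize_newlines(row["body"])
    return f"(announcement (title {quote(row['title'])}) (body {quote(body)}))"


# Hypothetical row, as it might come back from the database.
ROW = {"title": "Outage", "body": "Disk full.\r\nFixed & rebooted."}
```

    Each output format gets its own small function, and the newline handling is applied only to the fields that need it.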

    Now to tackle what you've said:

    "Using Perl regexps to parse XML is silly"

    No, it's not. Perl regexps are highly featureful, pre-existing code. I'd be surprised if libxml *didn't* use regexps in its XML parsers, frankly.

    "e.g. attributes in any order, elements covering multiple lines) that regexps aren't good at handling."

    These things are not a problem. You can easily match an attribute occurring, as it does, within an opening-tag, and pull out both the name and the contents. Using that to set a variable of given name in your program - a highly important part, given that XML is a data-transfer format and it's the internal representation afterwards that is its whole raison-d'etre - is trivial. Thus, perl wins.
    Multi-line matching is explicitly catered-for in perl, with /m or /s on the end of the regexp.
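
    A rough sketch of that kind of attribute matching, written with Python's re module instead of Perl. The patterns and the helper name are illustrative assumptions. They handle attributes in any order and tags spread over several lines, but they deliberately ignore comments, CDATA sections, entities and namespace prefixes on element names, so this is not a general XML parser.

```python
import re

# An opening tag: element name, then zero or more quoted attributes.
# The \s and [^"] classes already match newlines, so multi-line tags work.
OPEN_TAG = re.compile(r"""
    <([A-Za-z_][\w.-]*)            # element name
    ((?:\s+[\w:.-]+\s*=\s*         # attribute name and the equals sign
        (?:"[^"]*"|'[^']*'))*)     # value in double or single quotes
    \s*/?>                         # optional self-closing slash
""", re.VERBOSE)

# One attribute inside the raw attribute text captured above.
ATTRIBUTE = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def attributes_of(tag_name, text):
    """Return the attributes of the first opening tag named tag_name, or None."""
    for match in OPEN_TAG.finditer(text):
        if match.group(1) == tag_name:
            return {name: double or single
                    for name, double, single in ATTRIBUTE.findall(match.group(2))}
    return None
```

    For example, attributes_of("photo", text) gives the same dictionary whether the tag is written as <photo id="7" title="Dawn"/> or with the attributes reversed and split across lines.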

    "There's a number of tools and libraries "...

    Indeed there are. And you know what? When I've got a small paragraph (under 10 lines) of data that I want to transfer from A to B, the last thing I'm going to do is invoke a 600Kb library so I can use a pompous and fashionable set of functions to produce "XML", when perl/ruby/sh have all had perfectly valid "print" or "echo" commands for the past decade or more. If the output is valid XML, you've no reason to diss the method used to produce it.

    As a final example, I've also had a few documents to be writing, of my own, at work. I've had two options: either sit down, set up emacs to handle XML sources smoothly so I can open and close tags at the push of a key-chord the way I *want* to create the stuff, or program a small sub-language. Lisp, in the form of _librep_, won the day, with a few small functions to produce strings based on the input. And guess what? Because this is a programming language rather than a mere text-transforming language, I made a CGI out of it, and can embed programs within my "data", too, without feeling the urge to write to the W3C about it.
    Editing it is an absolute dream - opening and closing paragraphs of text is a piece of cake and fits the way I want to work. (Maybe you like looking at spikey angle-bracket characters, I dunno.)
    In short, "programmed text" won the day for me.

    --
    ~Tim
    --
    .|` Clouds cross the black moonlight,
    Rushing on down to the circle of the turn
  13. XML parsing models by HalfFlat · · Score: 3, Informative

    If I understand it correctly, the author is lamenting that neither of the standard ways of parsing XML in a scripting language fit the straightforward model of scanning for something relevant and then acting upon it, where the two models are: 1) read in whole file and make a tree (takes up too much memory, is slow, etc.); or 2) use a callback interface.

    The style of perl script he was seeking was a simple loop model:
    while (<>) {
        next if /ignorable/;
        if (/thing-one/) { ... }
        elsif (/thing-two/) { ... }
        ...
    }

    To me the thing that distinguishes this the most from the provided XML parsing interfaces is that it has a minimal amount of state.

    So isn't what is needed a corresponding structure to the while (<>) above that iterates over the tree-nodes of the XML-encoded data structure, in a depth-first preorder traversal (to avoid having to build the whole tree first)? One could imagine a parser object that scans through the XML file returning nodes (and their parent history) while maintaining an absolute minimum of state. If one wanted to build an in-memory representation of a subtree given a node, then one can always do so when one finds the node one wants.

    Such an interface wouldn't be good for integrity verification or the like, but for the sort of application the author was talking about, it would seem ideal. Much less flexible than the normal models, sure, but much easier to work with when the problem fits this sort of description. Perhaps I'm underestimating the difficulty of the task, but it doesn't sound too hard to write, given that it is doing so much less than the fully-featured XML parsing interfaces.
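
    Something close to this interface can be sketched with ElementTree.iterparse from Python's standard library. The function name, the "wanted" parameter and the sample document are assumptions for illustration. One difference from the description above: elements are yielded on their end events, once complete, so the traversal is not strictly preorder.

```python
import io
import xml.etree.ElementTree as ET


def find_elements(source, wanted):
    """Yield (parent_history, element) for each element whose tag is in wanted.

    Only the names of the currently open elements are kept as state.
    Anything outside a wanted element is cleared as soon as it closes.
    """
    path = []    # tags of the currently open elements
    inside = 0   # number of wanted elements currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if elem.tag in wanted:
                inside += 1
            continue
        if elem.tag in wanted:
            inside -= 1
            yield tuple(path[:-1]), elem  # complete subtree plus its ancestors
        path.pop()
        if inside == 0:
            elem.clear()  # drop text and children that are no longer needed


# Hypothetical input, standing in for a long log-like document.
DOC = (b"<log><day date='2003-03-18'><hit url='/a'/><hit url='/b'/></day>"
       b"<note>skip</note></log>")

hits = [(parents, hit.get("url"))
        for parents, hit in find_elements(io.BytesIO(DOC), {"hit"})]
```

    The loop in the caller looks much like the while loop above. Note that cleared elements still leave empty shells attached to their parents, so memory is reduced rather than strictly constant.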

    The other problem is the awkwardness of using XML in O-O languages, as addressed in the article Tim Bray links to from his own. Though I haven't used this particular program, this seems to be the problem that FleXML is trying to address. When you don't need all of the flexibility that XML can provide, but instead have a fixed schema that your XML-representation follows, why not have your parser automatically built to read it? People have used lex/flex for scanning text files for decades --- in these days of XML Schema, it should be even easier. If FleXML lives up to its promise, it will be. Has anyone here used FleXML and is willing to comment on how well it addresses these sorts of problems?

  14. Re:Maybe he should have read Knuth by Sique · · Score: 2, Informative

    Comparisons with TeX aren't really appropriate because TeX is a Turing-complete language, and so impossible to parse automatically in 100% of cases (unless you want to allow that your program will sometimes fail to terminate, ie hang, on particular input files). I don't know what you mean by your subject line 'Maybe he should have read Knuth'...

    Maybe you should read Knuth also... There are two different things: One is the grammar and the other one is the language. You can write a Turing-complete language in a regular grammar (Chomsky Type 3), completely parseable with regexp (think: (([linenumber] ((INC|DEC) [register])|JMZ [linenumber])[newline])*). You can also write a primitive-recursive language using a free grammar (Chomsky Type 0) (think: your average English book about primitive-recursive languages), which is unparseable within finite time and memory.

    So TeX is a Turing-complete language written in a Chomsky Type 1 grammar (It should be LL2, but I am not sure). XML for itself is a Turing-incomplete way to describe Chomsky Type 2 grammars.

    --
    .sig: Sique *sigh*
  15. Re:It's about tools, libraries by Len · · Score: 3, Informative
    Generic XML parsers are memory intensive and can't be as fast as regular expressions. That's just computer science. Deal with it.

    You're right, but the problem is that "deal with it" may equate to "don't use XML" in a lot of cases, which makes XML less of the universal data representation language than it wants to be.

    When the parser uses a lot of memory (like DOM reading the entire input into a tree) it becomes inefficient, sometimes infeasible, to handle large input documents. That's one of the specific problems mentioned by Tim Bray and others.

  16. Re:Meta XML by rabidcow · · Score: 3, Informative

    This is bad XML design.

    This would be better:
    <date year="2003" month="3" day="18"/>
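
    As a small illustration of why the attribute form is convenient to consume, here is a sketch using Python's standard library. The helper name is made up. Note that XML requires attribute values to be quoted, so the sketch uses the quoted form; an unquoted version is rejected by the parser.

```python
import datetime
import xml.etree.ElementTree as ET


def parse_date(xml_text):
    """Build a date from a <date year=".." month=".." day=".."/> element."""
    elem = ET.fromstring(xml_text)
    return datetime.date(int(elem.get("year")),
                         int(elem.get("month")),
                         int(elem.get("day")))


def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

    All three fields come back from one element with no child traversal, which is presumably the point of the redesign.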

    I used to think XML was just horribly bloaty and ugly, now I think it's more like VB in that it's easy to make something that's very poorly designed.

  17. Re:But XML is great for computers... by rabidcow · · Score: 2, Informative

    most other non-xml config files in /etc, like say hosts, DNS zone files, named.conf, passwd/shadow, hosts.allow/deny, sendmail.mc or resolv.conf (etc. etc.)

    all these can be parsed but they all require *different* code for each config file.

    Nonsense, if you're smart about your parser, you'll need about 3. If you're not smart about your parser, you'd probably design lousy XML anyway.

    how do I quote strings, how do I escape newlines, how do I mark nested scopes, what happens when the string delimiter character occurs inside a string, how do I deal with comments, what is the character set, is there a formal grammar for the document, etc etc

    afaik, most config files ignore these issues, but you could easily separate these options from the core of the parser. Pass them in as a traits class or something.
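
    A minimal sketch of the "pass the options in as traits" idea, in Python. The Traits fields and the two presets are assumptions chosen for illustration, and, like the config files described above, the parser ignores quoting, escaping and nested scopes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Traits:
    """Per-format options, kept separate from the parsing loop."""
    comment: str = "#"               # empty string means "no comments"
    separator: Optional[str] = None  # None means "split on runs of whitespace"


def parse(lines, traits):
    """Return one list of fields per non-empty, non-comment line."""
    records = []
    for line in lines:
        if traits.comment:
            line = line.split(traits.comment, 1)[0]
        line = line.strip()
        if line:
            records.append(line.split(traits.separator))
    return records


# Hypothetical presets for two of the formats mentioned above.
HOSTS = Traits(comment="#", separator=None)
PASSWD = Traits(comment="", separator=":")
```

    The same parse function then handles both a hosts-style file and a passwd-style file; only the traits object changes.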

  18. Re:Really? by stand · · Score: 4, Informative

    It is customary to attribute quotations when you publish them. Otherwise it's called plagiarism. Credit where credit is due and all that.

    Unless, of course this particular AC is Rick Jelliffe, in which case I apologize.

    --
    Four fifths of all our troubles in this life would disappear if we would just sit down and keep still. -C. Coolidge
  19. Re:Maybe he should have read Knuth by Minna+Kirai · · Score: 2, Informative

    I don't know what you mean by 'in theory'. A finite input file requires finite resources. Period.

    He probably means "taken to the limit". A way of characterizing the performance of a system: how does it fail when faced with an overwhelming amount of work? (It's like O-notation, which assumes the problem size is infinite to eliminate lower-order effects from the description.)

    An infinite input file could require infinite memory to parse it. So what?

    The intention probably was to point out that a program which extracts from a non-XML database can be written to use constant memory, regardless of file size (or log memory, to be pedantic). Whereas with XML, the memory used increases as long as the file size does.

    (There are tricks which can reduce the memory use, but they usually come down to making assumptions about the formatting of the file, which can lead to skipping over malformed XML chunks)

    (I'm not espousing those views, just attempting to translate for you)

  20. Re:It's about tools, libraries by Loma · · Score: 5, Informative
    You have used many big words, and you may have your language levels incorrect, but you are clearly wrong in one respect:

    Generic XML parsers are memory intensive and can't be as fast as regular expressions. That's just computer science. Deal with it.


    Well, I've written my own XML parser, as well as a compiler for a simplified version of C, so I think I'm somewhat qualified to talk on this. A generalized XML parser is not memory intensive, unless you are a very bad programmer. All you need is a depth-first stack, which will be as high as your XML tree is deep. And given that a stack of size N is capable of handling a tree of size X^N, you are definitely going to run out of disk space before you run out of RAM. In other words, the memory required for parsing an XML tree is trivial.

    An XML parser is one of the simplest parsers imaginable. It's a sophomore task to create a state machine to process the generic L(1) (or is it L(0)?) XML grammar. And as you should know, a state machine for an L(1) grammar is as fast as you can get.

    Anything you do with regular expressions will be much more complicated. As I'm sure you know, regular expressions are turned into state machines before being used to process the input. And almost all regular expression state machines are much more complicated than the state machine you need for an XML parser. In an XML parser, definite boundaries exist on elements such as:
    '<' and '>'


    Regular expressions are not this smart. For example, looking for the substring "abc" in the longer string "abababaaabbbabcabababac" is already generating a state machine that is more complicated than that needed for XML parsers.

    Back to the "memory" intensive nature of XML parsers. If you parse your XML tree into a nested hashmap structure, then the memory needed will be proportional to the number of nodes in the XML tree. Maybe this is what you meant by "memory intensive". However, this is totally unnecessary. You can easily construct an XML parser to look for the specific elements you care about. Then you only get those elements, and you only need to allocate the memory for the elements required.
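
    The stack argument can be checked with a small sketch built on the expat bindings in Python's standard library. The function names and the generated document are assumptions for illustration. The parser is fed one chunk at a time, the only structural state is the stack of open element names, and memory otherwise grows only with the elements we chose to keep.

```python
import xml.parsers.expat


def scan(chunks, wanted):
    """Stream XML chunks; return (texts of wanted elements, deepest nesting seen)."""
    stack = []      # names of the currently open elements
    found = []      # text of each wanted element
    max_depth = 0

    def start(name, attrs):
        nonlocal max_depth
        stack.append(name)
        max_depth = max(max_depth, len(stack))
        if name == wanted:
            found.append("")

    def end(name):
        stack.pop()

    def chars(data):
        # Text may arrive in pieces, so append to the current entry.
        if stack and stack[-1] == wanted:
            found[-1] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    for chunk in chunks:
        parser.Parse(chunk, False)
    parser.Parse("", True)  # signal end of input
    return found, max_depth


def big_document(rows):
    """Generate a wide but shallow document without holding it in memory."""
    yield "<table>"
    for i in range(rows):
        yield f"<row><id>{i}</id><junk>x</junk></row>"
    yield "</table>"


ids, depth = scan(big_document(1000), "id")
```

    Here the document has a thousand rows but the stack never grows past three entries, which matches the claim that the stack is as high as the tree is deep.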
  21. Re:Too hard? by Tet · · Score: 2, Informative
    Motorola 6809 ASM, and Motorola 6502 ASM.

    Of course, while the 6809 was indeed a Motorola chip, the 6502 was made by MOS (a company started by former Motorola employees). The initial 6501 was pin compatible with the 6800, and Motorola sued, resulting in the 6502, which had a different pin layout.

    Other than that, I agree with your comments.

    --
    "The invisible and the non-existent look very much alike." -- Delos B. McKown
  22. Re:Maybe he should have read Knuth by J.+Random+Software · · Score: 2, Informative

    It's fairly common to comment out markup when hand-editing, since <![IGNORE[...]]> can't be used within a document. Skipping non-markup in the document should be just a matter of matching the Perl regex

    ( [^<]+ | <!--.*?--> | <\?.*?\?> | <!\[CDATA\[.*?\]\]> )* <foo

    If someone else defines a foo element in a different namespace, I don't see how you can do anything other than ignore it--it's almost certainly not what you were looking for, and you have no idea what it might mean.
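
    The regex above translates almost directly to Python's re module; a sketch follows. The literal square brackets of the CDATA section are escaped, re.VERBOSE plays the role of Perl's /x and re.DOTALL the role of /s. The lookahead after foo is an addition of this sketch, so that an element such as foobar is not mistaken for foo. The sample fragment is made up.

```python
import re

SKIP_TO_FOO = re.compile(r"""
    (?: [^<]+                     # character data
      | <!--.*?-->                # comment, possibly hiding markup
      | <\?.*?\?>                 # processing instruction
      | <!\[CDATA\[.*?\]\]>       # CDATA section
    )*
    <foo(?=[\s/>])                # the first real foo start tag
""", re.VERBOSE | re.DOTALL)

# Hypothetical fragment with three decoys before the real element.
FRAGMENT = ('text <!-- <foo old="1"> --> <?pi <foo ?> '
            '<![CDATA[ <foo> ]]> <foo id="real">')

match = SKIP_TO_FOO.match(FRAGMENT)
rest = FRAGMENT[match.end():] if match else None
```

    The match skips the commented-out, processing-instruction and CDATA occurrences and stops just after the real <foo. It only skips non-markup, so it has to be applied at a position where no other element intervenes before foo.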