Slashdot Mirror


Which XML Parser Do You Recommend?

tshieh asks: "I'm trying to add XML-configurability to a Java application, and I'm trying to figure out which XML parser I should use. Any thoughts on whether I would be better off using Xerces, expat, XML4J, or JDOM (or any others)? So far, I've decided to use DOM rather than SAX since I've heard that DOM is easier to use and I don't anticipate my configuration file becoming so large that the slower and more memory-intensive DOM parsing becomes an issue."

12 of 17 comments

  1. XML/XSLT parsers by shrike · · Score: 2

    There are several good XML parsers, some free, some commercial. Have a look at the following URLs for more info on free versions:

    xml.apache.org
    users.iclway.co.uk/mhkay/saxon
    www.jclark.com/xml

    I hope this is of some use to you.

    1. Re:XML/XSLT parsers by trsheph · · Score: 2

      Resin is a web server that can run JSP/Servlets like Tomcat, but it can also parse XML (you have to name *.xml files *.xtp in order for them to be parsed). It is located here.

  2. Re:SAX or DOM by msuzio · · Score: 2

    The most common SAX idiom I've seen (holds true for all event-based parsers, including XML::Parser in Perl when I was using that) is to do event handling, building up intermediate data structures as branches of the total tree are evaluated.

    For example, given this:

    Bob
    admin

    I'm going to hit the beginning of the "object" tag, and create an intermediate data structure (let's call it ObjectNode). As I continue to hit the start of the param tag, I'll pass that information on to the current object node. Once I hit the end tag of object, I pop that node off the stack and transform it into the final object I'm interested in (in this case, I finally end up creating a com.razorsys.User object from my little XML-object-serialization example).
    What this shows is a way to do the sort of "remember where I was" stuff DOM gives you, without creating an entire tree. I believe some parsers out there actually do this for DOM parsing -- they really underneath have a pseudo-DOM, and build full branches only when they are requested. Much quicker and less memory intensive.

    Anyway... I endorse use of SAX. I'd stay away from JDOM or JAXP, but that's only because I like to stay language-neutral when it makes sense. Using just SAX means I can probably move my parsing code to a variety of languages and keep most of the logic...
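
    The stack-based SAX idiom described above might look roughly like the following in Java. This is only a sketch: the markup in the original example was lost, so the attribute names ("class", "name") and the ObjectNode layout here are assumptions, and the finished object is left as a plain map rather than a real com.razorsys.User.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxStackSketch {
    // Hypothetical intermediate structure: collects params until </object> is seen.
    static final class ObjectNode {
        final String className;
        final Map<String, String> params = new LinkedHashMap<>();
        ObjectNode(String className) { this.className = className; }
    }

    static final class Handler extends DefaultHandler {
        final Deque<ObjectNode> stack = new ArrayDeque<>();
        final List<Map<String, String>> finished = new ArrayList<>();
        private String currentParam;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("object")) {
                // Start of an object: push a fresh intermediate node.
                stack.push(new ObjectNode(atts.getValue("class")));
            } else if (qName.equals("param")) {
                currentParam = atts.getValue("name");
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (currentParam != null) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("param")) {
                stack.peek().params.put(currentParam, text.toString());
                currentParam = null;
            } else if (qName.equals("object")) {
                // End of the object: pop the node; real code would build the final object here.
                finished.add(stack.pop().params);
            }
        }
    }

    static List<Map<String, String>> parse(String xml) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.finished;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<object class=\"com.razorsys.User\">"
                + "<param name=\"name\">Bob</param>"
                + "<param name=\"role\">admin</param>"
                + "</object>";
        System.out.println(parse(xml));
    }
}
```

    Only the node currently being built lives on the stack, so memory use follows nesting depth rather than document size.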

  3. Crimson by dgenr8 · · Score: 2

    I'm surprised nobody mentioned Crimson (scroll down to the bottom of the directory).

    This Java parser is open source, under the Apache project. It was developed by Sun under the name of JAXP and is far more lightweight than IBM's Xerces, mainly because it uses built-in Java I/O instead of reinventing the wheel as Xerces does (with its ChunkyByteArray etc.).

    For whatever reason Sun plans to adopt Xerces as the official JAXP implementation. I dunno why... I think Crimson is much better for a lot of applications.

  4. SAX or DOM by geophile · · Score: 4

    I've been using Xerces and have had no problems with it. Haven't tried the others.

    SAX is really simple to use. You write event handlers, where an event is something like "start of document", "end of foobar tag", etc.

    If you can write your application using the SAX event model, then you'd find DOM too complex. DOM is good when you need the entire document as a tree structure, e.g. to do some analysis.

    If you use SAX, and you find yourself writing a lot of code to remember what you've seen, then you should probably consider DOM. If you can look at each tag or data item once and then forget about it, SAX is the way to go.
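
    For the "look at each tag once and then forget about it" case, a SAX handler for a small config file could be sketched like this. The <setting name="..." value="..."/> layout is a made-up example, not something from the question; only the standard JAXP and SAX classes shipped with the JDK are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxConfigSketch {
    // Reads every <setting name="..." value="..."/> element into a map.
    static Map<String, String> readSettings(String xml) throws Exception {
        Map<String, String> settings = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Each element is handled as it streams past; nothing else is remembered.
                if (qName.equals("setting")) {
                    settings.put(atts.getValue("name"), atts.getValue("value"));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return settings;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config>"
                + "<setting name=\"port\" value=\"8080\"/>"
                + "<setting name=\"debug\" value=\"false\"/>"
                + "</config>";
        System.out.println(readSettings(xml));
    }
}
```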

  5. JDOM is by far the easiest by krakan · · Score: 2

    First of all DOM doesn't parse files; it normally uses SAX for that. DOM is an, IMHO, rather complicated model to access the data once it's been parsed.

    As your goal is to build a java application, I'd definitely recommend JDOM because it is lightweight and the data is stored in normal java constructs.

    /J

  6. Another anti-DOM voice. by Bazzargh · · Score: 2

    All the XML parsers I've tried have been fine (apart from one - guess whose? see below). Other posters have already pointed out that SAX/DOM/JDOM are not parsers as such, but there's a couple of other things you should know:

    - if you write your code using the JAXP api you should be independent of your choice of parser (it lets you get hold of one, which you then use SAX or DOM with, without explicitly mentioning the other package)
    - Beware of non-standard features. For example, some XML parsers (like Oracle's) let you reparent a DOM node to a different document. This is not supposed to be legal. (because the underlying representations of the nodes may differ; eg you could have just attached part of a (text) document to a one-row-per-element database representation)
    - The DOM is chock full of gotchas. Like, you MUST remember to normalize an element before grabbing the first child node and expecting it to be text (you see this mistake made *a lot*). Some parsers will e.g. split lines of text into separate text nodes, each of which must be retrieved separately if you haven't normalized the element.
    - upshot is that SAX is very easy to use in comparison. You can also very easily build an application in terms of SAX filters; with the DOM you can find yourself spending all your time writing stuff to traverse bits of the tree.
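
    The normalize gotcha from the list above can be reproduced without relying on any particular parser's splitting behaviour. In this sketch the two adjacent text nodes are created by hand (an assumption made so the result is deterministic), and the builder is obtained through JAXP, so no parser package is named anywhere in the code.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomNormalizeSketch {
    // Builds <greeting> with its text deliberately split across two adjacent text nodes.
    static Element splitTextElement() throws Exception {
        // JAXP picks whichever DOM implementation is configured.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element greeting = doc.createElement("greeting");
        doc.appendChild(greeting);
        greeting.appendChild(doc.createTextNode("Hello, "));
        greeting.appendChild(doc.createTextNode("world"));
        return greeting;
    }

    // The common mistake: assume the first child holds all of the element's text.
    static String firstChildText(Element e) {
        return e.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        Element e = splitTextElement();
        System.out.println("before normalize: " + firstChildText(e));
        e.normalize(); // merges adjacent text nodes into one
        System.out.println("after normalize:  " + firstChildText(e));
    }
}
```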

    So what was the bad XML parser? Consider the following scenario: an XML document gets stored as a string in a database. An application reads the string back, and tries to parse the document. It's supposed to infer the character encoding by looking at the chars or the explicit 'encoding=' attribute of the xml PI. However, in VB, the msxml parser will only respect encodings when reading from _files_. Strings are stored internally in MS-land as DBCS, and it will silently convert the characters in your XML as it reads from the DB. Thus, as soon as it sees an encoding of (eg) ISO-8859-1 in the XML string, it barfs and dies. The original data was stored by the DB in the correct charset; it seems to be the internal conversion to DBCS that screws up. Neat. Not.
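
    For comparison only, here is how the same bytes-versus-string distinction plays out with the JAXP parser bundled in the JDK; it says nothing about msxml itself. A byte stream lets the parser honour the declared encoding, while a character stream has already been decoded, so the declaration is ignored rather than rejected. The sample document is an assumption made for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EncodingSketch {
    // Sample document declaring ISO-8859-1 and containing one non-ASCII character.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";

    static DocumentBuilder builder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Byte input: the parser reads the declaration and decodes the bytes itself.
    static String textFromBytes(byte[] bytes) throws Exception {
        return builder().parse(new ByteArrayInputStream(bytes))
                .getDocumentElement().getTextContent();
    }

    // Character input: the text is already decoded, so the declared encoding is not applied.
    static String textFromString(String xml) throws Exception {
        return builder().parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] stored = XML.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textFromBytes(stored));
        System.out.println(textFromString(XML));
    }
}
```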

  7. Xerces by Simon+Brooke · · Score: 3

    Says it all. I've used them all, and in my experience Xerces is reliable, as fast as any, and has the right license.

    --
    I'm old enough to remember when discussions on Slashdot were well informed.

    1. Re:Xerces by Steve+Cox · · Score: 2

      For our current project, I have chosen Xerces (using DOM). Like the guy says - fast and reliable.

      Also bear in mind that Xerces is a validating parser - I have no idea if the other parsers mentioned are or not, but Xerces certainly is. Using a DTD and letting Xerces take care of the majority (all) of the validation of your XML data is great.
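
      DTD validation of this kind can be switched on through JAXP roughly as follows. This is a sketch using whatever parser the JDK provides (not necessarily a standalone Xerces install); the inline DTD and the <setting> elements are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Hypothetical DTD, kept inline so the example needs no external files.
    static final String DTD =
            "<!DOCTYPE config [\n"
          + "  <!ELEMENT config (setting*)>\n"
          + "  <!ELEMENT setting EMPTY>\n"
          + "  <!ATTLIST setting name CDATA #REQUIRED value CDATA #REQUIRED>\n"
          + "]>\n";

    // Parses DTD + body with validation on and returns the validity errors reported.
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // ask for a DTD-validating parser
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<config><setting name=\"port\" value=\"8080\"/></config>"));
        System.out.println(validate("<config><setting name=\"port\"/></config>"));
    }
}
```

      Validity errors are recoverable, so they arrive through error() and parsing continues; only malformed XML reaches fatalError().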

      I started using Xerces for a Java project, and eventually required a C++ element to it - the Xerces XML parser is also available in a C++ version and the API is practically identical to that of the Java version (incl. smart pointer based access to allow a Java style memory management and parameter passing). This made my job one hell of a lot easier - rigorous use of Ctrl+C, Ctrl+V.

      Steve.

  8. JDOM by bnenning · · Score: 4

    You might want to check out JDOM. It uses the parser of your choice to build a tree structure like DOM, but it has a much simpler API. From their mission page:

    There is no compelling reason for a Java API to manipulate XML to be complex, tricky, unintuitive, or a pain in the neck. JDOM is both Java-centric and Java-optimized. It behaves like Java, it uses Java collections, it is a completely natural API for current Java developers, and it provides a low-cost entry point for using XML.

    We've been using it at work and are happy with it so far.

    --
    How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.

  9. DOM sucks by ikekrull · · Score: 3

    Personally, I find SAX easier to use than DOM.

    It also takes a whole lot less memory and time to parse than the DOM approach.

    We process XML files that can be up to 10MB in size, and DOM parsing these files brings my 500MHz P3 to its knees (yes, I have increased the JVM heap size).

    Parsing with SAX, however, has proved simple, clean and easy with no performance problems at all.

    If you definitely want to use the DOM, try NanoXML. It's much smaller than Xerces and the like, and is perfect for parsing config files etc., as well as being small enough (6KB or so) for client-side and embedded use.

    However, for an all-round, no compromise XML parsing solution, then Xerces is pretty good.

    --
    I gots ta ding a ding dang my dang a long ling long

  10. DOM vs SAX by dingbat_hp · · Score: 2

    Neither DOM nor SAX really specify a parsing mechanism. Rather they specify the means by which the parser exposes the parsed content to the client app. SAX is events, DOM is the whole thing, in a bucket.
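
    To make the "events versus the whole thing in a bucket" distinction concrete, the sketch below feeds the same document to the JDK's JAXP parser twice, once through each interface. The sample document and the element-counting task are assumptions chosen only to keep the comparison short.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TwoViewsSketch {
    // SAX: the parser pushes one event per start tag; we only keep a running count.
    static int countWithSax(String xml) throws Exception {
        int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes atts) {
                        count[0]++;
                    }
                });
        return count[0];
    }

    // DOM: the whole tree is built first, then queried after parsing has finished.
    static int countWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("*").getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config><db host=\"localhost\"/><cache size=\"64\"/></config>";
        System.out.println("SAX saw " + countWithSax(xml) + " elements");
        System.out.println("DOM holds " + countWithDom(xml) + " elements");
    }
}
```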

    DOM is nearly always easier to work with, except when it isn't. Then it becomes completely unworkable (usually due to either huge documents, or long time delays) and you use SAX because you've simply no other choice.

    JDOM? I've never felt the slightest need for it myself. Maybe it's because I first hit XML DOMs through M$oft's (where there is no JDOM). Once you've got the knack of using the DOM, I don't see what JDOM offers long-term.