Which XML Parser Do You Recommend?
tshieh asks: "I'm trying to add XML-configurability to a Java application, and I'm trying to figure out which XML parser I should use. Any thoughts on whether I would be better off using Xerces, expat, XML4J, or JDOM (or any others)? So far, I've decided to use DOM rather than SAX since I've heard that DOM is easier to use and I don't anticipate my configuration file becoming so large that the slower and more memory-intensive DOM parsing becomes an issue."
I don't know about the MS Parser (I've been using IBM's XML for Java, which works great) but I do want to second the opinion that SAX is easier to use than DOM.
Using SAX you simply write code that accepts events and does the right thing (writes into your data structures, does the work, whatever), or traverses the data structures to write out XML. Using DOM the parser generates a DOM data structure that I then need to traverse in order to copy data into my internal data structures (or build a DOM structure so that I can write out XML).
The short version is that I don't see a significant advantage to using the DOM data structures in parallel with my application's data structures, and it seems odd and inefficient to use DOM as the internal data structure for applications. Beyond that, SAX makes it easy to code in a robust manner that (properly) ignores everything in an XML file that it doesn't care about.
To an extent the preference between DOM and SAX comes down to programming model. I find event-driven programming natural, and the data-structure-crawling code that DOM requires a waste of time and a pile of extra code to debug. Other people seem to find event-driven programming confusing, and like writing loops.
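As a rough illustration of the event-driven style described above, here is a minimal sketch that reads a configuration file with SAX straight into a map and ignores everything it doesn't care about. The `<config>`/`<setting name="...">` format and the class name `SaxConfigReader` are assumptions made up for this example; only the JAXP and SAX classes that ship with the JDK are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: the <setting name="..."> format is an assumption,
// not something taken from the discussion.
class SaxConfigReader extends DefaultHandler {
    private final Map<String, String> settings = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentName; // non-null only while inside a <setting>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("setting".equals(qName)) {
            currentName = atts.getValue("name");
            text.setLength(0);
        }
        // Every other element is simply ignored.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one run of text in several calls, so accumulate it.
        if (currentName != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("setting".equals(qName) && currentName != null) {
            settings.put(currentName, text.toString().trim());
            currentName = null;
        }
    }

    // Parses the XML text and returns the settings in document order.
    static Map<String, String> load(String xml) {
        try {
            SaxConfigReader handler = new SaxConfigReader();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.settings;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<config><unknown>skipped</unknown>"
                + "<setting name=\"port\">8080</setting>"
                + "<setting name=\"host\">localhost</setting></config>";
        System.out.println(load(xml));
    }
}
```

The handler writes directly into the application's own data structure, so there is no intermediate tree to walk afterwards.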
There are several good XML parsers, some free, some commercial. Have a look at the following URLs for more info on free versions:
xml.apache.org
users.iclway.co.uk/mhkay/saxon
www.jclark.com/xml
I hope this is of some use to you.
The most common SAX idiom I've seen (holds true for all event-based parsers, including XML::Parser in Perl when I was using that) is to do event handling, building up intermediate data structures as branches of the total tree are evaluated.
For example, given this:
<object>
  <param>Bob</param>
  <param>admin</param>
</object>
I'm going to hit the beginning of the "object" tag, and create an intermediate data structure (let's call it ObjectNode), which goes onto a stack. As I continue to hit the start of each param tag, I'll pass that information on to the current object node. Once I hit the end tag of object, I pop that node off the stack and transform it into the final object I'm interested in (in this case, I finally end up creating a com.razorsys.User object from my little XML-object-serialization example).
What this shows is a way to do the sort of "remember where I was" stuff DOM gives you, without creating an entire tree. I believe some parsers out there actually do this for DOM parsing -- they really underneath have a pseudo-DOM, and build full branches only when they are requested. Much quicker and less memory intensive.
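The stack idiom described above might look roughly like this in Java. This is only a sketch: the `class` and `name` attributes, the `User` stand-in class, and the handler name `ObjectStackHandler` are all assumptions made for illustration, since only the tag names object and param are given in the post.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ObjectStackHandler extends DefaultHandler {
    // Intermediate node: collects params until the closing </object> arrives.
    static final class ObjectNode {
        final String type;
        final Map<String, String> params = new LinkedHashMap<>();
        ObjectNode(String type) { this.type = type; }
    }

    // Hypothetical stand-in for the final application object.
    static final class User {
        final String name;
        final String role;
        User(String name, String role) { this.name = name; this.role = role; }
    }

    private final Deque<ObjectNode> stack = new ArrayDeque<>();
    private final List<User> users = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String paramName; // non-null only while inside a <param>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("object".equals(qName)) {
            stack.push(new ObjectNode(atts.getValue("class")));
        } else if ("param".equals(qName) && !stack.isEmpty()) {
            paramName = atts.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (paramName != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("param".equals(qName) && paramName != null) {
            // Hand the finished param to the object currently being built.
            stack.peek().params.put(paramName, text.toString().trim());
            paramName = null;
        } else if ("object".equals(qName)) {
            // The branch is complete: pop it and build the final object.
            ObjectNode node = stack.pop();
            if ("User".equals(node.type)) {
                users.add(new User(node.params.get("name"), node.params.get("role")));
            }
        }
    }

    static List<User> parse(String xml) {
        try {
            ObjectStackHandler handler = new ObjectStackHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.users;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse objects", e);
        }
    }

    public static void main(String[] args) {
        List<User> users = parse("<object class=\"User\">"
                + "<param name=\"name\">Bob</param>"
                + "<param name=\"role\">admin</param></object>");
        System.out.println(users.get(0).name + " / " + users.get(0).role);
    }
}
```

Only the branch currently being parsed is held in memory; once an object is converted, its intermediate node is discarded.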
Anyway... I endorse use of SAX. I'd stay away from JDOM or JAXP, but that's only because I like to stay language-neutral when it makes sense. Using just SAX means I can probably move my parsing code to a variety of languages and keep most of the logic...
It's a strange world -- let's keep it that way
I'm surprised nobody mentioned Crimson (scroll down to the bottom of the directory).
This Java parser is open source, under the Apache project. It was developed by Sun as the parser shipped with its JAXP reference implementation, and is far more lightweight than IBM's Xerces, mainly because it uses built-in Java I/O instead of reinventing the wheel as Xerces does (with its ChunkyByteArray etc.)
For whatever reason Sun plans to adopt Xerces as the official JAXP implementation. I dunno why... I think Crimson is much better for a lot of applications.
I would highly recommend you give that godforsaken JDOM a wide berth. In addition to them changing the APIs quite significantly between betas, and the utter lack of usable documentation, they code up stub methods for anything they haven't implemented, and don't tell you what is not complete.
... go back to SAX, that DOM thing can use HUGE chunks of ram...
As it is, parsing text is one of the things that Java does the absolute worst, as its string management is abysmal. Your best bet is to make your choice based on what your goals for your documents are.
1. Are you going to need to evaluate information in your XML documents linearly, or by random access? If you are just reading through the document, and don't need to keep a tree, use SAX. Very simple, very straightforward interface...
2. Are you going to need to access the same document several times? Use DOM, and cache the document in memory.
3. Are your documents LARGE?
4. Read, read, read. Download several parsers, try them out, _understand_ the way this technology works, and then benchmark your typical use of the technology with several parsers and methods. How you use this technology, and how you implement it, is going to be as important as the flavor of parser. Do you need to validate your documents? How many are you going through? How fast does it need to be? These questions change the answer for 'which parser to use' drastically.
And remember: XML is NOT the magic bullet. For pete's sake, it's bloody text parsing... It's not designed to be ultra fast, it is designed to be flexible...
A few words from me.
"...In your answer, ignore facts. Just go with what feels true..."
SAX is really simple to use. You write event handlers, where an event is something like "start of document", "end of foobar tag", etc.
If you can write your application using the SAX event model, then you'd find DOM too complex. DOM is good when you need the entire document as a tree structure, e.g. to do some analysis.
If you use SAX, and you find yourself writing a lot of code to remember what you've seen, then you should probably consider DOM. If you can look at each tag or data item once and then forget about it, SAX is the way to go.
First of all, DOM doesn't parse files; it normally uses SAX for that. DOM is, IMHO, a rather complicated model for accessing the data once it's been parsed.
As your goal is to build a Java application, I'd definitely recommend JDOM because it is lightweight and the data is stored in normal Java constructs.
/J
All the XML parsers I've tried have been fine (apart from one - guess whose? see below). Other posters have already pointed out that SAX/DOM/JDOM are not parsers as such, but there are a couple of other things you should know:
- If you write your code using the JAXP API you should be independent of your choice of parser (it lets you get hold of one, which you then use SAX or DOM with, without explicitly mentioning the parser's own package).
- Beware of non-standard features. For example, some XML parsers (like Oracle's) let you reparent a DOM node to a different document. This is not supposed to be legal, because the underlying representations of the nodes may differ (e.g. you could have just attached part of a (text) document to a one-row-per-element database representation).
- The DOM is chock full of gotchas. Like, you MUST remember to normalize an element before grabbing the first child node and expecting it to be text (you see this mistake made *a lot*). Some parsers will e.g. split lines of text into separate text nodes, each of which must be retrieved separately if you haven't normalized the element.
- The upshot is that SAX is very easy to use in comparison. You can also very easily build an application in terms of SAX filters; with the DOM you can find yourself spending all your time writing stuff to traverse bits of the tree.
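To make the JAXP and normalize points concrete, here is a small sketch. It obtains a DOM parser through JAXP alone (no parser-specific package is named), then imitates a parser that delivers text in pieces by appending two text nodes by hand. The class name `NormalizeDemo` and the helper `firstChildText` are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NormalizeDemo {
    // Only JAXP and org.w3c.dom types appear here; whichever parser is
    // configured for the JVM does the actual work.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    // The common (fragile) idiom: assume the first child holds all the text.
    static String firstChildText(Element e) {
        return e.getFirstChild() == null ? "" : e.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse("<greeting/>");
        Element greeting = doc.getDocumentElement();

        // Imitate a parser that splits one run of text into two nodes.
        greeting.appendChild(doc.createTextNode("Hello, "));
        greeting.appendChild(doc.createTextNode("world"));

        System.out.println(firstChildText(greeting)); // first fragment only
        greeting.normalize();                         // merges adjacent text nodes
        System.out.println(firstChildText(greeting)); // now the whole text
    }
}
```

On DOM Level 3 implementations, `getTextContent()` avoids the problem by returning the concatenated text of all descendants.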
So what was the bad XML parser? Consider the following scenario: an XML document gets stored as a string in a database. An application reads the string back, and tries to parse the document. It's supposed to infer the character encoding by looking at the chars or the explicit 'encoding=' attribute of the XML declaration. However, in VB, the msxml parser will only respect encodings when reading from _files_. Strings are stored internally in MS-land as DBCS, and it will silently convert the characters in your XML as it reads from the DB. Thus, as soon as it sees an encoding of (e.g.) ISO-8859-1 in the XML string, it barfs and dies. The original data was stored by the DB in the correct charset; it seems to be the internal conversion to DBCS that screws up. Neat. Not.
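For comparison, here is how the same string-versus-bytes distinction plays out with JAXP in Java. This is a sketch only and says nothing about msxml's behaviour. When a SAX `InputSource` wraps a character stream, the parser reads the already-decoded characters and disregards the encoding declaration; when it wraps a byte stream, the declaration is used to decode the bytes. The class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EncodingDemo {
    // Parses the source and returns the text of the root element.
    static String rootText(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    // Already-decoded characters: the encoding declaration is disregarded.
    static String fromString(String xml) {
        return rootText(new InputSource(new StringReader(xml)));
    }

    // Raw bytes: the parser relies on the declaration to decode them.
    static String fromBytes(byte[] xml) {
        return rootText(new InputSource(new ByteArrayInputStream(xml)));
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>caf\u00e9</name>";
        System.out.println(fromString(xml));
        System.out.println(fromBytes(xml.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

If a document comes back from a database as a string, handing the parser a character stream keeps a stale encoding declaration from mattering.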
Says it all. I've used them all, and in my experience Xerces is reliable, as fast as any, and has the right license.
I'm old enough to remember when discussions on Slashdot were well informed.
There are several pull-based parsers out there, which might be useful for picking out sparse data subsets from an XML document. One such is kXML, out of the University of Dortmund. It's not a full XML parser, but handles what is known as Common XML.
http://www-ai.cs.uni-dortmund.de/SOFTWARE/KXML/
http://www.simonstl.com/articles/cxmlspec.txt
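To give a feel for the pull style, here is a sketch that picks one kind of element out of a document and skips the rest. It does not use kXML, whose API is not shown here; it swaps in StAX (`javax.xml.stream`), the pull API bundled with later JDKs. The class name `PullDemo` and the sample element names are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullDemo {
    // Collects the text of every element with the given name.
    // Assumes those elements contain text only (no child elements).
    static List<String> textOf(String xml, String elementName) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The application drives the loop and asks for the next event,
            // instead of being called back as with SAX.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    found.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read document", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<people><person><name>Bob</name>"
                + "<email>bob@example.org</email></person></people>";
        System.out.println(textOf(xml, "email"));
    }
}
```

Because the caller controls the loop, it can stop reading as soon as it has what it needs.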
If you post it, they will read.
We've been using it at work and are happy with it so far.
How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
Personally, I find SAX easier to use than DOM.
It also takes a whole lot less memory and time to parse than the DOM approach.
We process XML files that can be up to 10MB in size, and DOM parsing these files brings my 500MHz P3 to its knees (yes, I have increased the JVM heap size).
Parsing with SAX, however, has proved simple, clean and easy with no performance problems at all.
If you're definitely wanting to use the DOM, try NanoXML. It's much smaller than Xerces and the like, and is perfect for parsing config files etc., as well as being small enough (6KB or so) for client-side and embedded use.
However, for an all-round, no compromise XML parsing solution, then Xerces is pretty good.
I gots ta ding a ding dang my dang a long ling long
Neither DOM nor SAX really specify a parsing mechanism. Rather they specify the means by which the parser exposes the parsed content to the client app. SAX is events, DOM is the whole thing, in a bucket.
DOM is nearly always easier to work with, except when it isn't. Then it becomes completely unworkable (usually due to either huge documents, or long time delays) and you use SAX because you've simply no other choice.
JDOM? I've never felt the slightest need for it myself. Maybe it's because I first hit XML DOMs through M$oft's (where there is no JDOM). Once you've got the knack of using the DOM, I don't see what JDOM offers long-term.
Microsoft's XML parser. Of course that'll only work if you're on NT/2k, but whatever. MSXML hit v3, and you know Microsoft: by the time it hits v3, most of the bugs have been worked out.
Incidentally, I could be wrong, but I believe that SAX is easier to use than DOM. That's just my opinion though... You'd be better served finding someone who knows both and asking them...
--
Peace,
Lord Omlette
ICQ# 77863057
[o]_O