Which XML Parser Do You Recommend?

Posted by Cliff on Sunday January 28, 2001 @05:21AM from the bison,-clean-up-your-yacc-please dept.

tshieh asks: "I'm trying to add XML-configurability to a Java application, and I'm trying to figure out which XML parser I should use. Any thoughts on whether I would be better off using Xerces, expat, XML4J, or JDOM (or any others)? So far, I've decided to use DOM rather than SAX since I've heard that DOM is easier to use and I don't anticipate my configuration file becoming so large that the slower and more memory-intensive DOM parsing becomes an issue."

17 comments

Re:i recommend by Anonymous Coward · 2001-01-28 06:33 · Score: 1

I don't know about the MS Parser (I've been using IBM's XML for Java, which works great) but I do want to second the opinion that SAX is easier to use than DOM.

Using SAX you simply write code that accepts events and does the right thing (writes into your data structures, does the work, whatever), or traverses the data structures to write out XML. Using DOM the parser generates a DOM data structure that I then need to traverse in order to copy data into my internal data structures (or build a DOM structure so that I can write out XML).

The short version is that I don't see a significant advantage to using the DOM data structures in parallel with my application's data structures, and it seems odd and inefficient to use DOM as the internal data structure for applications. Beyond that, SAX makes it easy to code in a robust manner that (properly) ignores everything in an XML file that it doesn't care about.

To an extent the preference between DOM and SAX comes down to programming model. I find event-driven programming natural, and the data structure crawling code required by DOM a waste of time and a pile of extra code to debug. Other people seem to find event-driven programming confusing, and like writing loops.
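
A minimal sketch of the event-driven style described above, using the SAX classes that ship with the JDK. The config format (a `config` element holding `setting` elements with `name` and `value` attributes) and the class name `SaxConfigSketch` are assumptions made up for illustration; nothing in the thread specifies them.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxConfigSketch {
    // Hypothetical format: <config><setting name="..." value="..."/></config>
    static Map<String, String> readSettings(String xml) throws Exception {
        final Map<String, String> settings = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Copy straight into our own data structure; every
                // element we do not care about is silently skipped.
                if ("setting".equals(qName)) {
                    settings.put(atts.getValue("name"), atts.getValue("value"));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return settings;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config><setting name=\"port\" value=\"8080\"/>"
                   + "<note>ignored</note></config>";
        System.out.println(readSettings(xml)); // prints {port=8080}
    }
}
```

The unknown `note` element needs no special handling, which is the robustness point made above.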
XML/XSLT parsers by shrike · 2001-01-28 02:19 · Score: 2

There are several good XML parsers, some free, some commercial. Have a look at the following URLs for more info on free versions:

xml.apache.org
users.iclway.co.uk/mhkay/saxon
www.jclark.com/xml

I hope this is of some use to you.
1. Re:XML/XSLT parsers by trsheph · 2001-01-28 04:25 · Score: 2
  
  Resin is a web server that can run JSP/Servlets like Tomcat, but it can also parse XML (you have to name *.xml files *.xtp in order for it to parse them). It is located here.
Re:SAX or DOM by msuzio · 2001-01-28 23:49 · Score: 2

The most common SAX idiom I've seen (holds true for all event-based parsers, including XML::Parser in Perl when I was using that) is to do event handling, building up intermediate data structures as branches of the total tree are evaluated.

For example, given this:

Bob
admin

I'm going to hit the beginning of the "object" tag, and create an intermediate data structure (let's call it ObjectNode). As I continue to hit the start of each param tag, I'll pass that information on to the current object node. Once I hit the end tag of object, I pop that node off the stack and transform it into the final object I'm interested in (in this case, I finally end up creating a com.razorsys.User object from my little XML-object-serialization example).

What this shows is a way to do the sort of "remember where I was" stuff DOM gives you, without creating an entire tree. I believe some parsers out there actually do this for DOM parsing -- they really underneath have a pseudo-DOM, and build full branches only when they are requested. Much quicker and less memory intensive.
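
The stack idiom described above could be sketched roughly like this. The markup of the original example did not survive, so the XML shape here (an `object` element with a `class` attribute, containing `param` elements with a `name` attribute) is a guess, and the `User` class is a hypothetical stand-in for com.razorsys.User.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxStackSketch {
    /** Intermediate node, alive only while we are inside an <object>. */
    static class ObjectNode {
        final String type;
        final Map<String, String> params = new LinkedHashMap<>();
        ObjectNode(String type) { this.type = type; }
    }

    /** Hypothetical final object (stand-in for com.razorsys.User). */
    static class User {
        final String name;
        final String role;
        User(String name, String role) { this.name = name; this.role = role; }
    }

    static class Handler extends DefaultHandler {
        final Deque<ObjectNode> stack = new ArrayDeque<>();
        final List<User> users = new ArrayList<>();
        private String currentParam;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("object".equals(qName)) {
                stack.push(new ObjectNode(atts.getValue("class")));
            } else if ("param".equals(qName)) {
                currentParam = atts.getValue("name");
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver one text run in several calls, so accumulate.
            if (currentParam != null) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("param".equals(qName) && !stack.isEmpty()) {
                stack.peek().params.put(currentParam, text.toString().trim());
                currentParam = null;
            } else if ("object".equals(qName)) {
                // Pop the finished branch and turn it into the final object.
                ObjectNode node = stack.pop();
                users.add(new User(node.params.get("name"), node.params.get("role")));
            }
        }
    }

    static List<User> parse(String xml) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.users;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<object class=\"User\"><param name=\"name\">Bob</param>"
                   + "<param name=\"role\">admin</param></object>";
        User u = parse(xml).get(0);
        System.out.println(u.name + " (" + u.role + ")"); // prints Bob (admin)
    }
}
```

Only one branch is held in memory at a time; once the `object` end tag is seen, the intermediate node is dropped.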

Anyway... I endorse use of SAX. I'd stay away from JDOM or JAXP, but that's only because I like to stay language-neutral when it makes sense. Using just SAX means I can probably move my parsing code to a variety of languages and keep most of the logic...

--
It's a strange world -- let's keep it that way
Crimson by dgenr8 · 2001-01-29 00:14 · Score: 2

I'm surprised nobody mentioned Crimson (scroll down to the bottom of the directory).

This Java parser is open source, under the Apache project. It was developed by Sun under the name of JAXP and is far more lightweight than IBM's Xerces, mainly because it uses built-in Java I/O instead of reinventing the wheel as Xerces does (with its ChunkyByteArray etc.).
For whatever reason, Sun plans to adopt Xerces as the official JAXP implementation. I dunno why... I think Crimson is much better for a lot of applications.
Do NOT use JDOM... by His+name+cannot+be+s · 2001-01-28 21:53 · Score: 1

I would highly recommend you give a wide berth to that godforsaken JDOM. In addition to them changing the APIs quite significantly between betas, and the utter lack of usable documentation, they code up stub methods for anything they haven't implemented, and don't tell you what is not complete.

As it is, parsing text is one of the things that Java does the absolute worst, as its string management is abysmal. Your best bet is to make your choice based on what your goals for your documents are.

1. Are you going to need to evaluate information in your XML documents linearly, or by random access? If you are just reading through the document, and don't need to keep a tree, use SAX. Very simple, very straightforward interface...

2. Are you going to need to access the same document several times? Use DOM, and cache the document in memory.

3. Are your documents LARGE? ... go back to SAX, that DOM thing can use HUGE chunks of RAM...

4. Read, read, read. Download several parsers, try them out, _understand_ the way this technology works, and then benchmark your typical use of the technology with several parsers and methods. How you use this technology, and its implementation, is going to be as important as the flavor of parser. Do you need to validate your documents? How many are you going through? How fast does it need to be? These questions change the answer for 'which parser to use' drastically.

And remember: XML is NOT the magic bullet. For pete's sake, it's bloody text parsing... It's not designed to be ultra fast, it is designed to be flexible...

A few words from me.

--
"...In your answer, ignore facts. Just go with what feels true..."
SAX or DOM by geophile · 2001-01-28 12:25 · Score: 4

I've been using Xerces and have had no problems with it. Haven't tried the others.
SAX is really simple to use. You write event handlers, where an event is something like "start of document", "end of foobar tag", etc.
If you can write your application using the SAX event model, then you'd find DOM too complex. DOM is good when you need the entire document as a tree structure, e.g. to do some analysis.
If you use SAX, and you find yourself writing a lot of code to remember what you've seen, then you should probably consider DOM. If you can look at each tag or data item once and then forget about it, SAX is the way to go.
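
For contrast, a rough sketch of the DOM side: the whole document is parsed into a tree first, and values are then looked up in any order, as often as needed. It uses the JAXP classes bundled with the JDK; the `setting` element with `name` and `value` attributes is a made-up config format, not something from the original question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomConfigSketch {
    /** Parse the whole document into an in-memory tree. */
    static Document load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Look up a hypothetical <setting name="..." value="..."/> by name. */
    static String setting(Document doc, String name) {
        NodeList nodes = doc.getElementsByTagName("setting");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (name.equals(e.getAttribute("name"))) {
                return e.getAttribute("value");
            }
        }
        return null; // not present
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("<config><setting name=\"host\" value=\"localhost\"/>"
                          + "<setting name=\"port\" value=\"8080\"/></config>");
        // The tree stays around, so lookups can happen in any order.
        System.out.println(setting(doc, "port")); // prints 8080
        System.out.println(setting(doc, "host")); // prints localhost
    }
}
```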
JDOM is by far the easiest by krakan · 2001-01-28 02:11 · Score: 2

First of all DOM doesn't parse files; it normally uses SAX for that. DOM is an, IMHO, rather complicated model to access the data once it's been parsed.

As your goal is to build a java application, I'd definitely recommend JDOM because it is lightweight and the data is stored in normal java constructs.

/J
Another anti-DOM voice. by Bazzargh · 2001-01-28 20:47 · Score: 2

All the XML parsers I've tried have been fine (apart from one - guess whose? see below). Other posters have already pointed out that SAX/DOM/JDOM are not parsers as such, but there are a couple of other things you should know:

- If you write your code using the JAXP API you should be independent of your choice of parser (it lets you get hold of one, which you then use SAX or DOM with, without explicitly mentioning the parser's own package).
- Beware of non-standard features. For example, some XML parsers (like Oracle's) let you reparent a DOM node to a different document. This is not supposed to be legal, because the underlying representations of the nodes may differ (e.g. you could have just attached part of a (text) document to a one-row-per-element database representation).
- The DOM is chock full of gotchas. Like, you MUST remember to normalize an element before grabbing the first child node and expecting it to be text (you see this mistake made *a lot*). Some parsers will e.g. split lines of text into separate text nodes, each of which must be retrieved separately if you haven't normalized the element.
- Upshot is that SAX is very easy to use in comparison. You can also very easily build an application in terms of SAX filters; with the DOM you can find yourself spending all your time writing stuff to traverse bits of the tree.
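
The normalize gotcha from the list above can be reproduced without any particular parser. This sketch obtains a DOM implementation through JAXP (so no parser package is named) and builds an element with two adjacent text nodes by hand, to imitate a parser that splits text; the element name and class name are arbitrary.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomNormalizeSketch {
    /** The naive idiom: assume the first child holds all of the text. */
    static String firstChildText(Element e) {
        return e.getFirstChild().getNodeValue();
    }

    /** Build an element whose text is split over two adjacent text nodes. */
    static Element splitTextElement() throws Exception {
        // JAXP returns whichever DOM implementation is configured.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element e = doc.createElement("greeting");
        doc.appendChild(e);
        e.appendChild(doc.createTextNode("hello "));
        e.appendChild(doc.createTextNode("world"));
        return e;
    }

    public static void main(String[] args) throws Exception {
        Element e = splitTextElement();
        System.out.println("[" + firstChildText(e) + "]"); // prints [hello ]
        e.normalize(); // merges adjacent text nodes
        System.out.println("[" + firstChildText(e) + "]"); // prints [hello world]
    }
}
```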

So what was the bad XML parser? Consider the following scenario: an XML document gets stored as a string in a database. An application reads the string back, and tries to parse the document. It's supposed to infer the character encoding by looking at the chars or the explicit 'encoding=' attribute of the XML PI. However, in VB, the msxml parser will only respect encodings when reading from _files_. Strings are stored internally in MS-land as DBCS, and it will silently convert the characters in your XML as it reads from the DB. Thus, as soon as it sees an encoding of (e.g.) ISO-8859-1 in the XML string, it barfs and dies. The original data was stored by the DB in the correct charset; it seems to be the internal conversion to DBCS that screws up. Neat. Not.
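
For comparison only, here is how the same situation looks on the Java side with JAXP; it says nothing about msxml itself. Per the SAX `InputSource` documentation, a parser given a character stream reads it directly and disregards the encoding declaration, while a byte stream is decoded according to the declaration. The class name and sample document are made up for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EncodingSketch {
    // Declares ISO-8859-1 and contains one non-ASCII character (e-acute).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Bytes in: the parser decodes them using the declared encoding. */
    static String fromBytes() throws Exception {
        byte[] raw = XML.getBytes(StandardCharsets.ISO_8859_1);
        return newBuilder().parse(new ByteArrayInputStream(raw))
                .getDocumentElement().getTextContent();
    }

    /** Characters in: already decoded, so the declaration is not applied. */
    static String fromString() throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(XML)))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Both routes should yield the same five characters.
        System.out.println(fromBytes().equals(fromString()));
    }
}
```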
Xerces by Simon+Brooke · 2001-01-28 06:35 · Score: 3

Says it all. I've used them all, and in my experience Xerces is reliable, as fast as any, and has the right license.

--
I'm old enough to remember when discussions on Slashdot were well informed.
1. Re:Xerces by Steve+Cox · 2001-01-28 07:11 · Score: 2
  
  For our current project, I have chosen Xerces (using DOM). Like the guy says - fast and reliable.
  Also bear in mind that Xerces is a validating parser - I have no idea if the other parsers mentioned are or not, but Xerces certainly is. Using a DTD and letting Xerces take care of the majority (all) of the validation of your XML data is great.
  I started using Xerces for a Java project, and eventually required a C++ element to it - the Xerces XML parser is also available in a C++ version and the API is practically identical to that of the Java version (incl. smart pointer based access to allow Java-style memory management and parameter passing). This made my job one hell of a lot easier - rigorous use of Ctrl+C, Ctrl+V.
  Steve.
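
A minimal sketch of letting the parser do DTD validation, written against the generic JAXP API rather than Xerces-specific classes. The DTD, the `config`/`setting` vocabulary and the class name are assumptions for illustration. Validity problems are reported to the `ErrorHandler` instead of being thrown, so the sketch collects them in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Hypothetical DTD, kept in the internal subset so the example is self-contained.
    static final String DTD =
        "<!DOCTYPE config [\n"
      + "  <!ELEMENT config (setting*)>\n"
      + "  <!ELEMENT setting EMPTY>\n"
      + "  <!ATTLIST setting name CDATA #REQUIRED value CDATA #REQUIRED>\n"
      + "]>\n";

    /** Returns the validity errors found in the given document body. */
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored here */ }
            @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate(
            "<config><setting name=\"port\" value=\"8080\"/></config>").isEmpty());
        // Missing the required "value" attribute, so an error is reported.
        System.out.println(validate(
            "<config><setting name=\"port\"/></config>").isEmpty());
    }
}
```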
Pull-based parsers by jlowery · 2001-01-29 06:20 · Score: 1

There are several pull-based parsers out there, which might be useful for picking out sparse data subsets from an XML document. One such is kXML, out of the University of Dortmund. It's not a full XML parser, but handles what is known as Common XML.

http://www-ai.cs.uni-dortmund.de/SOFTWARE/KXML/

http://www.simonstl.com/articles/cxmlspec.txt
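
To give a feel for the pull style, here is a small sketch. It does not use kXML; it swaps in StAX (`javax.xml.stream`), the pull API that ships with current JDKs, and the sample document and class name are made up. The application asks for the next event itself and stops to read only the elements it wants.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullParserSketch {
    /** Collect the text of every element with the given name. */
    static List<String> collectText(String xml, String elementName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> found = new ArrayList<>();
        while (reader.hasNext()) {
            // We drive the loop and pull events, rather than receiving callbacks.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && elementName.equals(reader.getLocalName())) {
                // getElementText() only works for text-only elements.
                found.add(reader.getElementText());
            }
        }
        reader.close();
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><entry><id>1</id><msg>a</msg></entry>"
                   + "<entry><id>2</id><msg>b</msg></entry></log>";
        System.out.println(collectText(xml, "id")); // prints [1, 2]
    }
}
```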

--
If you post it, they will read.
JDOM by bnenning · 2001-01-28 02:42 · Score: 4

You might want to check out JDOM. It uses the parser of your choice to build a tree structure like DOM, but it has a much simpler API. From their mission page:
There is no compelling reason for a Java API to manipulate XML to be complex, tricky, unintuitive, or a pain in the neck. JDOM is both Java-centric and Java-optimized. It behaves like Java, it uses Java collections, it is a completely natural API for current Java developers, and it provides a low-cost entry point for using XML.

We've been using it at work and are happy with it so far.

--
How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
1. Re:JDOM by tshieh · 2001-01-28 15:27 · Score: 1
  
  I thought about using JDOM (especially after reading Brett McLaughlin's Java and XML), but I hesitate to use it right now because it is still in beta. The absence of release, milestone, or even nightly builds at http://www.jdom.org/downloads/index.html adds to my discomfort with JDOM. Don't get me wrong - I think JDOM is a great idea - but I will have a hard time justifying to my co-workers the use of beta software in building a system that should go into production within the next two months. For now, I've decided to use Xerces, partly because it is well-documented (I found the DOMFilter example code to be particularly useful) and partly because some of my co-workers have already used the C++ version.
  
  --
  sig: BeanShell: lightweight scripting for Ja
DOM sucks by ikekrull · 2001-01-28 08:02 · Score: 3

Personally, I find SAX easier to use than DOM.

It also takes a whole lot less memory and time to parse than the DOM approach.

We process XML files that can be up to 10MB in size, and DOM parsing these files brings my 500MHz P3 to its knees (yes, I have increased the JVM heap size).

Parsing with SAX, however, has proved simple, clean and easy with no performance problems at all.

If you definitely want to use the DOM, try NanoXML. It's much smaller than Xerces and the like, and is perfect for parsing config files etc., as well as being small enough (6KB or so) for client-side and embedded use.

However, for an all-round, no compromise XML parsing solution, then Xerces is pretty good.

--
I gots ta ding a ding dang my dang a long ling long
DOM vs SAX by dingbat_hp · 2001-01-29 02:54 · Score: 2

Neither DOM nor SAX really specify a parsing mechanism. Rather they specify the means by which the parser exposes the parsed content to the client app. SAX is events, DOM is the whole thing, in a bucket.
DOM is nearly always easier to work with, except when it isn't. Then it becomes completely unworkable (usually due to either huge documents, or long time delays) and you use SAX because you've simply no other choice.
JDOM ? I've never felt the slightest need for it myself. Maybe it's because I first hit XML DOMs through M$oft's (where there is no JDOM). Once you've got the knack of using the DOM, I don't see what JDOM offers long-term.
i recommend by Lord+Omlette · 2001-01-28 05:37 · Score: 1

Microsoft's XML parser. Of course that'll only work if you're on NT/2k, but whatever. MSXML hit v3, and you know Microsoft: by the time it hits v3, most of the bugs have been worked out.

Incidentally, I could be wrong but I believe that SAX was easier to use than DOM. That's just my opinion though... You'd be better served finding someone who knows both and asking them...
--
Peace,
Lord Omlette
ICQ# 77863057

--
[o]_O