Using XML in Performance Sensitive Apps?
A Parser's Baggage queries: "For the last couple of years I've been working with XML based protocols and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30 or 40 concurrent requests. I've ran benchmarks on both Java and C#, and my results show that on a 2ghz CPU, the upper boundary for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues and what kinds of argument do you use to make the performance concerns know to the execs. I'm in favor of using XML for it's flexibility, but for performance sensitive applications, the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a 1 or 2 cpu server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."
It might be of some use if you actually told us what libraries you used, what methods, etc, not just "I tried to parse some XML files". Is that result of 20 concurrent requests using a SAX parser or DOM? Are you using the standard java DOM implementation (slow and bulky), or one of the slicker ones like JDOM, dom4j, etc (there's a bunch you should have a look at). Another thing you could do t o improve performance is to identify the points where you don't really need a DOM (eg you're just reading the values once and discarding) and use a SAX parser instead to fill in a custom class or a hashtable or such.
Daniel
Carpe Diem
well there's your problem.
With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100/per second. and my code is shitty.
what do you do with the XML, do you generate HTML from it with XSLT or what?
another thing to try: intelligently cache your results in shared memory. you can easily double performance or more.
The XML Police that exist in several communities will come down on you like flies on manure. "You can't parse XML in regexps! That's not really parsing! You need to use the standard-flavor-of-the-month XML libraries for your language (which of course, may need dozens of prerequisite libraries)! What about CDATA? DTDs?! Encodings!? OH THINK OF THE CHILDREN!"
<stage_whisper>But in my experience, most of the time, you're right</stage_whisper>
Get off my lawn.
This is an example of the wrong way to use XML.
XML is great because it's extensible and a markup language. It's great for storage, configuration files, and certain forms of data transmission (which is just a sub-set of storage).
What XML is not good for is performance-critical transmission protocols. It's too verbose and too complex, and both are bad for protocols. That is the mistake made by the author of the article. Go with a structured protocol and skip the XML.
"Times have not become more violent. They have just become more televised."
-Marilyn Manson
In a protocol designed for efficiency you shouldn't have to parse anything at all!
If some binary protocol was used you'd would for example use 1 char to represent the field types
another to represent the record types and so forth. If you put all this into a packet that can be DIRECTLY mapped on a C structure you'll
save god knows how many cycles. I like the way you say you just have to recognise tags. Have you any idea of the amount of
processing involved in even simple regexp matching?? This is the problem when high level coders try to design low level
systems, they simply don't have a clue how things really work and assume that the high level procedures/objects that they work with
are some sort of magic that "just happens" and you can use them everywhere with no performance degradation.
Most of the work in an off the shelf XML parser is verifying that the XML is "good" or matches some schema specification. If its coming from one of your programs and going to one of your programs and you've done reasonable debugging, its good. You just parse it and use it. Not enough has been done to optimize the "trusted" app communications scenario even though in reality, that's probably 95%+ of the actual usage of XML. Very few sites are actually publishing XML that is really getting used by programs and pages other than the ones they've written.
Parsing it is very easy and quick if you're in full control of the encoding. You can optimize your parser greatly by choosing not to handle the general case, but to instead handle only what your specific encoder generates.
Use the protocol, pick up the buzz word for your app, but leave the pain of the generalities meant to handle some free data exchange world that is 15 years in the future out. When the semantic net comes about and applications can actually use any XML without needing to be written to use that XML schema, then you can worry about the general case.