Using XML in Performance Sensitive Apps?
A Parser's Baggage queries: "For the last couple of years I've been working with XML-based protocols, and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30 or 40 concurrent requests. I've run benchmarks on both Java and C#, and my results show that on a 2 GHz CPU, the upper bound for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues, and what kinds of arguments do you use to make the performance concerns known to the execs? I'm in favor of using XML for its flexibility, but for performance-sensitive applications, the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a 1 or 2 CPU server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."
<?xml version="1.0" encoding="UTF-8"?>z RKiQIXOSMfOskEqGRJJIpshcyjxLokLGIoRkCplDvP/zVhe/Dv /n+77d5/5+nude92zn7LPWXmt91mevvc+++1Vl5I7AhBB0+/v6 G5rpaNAwCBRiY3SRTkwMQid8wtza1NDO3M3UBAIDLk9C0Ajglz xEB4LEwuEQJAoN0SPcBsFh0Dg48F+yEAwUhUMC/6UCIVyfhuDQ CBRwLS4OoTO1NiH0DCH5x8XO9DwdICEchoPQQX//wNCQn78h1n Q0v1qQGDjuzzYUHIMFtSGxoGexSAQS3IZFgNpQ8A3akKg/2mCA NDBwG/rPZ2EwJO5PmWEwFBz0LByHAN0Hx6FB9yFR0D/1ANrgf+ oLA4wFB7ehQc9i4CjQOzBwDEgPLHAjqA2HBr0XB4eC25BIDKgN jQHpi8NB/3wHHJD6T1ngUAT2T92At4L8BQ4Y+M/3wmFQLBTUhg DZHA5DgcYeDsNCQc8Cwvw5pnA4HA16LzDMoHfAMWD5EFDwOxBw 5J8+DkcAwQBqw8DA/eEwoP6QgHagNiQK9CwSAwM/i0OAxhkFQ4 P6QyFAfg9HoeEgPVBAwP3ZhobiQGOP3sBGaBQK9F7ArUD3YcBY AgfcGfRewByg/jAYkD/DMThQ7MOxMAzoPiwS/CwWUATUhgU/i4 OCsAmACBho/HAosB44DAgTAbcC+R8CGP0/bY6AgrETAcWAxh4B xYFiEAGDQ8FtSMSfY4qAoTF/jh8ChoOCZIHDQPEBeAHuz3hDwM GxjwAGHyQzoAioPwQC/qePIxAoUPwiEBgsaEyRUBCOI5CA94La kDjQGCAxCNA7gNtAbSgYCCeBvAvyA4LIoPeiAID+sw0NA+UKBB oBwjoEGngY1IYFjxUGCAdQGxwNkg+zwRhgMAiQjTA4UCwAaA8F 9YfdwK+waCxIDywO7Pc4GCg3InAI8Pjh0FDws1hQHkRCoSAsRg Ie/acsSCgKlOORgEuC78OBxgoJgyP/lA8JhMef44KEYU">
<session session="2003-06-27T17:03:39GMT+08:00" session-serialNumber="06302003b01" encode-version="1.8"><structure id="bzip2"><info cdate="2003-07-12T14:57:07+08:00" expiry-date="" id="OBD12" mdate="2003-07-12T14:57:07+08:00" name="" notes="" organization="Sd7+/OtxQ==" version="1.0"/><content code="H4sIAAAAAAAAAMy9CThW2xc/rpQpYxKJvIakEu88IJk
Hint: The shorter the header, the faster.
P.S. This is a joke, for the humor-impaired.
1. I use DOM objects, in this case the MSXML free-threaded model, to handle XML strings, and read out the string only at the last point.
2. I would also suggest using wstring/string from the STL, as you can reserve string buffers in advance in case you have to handle the XML as strings. That's if you're using C++; I don't know much about C#/Java, sorry.
Using this method I have managed to push it to ~200 concurrent requests.
mlati
It might be of some use if you actually told us what libraries you used, what methods, etc., not just "I tried to parse some XML files". Is that result of 20 concurrent requests using a SAX parser or DOM? Are you using the standard Java DOM implementation (slow and bulky), or one of the slicker ones like JDOM, dom4j, etc. (there's a bunch you should have a look at)? Another thing you could do to improve performance is to identify the points where you don't really need a DOM (e.g. you're just reading the values once and discarding them) and use a SAX parser instead to fill in a custom class or a hashtable or such.
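As a rough illustration of the last suggestion, here is a minimal Python sketch that uses the standard library's SAX interface to fill a plain dict straight from the parse events, with no DOM tree in between. The `FieldCollector` class, the `parse_fields` helper and the sample message are made up for this example, and it assumes a flat message whose fields are simple leaf elements.

```python
import xml.sax


class FieldCollector(xml.sax.ContentHandler):
    """Collects the text of leaf elements into a plain dict, no DOM tree."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._name = None
        self._chunks = []

    def startElement(self, name, attrs):
        # Remember which element we are inside and start a fresh text buffer.
        self._name = name
        self._chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate them.
        self._chunks.append(content)

    def endElement(self, name):
        # Only store elements that closed without a child element in between.
        if name == self._name:
            self.fields[name] = "".join(self._chunks).strip()
        self._name = None


def parse_fields(xml_bytes):
    """Parse a flat XML message into a dict of element name -> text."""
    handler = FieldCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.fields


if __name__ == "__main__":
    # Hypothetical message; the element names are invented for illustration.
    message = b"<order><id>42</id><symbol>IBM</symbol><qty>100</qty></order>"
    print(parse_fields(message))
```

The same shape carries over to a Java `DefaultHandler` filling a `HashMap` or a custom class; whether it actually helps depends on how much of the DOM you were really using.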
Daniel
Carpe Diem
well there's your problem.
With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second. and my code is shitty.
what do you do with the XML, do you generate HTML from it with XSLT or what?
another thing to try: intelligently cache your results in shared memory. you can easily double performance or more.
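A minimal sketch of the caching idea, in Python for brevity. Note the assumptions: this uses an in-process `functools.lru_cache` rather than shared memory as the poster suggests, and the `render` function and its output format are invented for the example. The point is only that identical requests skip the parse and transform entirely.

```python
import functools
import xml.etree.ElementTree as ET


@functools.lru_cache(maxsize=1024)
def render(xml_text):
    """Parse a request and build the response; repeated input hits the cache."""
    root = ET.fromstring(xml_text)
    items = [child.text or "" for child in root]
    return "<ul>" + "".join(f"<li>{text}</li>" for text in items) + "</ul>"


if __name__ == "__main__":
    doc = "<list><i>a</i><i>b</i></list>"
    render(doc)  # first call parses and renders
    render(doc)  # second call is answered from the cache
    print(render.cache_info())
```

This only pays off when the same input really does recur, and the cache key here is the whole message text, so two messages that differ only in whitespace count as different requests.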
I love XML, and I use it anywhere I can get away with it, but I know from my old job, that switching to a binary protocol that is streamlined for the task at hand can give you performance gains over XML protocols that are just plain ridiculous.
I think the results we measured were something like 1000 times as many connections on a custom binary protocol over an XML-based one.
That was in C++ mind you. YMMV.
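To make the size difference concrete, here is a small Python sketch that encodes the same record once as a fixed-layout binary message and once as XML. The record layout (an id, an 8-byte symbol, a quantity) and all function names are hypothetical, and it says nothing about the 1000x figure above, which will depend heavily on the workload.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fixed layout: network byte order, uint32 id,
# 8-byte NUL-padded ASCII symbol, uint32 quantity. 16 bytes in total.
ORDER = struct.Struct("!I8sI")


def encode_binary(order_id, symbol, qty):
    """Pack the record into the fixed 16-byte layout."""
    return ORDER.pack(order_id, symbol.encode("ascii"), qty)


def decode_binary(payload):
    """Unpack the fixed layout; no scanning or tokenizing is needed."""
    order_id, symbol, qty = ORDER.unpack(payload)
    return order_id, symbol.rstrip(b"\0").decode("ascii"), qty


def encode_xml(order_id, symbol, qty):
    """Encode the same record as a small XML document."""
    root = ET.Element("order")
    ET.SubElement(root, "id").text = str(order_id)
    ET.SubElement(root, "symbol").text = symbol
    ET.SubElement(root, "qty").text = str(qty)
    return ET.tostring(root)


if __name__ == "__main__":
    binary = encode_binary(42, "IBM", 100)
    text = encode_xml(42, "IBM", 100)
    print(len(binary), "bytes binary vs", len(text), "bytes XML")
    print(decode_binary(binary))
```

The binary form gives up everything XML is liked for (self-description, extensibility, readability on the wire), which is exactly the trade being described.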
Give me liberty or give me kill -s 9
First off, any chance you could post those benchmarks? 20 requests/second seems low, I'm wondering what the rest of the setup was.
For the first part: we had performance problems on an app where the customer had insisted on XML everywhere. However, in one particularly critical part of the system we were getting hammered by the garbage collection overhead of SAX (it's efficient for text in elements, but not for attribute values or element names).
Anyway - we knew what was coming into the system as we were also the producers of this xml at an earlier stage. So we wrote a custom SAX parser that only supported ASCII, no DTDs, internal subsets etc; and wrote it to return element/attribute names from a pool (IIRC we used a ternary tree to store this stuff, so we didn't need to create a string to do the lookup).
It was like night and day. XML parsing dropped from generating 80% of the garbage to about 5% and it just didn't appear on my list of performance issues from then on.
Java strings do a lot of copying; the point is to get yourself as close to a zero-copy XML parser as you can.
You might want to look at switching toolkits entirely as well - GLUE's benchmarks sound a lot better than yours.
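The name-pooling trick can be sketched roughly as follows. This is Python, and it uses a plain trie of dicts rather than the ternary tree mentioned above, so treat it as an illustration of the lookup idea only; `NamePool` and its methods are made-up names. The lookup walks the input buffer byte by byte and returns one shared string per known name, so no temporary string is built just to ask "which name is this?".

```python
class NamePool:
    """Maps byte ranges of an input buffer to shared name strings via a trie."""

    _END = -1  # key marking "a complete name ends here"; never a valid byte

    def __init__(self, names):
        self._root = {}
        for name in names:
            node = self._root
            for byte in name.encode("ascii"):
                node = node.setdefault(byte, {})
            node[self._END] = name

    def lookup(self, buf, start, end):
        """Return the pooled string for buf[start:end], or None if unknown.

        The buffer is indexed one byte at a time, so no slice or
        temporary string is created for the lookup itself.
        """
        node = self._root
        for i in range(start, end):
            node = node.get(buf[i])
            if node is None:
                return None
        return node.get(self._END)


if __name__ == "__main__":
    pool = NamePool(["order", "id", "qty"])
    buf = b"<order><id>42</id></order>"
    opening = pool.lookup(buf, 1, 6)    # "order" in the start tag
    closing = pool.lookup(buf, 20, 25)  # "order" in the end tag
    print(opening, closing, opening is closing)
```

In a Java parser the payoff is that element and attribute names stop generating garbage; in Python this is purely a demonstration of the data structure, not a performance claim.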
Have you profiled your application?
:-)
Do you test on a dedicated test system?
If you're only getting 20 concurrent users regardless of platform (which could be; it really depends on the setup and the complexity of the problem), maybe the technology isn't the problem and it could be the network, etc.
Benchmarking is fine, but if you do it on the whole system you don't know what the problem really is.
Find out precisely what the problem is (network/XML parser/your app logic/db connection/db speed). Look at your own code with a profiler to see the bottleneck.
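For what profiling a request handler might look like in practice, here is a small Python sketch using the standard `cProfile` and `pstats` modules. The handler is split into a parse step and a stand-in "business logic" step so the report shows which one dominates; all function names are invented for the example, and the wasteful loop is deliberate.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET


def parse_request(xml_text):
    """The XML part of handling a request."""
    return ET.fromstring(xml_text)


def business_logic(root):
    # Stand-in for application work; deliberately wasteful for the demo.
    return sum(len(child.text or "") for child in root for _ in range(200))


def handle_request(xml_text):
    return business_logic(parse_request(xml_text))


def profile_requests(xml_text, count=200):
    """Run handle_request repeatedly under cProfile and return the report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(count):
        handle_request(xml_text)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
    return out.getvalue()


if __name__ == "__main__":
    print(profile_requests("<r><a>hello</a><b>world</b></r>"))
```

Reading the cumulative column for `parse_request` against `business_logic` is a quick way to check whether the parser is really where the time goes before swapping it out.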
If you do end up blaming the parser, change it! (And I don't mean using a different parsing method, as most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than those used as standard (Xerces isn't the fastest Java parser around!). Also look at the most efficient way of using the tree (Java DOM is, as already said, slow in usage), or maybe you can go from SAX directly to your object model without using a tree by building your own SAX handler.
If you can't get a performance gain (which I really doubt), be honest with your client: "If you want to do it that way it's going to cost you" or "it can't be done on one machine". How did they get the idea they could handle 1000s of requests a second anyway? Work on your expectation management (basically, work on making their expectations more realistic). If you promise mountains, make sure you can deliver them first. If you can't deliver them, make them not want mountains but molehills.
The problem with perceived XML inefficiency is that many implementations build a whole parse tree in memory - that's slow mostly because of node allocations/deallocations. Removing the intermediary parse tree decreased CPU time per request by a factor of 15 in my application.
The XML Police that exist in several communities will come down on you like flies on manure. "You can't parse XML in regexps! That's not really parsing! You need to use the standard-flavor-of-the-month XML libraries for your language (which of course, may need dozens of prerequisite libraries)! What about CDATA? DTDs?! Encodings!? OH THINK OF THE CHILDREN!"
<stage_whisper>But in my experience, most of the time, you're right</stage_whisper>
Get off my lawn.
This is an example of the wrong way to use XML.
XML is great because it's extensible and a markup language. It's great for storage, configuration files, and certain forms of data transmission (which is just a sub-set of storage).
What XML is not good for is performance-critical transmission protocols. It's too verbose and too complex, and both are bad for protocols. That is the mistake made by the author of the article. Go with a structured protocol and skip the XML.
"Times have not become more violent. They have just become more televised."
-Marilyn Manson