Using XML in Performance Sensitive Apps?
A Parser's Baggage queries: "For the last couple of years I've been working with XML based protocols and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30 or 40 concurrent requests. I've ran benchmarks on both Java and C#, and my results show that on a 2ghz CPU, the upper boundary for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues and what kinds of argument do you use to make the performance concerns know to the execs. I'm in favor of using XML for it's flexibility, but for performance sensitive applications, the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a 1 or 2 cpu server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."
1. I use DOM objects, in this case the MSXML free threaded model, to handle xml strings and read out the string only at the last point.
2. I would also suggest using wstring/string in the STL library as you can reserve string buffers in advance in case you have to handle the XML as strings, that's if your using c++, don't know much about c#/java sorry.
using this method I have manage to push it to ~200 concurrent requests.
mlati
First off, any chance you could post those benchmarks? 20 requests/second seems low, I'm wondering what the rest of the setup was.
For the first part: we had performance problems on an app where the customer had insisted on xml everywhere. However, in one particularly critical part of the system we were getting hammered by the garbage collection overhead of SAX (its efficient for text in elements, but not for attribute values or element names).
Anyway - we knew what was coming into the system as we were also the producers of this xml at an earlier stage. So we wrote a custom SAX parser that only supported ASCII, no DTDs, internal subsets etc; and wrote it to return element/attribute names from a pool (IIRC we used a ternary tree to store this stuff, so we didn't need to create a string to do the lookup).
It was like night and day. XML parsing dropped from generating 80% of the garbage to about 5% and it just didn't appear on my list of performance issues from then on.
Java strings do a lot of copying, the point is to get yourself as close as possible to a zero-copy xml parser as you can.
You might want to look at switching toolkits entirely as well - GLUEs benchmarks sound a lot better than yours.
The problem with perceived XML inefficiency is that many implementations build a whole parse tree in memory - that's slow mostly because of node allocations/deallocations. Removing the intermediary parse tree decreased CPU time per request by the factor of 15 in my application.
The solution would be to load the DOM in the backend and have front-end applications access it.
You could try using AOLserver as a multi-threaded web server and tDOM as your DOM processor.
-- Don't Tase me, bro!
Many have asked about what libraries you are using to get at the XML. Loading up a whole DOM document is indeed quite inefficient.
.Net platform, I would suggest using the XmlTextReader class. This class and its bretheren are the parsers underlying Microsoft's DOM implementation, and anything else that needs access to XML. The class is noted for its strong performance advantage over loading a DOM or using XPathNavigator - and it is indeed a very lightweight class. It is certainly not as comfortable to use as the DOM, but neither is it incredibly painful, especially if your documents are relatively simple.
On the
Give XmlTextReader a shot.
I'd have to agree with people's assertion that performance intensive apps should use a custom protocol and preferably binary based or some kind of delayed stream parser that only accesses the XML node when the app calls for it. I believe Sun has an API in the works for XML stream parsing JSR 317. It's too bad the jsr is still in public review phase. I've written custom parser in the past using SAX and it can definitely improve performance if you convert it to an object model. The question is trade off between being generalized and performance.
In the case of a webservice that uses schema, it's going to be hard to get around the performance issue. An obvious solution in situations where XML is required is to send as little as possible and only get the nodes you need. In that respect XPP2 and XmlTextReader help, until you need the entire document and you use the whole document.
--toomuchPerl
If you are using C/C++ check out gSOAP. It goes real fast, runs on many platforms, and I've used it to talk to Java, PHP, C# etc without a problem. It does about 3000 transactions per second on my little desktop PC. Obviously 100 parallel clients aren't going to get that speed, but it sounds like it will be much faster than what you're using!
http://www.cs.fsu.edu/~engelen/soap.html
I have to agree with many of the comments. The parser you choose is the most important decision. DOM is typically a memory hog and takes time. In my experience the MSXML 4.0 parser is very fast, written in C, etc. DOM is easier to user, but obviously can have some downsides. XML is great for portability and faster development, but performance concerns can arise.
Find out where the bottleneck lies. If you are running an XSLT processor on the server, that will limit your request/sec. I've found that stream XML from the server to a client (such as IE6, gasp) and having the client render to HTML is wicked fast. The XSLT parser in IE renders asynchronously allowing the results to be displayed before the entire doc is loaded. Of course this is MS specific stuff I've experienced, etc.
SAX is faster for grabbing XML events. While writing a web spider, I was parsing HTML using an HTML parser. I switched from that to regex and saw crawl speed increase significantly. It depends if you need to whole XML doc or not.
You may want to try loading the XML DOM once and serialize the binary. You could then ship the binary around town. Macromedia has some tools like this that can send binary objects to a flash client, etc. Limit the parsing.
Another tip... if you have control over the XML schema, you may want to research how to structure XML for performance. I've heard that attribute heavy XML docs are more efficient than docs with embedded data, etc. Also look into some XML tricks like IDs, etc.
Good luck in your pursuit. Choose your parser carefully. If testing turns out negative, you may just want to use some binary data. XML is a wonderful technology designed to aid in system integration, and ease of use... but it comes at a price.
If you can do with basic parsing, the nanoxml and picoxml libraries will put everything else to shame.
-I like my women like I like my tea: green-
We use Biztalk for a lot of enterprise-level XML parsing, and we get up to 200+ documents parsed per second. Of course, there's a lot of hardware being used - 3 2-processor processing boxes handling the workload, for example. But for a system pushing and pulling messages in and out of a SQL Server database it works pretty well. And these are pretty decently sized documents, doing mapping and using all kinds of functoids and whatnot.
"On the Internet, nobody knows you're a dog!" - a dog