Slashdot Mirror


Using XML in Performance Sensitive Apps?

A Parser's Baggage queries: "For the last couple of years I've been working with XML-based protocols, and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30 or 40 concurrent requests. I've run benchmarks on both Java and C#, and my results show that on a 2 GHz CPU, the upper bound for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues, and what kinds of arguments do you use to make the performance concerns known to the execs? I'm in favor of using XML for its flexibility, but for performance-sensitive applications, the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a 1 or 2 CPU server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."

38 of 97 comments

  1. Sure by jsse · · Score: 5, Funny

    <?xml version="1.0" encoding="UTF-8"?>
    <session session="2003-06-27T17:03:39GMT+08:00" session-serialNumber="06302003b01" encode-version="1.8"><structure id="bzip2"><info cdate="2003-07-12T14:57:07+08:00" expiry-date="" id="OBD12" mdate="2003-07-12T14:57:07+08:00" name="" notes="" organization="Sd7+/OtxQ==" version="1.0"/><content code="H4sIAAAAAAAAAMy9CThW2xc/rpQpYxKJvIakEu88IJkz RKiQIXOSMfOskEqGRJJIpshcyjxLokLGIoRkCplDvP/zVhe/Dv /n+77d5/5+nude92zn7LPWXmt91mevvc+++1Vl5I7AhBB0+/v6 G5rpaNAwCBRiY3SRTkwMQid8wtza1NDO3M3UBAIDLk9C0Ajglz xEB4LEwuEQJAoN0SPcBsFh0Dg48F+yEAwUhUMC/6UCIVyfhuDQ CBRwLS4OoTO1NiH0DCH5x8XO9DwdICEchoPQQX//wNCQn78h1n Q0v1qQGDjuzzYUHIMFtSGxoGexSAQS3IZFgNpQ8A3akKg/2mCA NDBwG/rPZ2EwJO5PmWEwFBz0LByHAN0Hx6FB9yFR0D/1ANrgf+ oLA4wFB7ehQc9i4CjQOzBwDEgPLHAjqA2HBr0XB4eC25BIDKgN jQHpi8NB/3wHHJD6T1ngUAT2T92At4L8BQ4Y+M/3wmFQLBTUhg DZHA5DgcYeDsNCQc8Cwvw5pnA4HA16LzDMoHfAMWD5EFDwOxBw 5J8+DkcAwQBqw8DA/eEwoP6QgHagNiQK9CwSAwM/i0OAxhkFQ4 P6QyFAfg9HoeEgPVBAwP3ZhobiQGOP3sBGaBQK9F7ArUD3YcBY AgfcGfRewByg/jAYkD/DMThQ7MOxMAzoPiwS/CwWUATUhgU/i4 OCsAmACBho/HAosB44DAgTAbcC+R8CGP0/bY6AgrETAcWAxh4B xYFiEAGDQ8FtSMSfY4qAoTF/jh8ChoOCZIHDQPEBeAHuz3hDwM GxjwAGHyQzoAioPwQC/qePIxAoUPwiEBgsaEyRUBCOI5CA94La kDjQGCAxCNA7gNtAbSgYCCeBvAvyA4LIoPeiAID+sw0NA+UKBB oBwjoEGngY1IYFjxUGCAdQGxwNkg+zwRhgMAiQjTA4UCwAaA8F 9YfdwK+waCxIDywO7Pc4GCg3InAI8Pjh0FDws1hQHkRCoSAsRg Ie/acsSCgKlOORgEuC78OBxgoJgyP/lA8JhMef44KEYU">

    Hint: The shorter the header, the faster.

    P.S. This is a joke, for the humor-impaired

  2. using DOM by mlati · · Score: 5, Informative

    1. I use DOM objects, in this case the MSXML free threaded model, to handle xml strings and read out the string only at the last point.
    2. I would also suggest using wstring/string from the STL, as you can reserve string buffers in advance in case you have to handle the XML as strings. That's if you're using C++; I don't know much about C#/Java, sorry.

    Using this method I have managed to push it to ~200 concurrent requests.

    mlati

    1. Re:using DOM by macrom · · Score: 2, Informative

      I am not 100% sure, but I believe the System.Xml namespace in C# uses DOM. Which is sad because an article a few months back in Windows Developer Journal cited a test where MSXML was the slowest parser around. I believe it was Xerces that ran the fastest.

      As mentioned above, we use std::wstring as the storage mechanism (which isolates developers from the dreaded BSTR that MSXML uses. Ick.), but beware because that isolates your non-C++ users from the interface. We're looking at moving our business rule-enforcing parser to C# for better compatibility between .NET, COM and pure C++ applications.

    2. Re:using DOM by DukeyToo · · Score: 2, Insightful

      If you break it down, there are two basic methods of parsing XML - DOM-based or Stream-based. DOM requires the whole XML document to be loaded in memory, and so is inherently bad for scalability.

      Stream-based parsing combined with XPath processing is the way to go if you just want to get particular elements from the document. Even if you need to parse the whole document, I would still stay with the stream-based method.
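
The stream-based approach described above can be sketched with Python's standard `xml.etree.ElementTree.iterparse`. This is only an illustration: the `<orders>`/`<price>` document shape and the `stream_prices` name are assumptions, not anything from the original poster's system.

```python
import io
import xml.etree.ElementTree as ET


def stream_prices(xml_bytes):
    """Yield the text of each <price> element as the parser reaches it."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "price":
            yield elem.text
        # Discard the finished element's text and children so completed
        # subtrees do not pile up in memory.
        elem.clear()


# Hypothetical sample document.
SAMPLE = (
    b"<orders>"
    b"<order><id>1</id><price>9.50</price></order>"
    b"<order><id>2</id><price>12.00</price></order>"
    b"</orders>"
)
```

Calling `list(stream_prices(SAMPLE))` collects the two price strings without ever holding a fully populated tree; whether that matters depends on document size.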

      --
      Most writers regard truth as their most valuable possession, and therefore are most economical in its use - Mark Twain
  3. XML is just hard to parse by PD · · Score: 2, Insightful

    It's hard to parse. That takes cycles. You can probably tweak the parsing to make it faster, but that might not get you from 20 concurrent to 2000 concurrent.

    You've got two choices. More processors, which are pretty cheap right now; or a simpler and more specialized language to replace XML.

    1. Re:XML is just hard to parse by archeopterix · · Score: 4, Informative
      It's hard to parse. That takes cycles. You can probably tweak the parsing to make it faster, but that might not get you from 20 concurrent to 2000 concurrent.
      In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke).

      The problem with perceived XML inefficiency is that many implementations build a whole parse tree in memory - that's slow mostly because of node allocations/deallocations. Removing the intermediary parse tree decreased CPU time per request by a factor of 15 in my application.
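
A minimal sketch of the "recognize tags with a regexp, match them with a stack" idea, with no tree built. It is deliberately naive: it assumes no CDATA sections, no tag-like text inside comments, and no `>` inside attribute values. The names `TAG` and `is_balanced` are made up for the example.

```python
import re

# Captures: optional leading "/", the tag name, optional trailing "/".
# Processing instructions and comments do not match because the name
# must start with a letter or underscore.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")


def is_balanced(xml_text):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name, self_closing in TAG.findall(xml_text):
        if self_closing:
            continue  # <br/> opens and closes in one step
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack
```

The same loop could dispatch to handlers instead of just checking balance, which is roughly what a hand-rolled event parser does.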

    2. Re:XML is just hard to parse by clintp · · Score: 4, Insightful
      In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke).
      SHHH! Don't say that too loudly!

      The XML Police that exist in several communities will come down on you like flies on manure. "You can't parse XML in regexps! That's not really parsing! You need to use the standard-flavor-of-the-month XML libraries for your language (which of course, may need dozens of prerequisite libraries)! What about CDATA? DTDs?! Encodings!? OH THINK OF THE CHILDREN!"

      <stage_whisper>But in my experience, most of the time, you're right</stage_whisper>

      --
      Get off my lawn.
    3. Re:XML is just hard to parse by archeopterix · · Score: 2, Interesting

      Well, I wasn't really advocating writing your own XML parser, although if enough parameters are fixed (encoding, namespaces and such) and the DTD is simple, that might be an option. I was just trying to say that the parser does not have to be slow. Just try to find a SAX-style parser, one that lets you define events associated with tags (parsing on-the-fly) instead of one that slurps an XML file and produces a DOM-tree out of it. While the tree might prove more convenient (you can traverse it in all directions), its construction and destruction might be expensive.

    4. Re:XML is just hard to parse by andrewl6097 · · Score: 2, Insightful

      Even writing your own parser isn't entirely a bad idea. It depends on your message size. A few months ago, in an all-night hacking session, I whipped up a SAX parser that was over 3 times faster than expat for messages under a certain size (roughly 200 bytes, IIRC). Often parsers will bog down because they have lots of features most people don't need - like namespaces for instance.

    5. Re:XML is just hard to parse by Viol8 · · Score: 3, Insightful

      In a protocol designed for efficiency you shouldn't have to parse anything at all! If a binary protocol were used you would, for example, use one char to represent the field type, another to represent the record type, and so forth. If you put all this into a packet that can be DIRECTLY mapped onto a C structure you'll save God knows how many cycles.

      I like the way you say you just have to recognise tags. Have you any idea of the amount of processing involved in even simple regexp matching?? This is the problem when high-level coders try to design low-level systems: they simply don't have a clue how things really work, and assume that the high-level procedures/objects they work with are some sort of magic that "just happens" and can be used everywhere with no performance degradation.

    6. Re:XML is just hard to parse by Arandir · · Score: 2, Interesting

      Maybe it's time someone wrote an intelligent pre-parser. Take a cursory look at the XML and pass it on to an appropriate parser based on encoding, DTD, size, etc. Or run the document through a pipeline, where every single request takes longer to process, but you can have several in the pipe at the same time.

      There's no reason there has to be a single heroic XML parser that does everything.

      --
      A Government Is a Body of People, Usually Notably Ungoverned
    7. Re:XML is just hard to parse by BlackHawk-666 · · Score: 2, Insightful
      XML is not designed for speed, but for information exchange. Mapping onto a C structure may work well for a single platform and a single compiler but each processor and compiler have their own ideas about ordering of struct members and padding e.g. Intel likes DWORD alignment if available and used to pad as required...not sure about the latest batch of processors and compilers.

      You lose portability between platforms by trying this low-level mapping. How well do you think big-endian systems will like to share with little-endian ones? Portability, readability and exchangeability are the reasons for XML, not flat-out speed. That said, we use XSL around here for marking up our web pages and it is lightning fast!
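
The padding and byte-order concerns above can be seen with Python's standard `struct` module. The record layout here (a 1-byte type code followed by a 32-bit value) is a hypothetical example, not a format anyone in the thread described.

```python
import struct

NATIVE = struct.Struct("@BI")   # this machine's byte order AND alignment
NETWORK = struct.Struct("!BI")  # big-endian, no padding
LITTLE = struct.Struct("<BI")   # little-endian, no padding

record = (7, 1)  # (type code, value)

# The native layout may insert padding after the 1-byte field, so its size
# depends on platform and compiler; the explicit layouts are always 5 bytes.
native_size = NATIVE.size
wire_big = NETWORK.pack(*record)
wire_little = LITTLE.pack(*record)
```

The two explicit encodings of the same record differ byte for byte, which is why a binary protocol has to pin down order and padding in its spec rather than mirror whatever one compiler produces.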

      --
      All those moments will be lost in time, like tears in rain.
  4. Is that using SAX or DOM? by KDan · · Score: 4, Insightful

    It might be of some use if you actually told us what libraries you used, what methods, etc, not just "I tried to parse some XML files". Is that result of 20 concurrent requests using a SAX parser or DOM? Are you using the standard Java DOM implementation (slow and bulky), or one of the slicker ones like JDOM, dom4j, etc (there's a bunch you should have a look at)? Another thing you could do to improve performance is to identify the points where you don't really need a DOM (e.g. you're just reading the values once and discarding them) and use a SAX parser instead to fill in a custom class or a hashtable or such.

    Daniel

    --
    Carpe Diem
    1. Re:Is that using SAX or DOM? by Lechter · · Score: 2, Insightful

      First of all, the people who say that you should simply switch to a structured binary protocol, and get at your messages through casting are right. That'll be a lot faster. But if you're stuck with implementing a WebService then you're stuck with XML.

      As for using DOM, I'd argue that you should never use it in a performance-critical application. I understand that you need to refer to different parts of the message concurrently, so an event-based parser alone won't work. But what you ought to consider is using a lighter-weight representation of your messages than DOM. After all, DOM gives you access to a lot of information that you really don't need. You might look into XML->object mapping APIs like Castor or maybe Betwixt. Or you could just roll your own. That way you could use a quick push parser like SAX to parse the XML, but still have the ability to access all of the message. You might also want to look into the parameters available in your parser, to try and strip it down...maybe turn off validation, DTDs etc...
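
A "roll your own" XML->object mapping on top of a push parser might look roughly like this, using Python's standard `xml.sax`. The `Order` class, its fields, and the message layout are all invented for illustration; Castor and Betwixt are Java libraries and work differently.

```python
import xml.sax
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str = ""
    customer: str = ""
    total: float = 0.0


class OrderHandler(xml.sax.ContentHandler):
    """Fill an Order directly from SAX events; no DOM tree is built."""

    def __init__(self):
        super().__init__()
        self.order = Order()
        self._text = []

    def startElement(self, name, attrs):
        self._text = []  # start collecting text for this element
        if name == "order":
            self.order.order_id = attrs.get("id", "")

    def characters(self, content):
        self._text.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        value = "".join(self._text).strip()
        if name == "customer":
            self.order.customer = value
        elif name == "total":
            self.order.total = float(value)


def parse_order(xml_bytes):
    handler = OrderHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.order
```

Once parsing finishes, every field of the message is reachable through a plain object, which covers the "access all of the message" need without DOM's per-node overhead.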

      --
      credo quia absurdum
  5. java and c#? by Anonymous Coward · · Score: 5, Insightful

    well there's your problem.

    With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second. and my code is shitty.

    what do you do with the XML, do you generate HTML from it with XSLT or what?

    another thing to try: intelligently cache your results in shared memory. you can easily double performance or more.
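
The caching suggestion, in its simplest form, could be sketched as below. Note the assumption: this is a per-process cache built on `functools.lru_cache`, not shared memory across worker processes as the comment has in mind, and `render` with its request shape is hypothetical.

```python
import functools
import xml.etree.ElementTree as ET


@functools.lru_cache(maxsize=1024)
def render(request_xml: bytes) -> str:
    """Parse a request and build a response.

    Byte-identical requests are answered from the cache, so the XML is
    only parsed the first time a given request is seen.
    """
    root = ET.fromstring(request_xml)
    return "Hello, " + root.findtext("name", default="stranger")
```

This only pays off when identical requests recur; `render.cache_info()` reports the hit rate, which is worth checking before relying on it.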

    1. Re:java and c#? by jslag · · Score: 2, Interesting

      With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second. and my code is shitty.

      Amen. All of my XML processing code for the last year has been written using the above-mentioned tools, and it's been fast enough that I haven't needed to spend time performance tuning.

      See the Apache AxKit project for more info.

  6. Switch to a custom protocol by setien · · Score: 5, Interesting

    I love XML, and I use it anywhere I can get away with it, but I know from my old job, that switching to a binary protocol that is streamlined for the task at hand can give you performance gains over XML protocols that are just plain ridiculous.
    I think the results we measured were something like 1000 times as many connections on a custom binary protocol as on an XML-based one.
    That was in C++ mind you. YMMV.

    --
    Give me liberty or give me kill -s 9
  7. Benchmarks, handmade parser... by Bazzargh · · Score: 4, Informative

    First off, any chance you could post those benchmarks? 20 requests/second seems low, I'm wondering what the rest of the setup was.

    For the first part: we had performance problems on an app where the customer had insisted on XML everywhere. However, in one particularly critical part of the system we were getting hammered by the garbage collection overhead of SAX (it's efficient for text in elements, but not for attribute values or element names).

    Anyway - we knew what was coming into the system as we were also the producers of this xml at an earlier stage. So we wrote a custom SAX parser that only supported ASCII, no DTDs, internal subsets etc; and wrote it to return element/attribute names from a pool (IIRC we used a ternary tree to store this stuff, so we didn't need to create a string to do the lookup).

    It was like night and day. XML parsing dropped from generating 80% of the garbage to about 5% and it just didn't appear on my list of performance issues from then on.

    Java strings do a lot of copying; the point is to get as close to a zero-copy XML parser as you can.

    You might want to look at switching toolkits entirely as well - GLUE's benchmarks sound a lot better than yours.

    1. Re:Benchmarks, handmade parser... by Twylite · · Score: 2, Interesting

      So what you're saying is that you stopped using XML and used something completely different that has a visual similarity to XML.

      Hint: if it doesn't do Unicode, DTDs, CDATA sections and all the other crap, it's not XML.

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    2. Re:Benchmarks, handmade parser... by Anonymous Coward · · Score: 2, Insightful

      What, you mean someone actually does implement all that unicode, DTD, CDATA and other crap into their software? Don't they have anything better to do?

  8. profile your application by Bart+van+der+Ouderaa · · Score: 4, Interesting

    Have you profiled your application?
    Do you test on a dedicated test system?

    If you're only getting 20 concurrent users regardless of platform (could be, it really depends on the setup and complexity of the problem), maybe the technology isn't the problem; it could be the network, etc.

    Benchmarking is fine, but if you do it on the whole system you don't know what the problem really is.
    Find out precisely what the problem is (network/xml parser/your app logic /db connection/db speed). Look at your own code with a profiler to see the bottleneck.
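
As a rough illustration of profiling before blaming the parser, here is a sketch using Python's standard `cProfile` and `pstats`. The `handle_request` function is a stand-in, not the poster's code; a real handler would include the network, database and application logic being compared.

```python
import cProfile
import io
import pstats
import xml.dom.minidom


def handle_request(xml_bytes):
    """Stand-in handler: parse the message, then do trivial 'business logic'."""
    doc = xml.dom.minidom.parseString(xml_bytes)
    return len(doc.getElementsByTagName("item"))


def profile_requests(xml_bytes, repeat=200):
    """Run the handler repeatedly and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeat):
        handle_request(xml_bytes)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()
```

The report shows how much of `handle_request`'s time is spent inside the parser versus everything else, which is the question the benchmark alone cannot answer.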

    If you do end up blaming the parser, change it! (And I don't mean using a different parsing method, as most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than those used as standard (Xerces isn't the fastest Java parser around!). Also look at the most efficient way of using the tree (Java DOM is, as already said, slow in usage), or maybe you can go from SAX directly to your object model without using a tree, by building your own SAX parser.

    If you can't get a performance gain (which I really doubt), be honest with your client: "If you want to do it that way it's going to cost you" or "it can't be done on one machine". How did they get the idea they could handle 1000s of requests a second anyway? Work on your expectation management (basically, work on making their expectations more realistic). If you promise mountains, make sure you can deliver them first. If you can't deliver them, make them want molehills instead of mountains :-)

  9. So don't use XML. by WasterDave · · Score: 2, Insightful

    I don't understand what the problem is here. You're saying that you like XML, but it's slow. Fine, don't use it. It's not like it's the only tool in existence, is it?

    Dave

    --
    I write a blog now, you should be afraid.
    1. Re:So don't use XML. by Knight2K · · Score: 2, Insightful

      I would guess that using XML is to some degree a political issue that can't be avoided. Which is really symptomatic of the age-old problem of the business and technical sides not really listening to each other.

      --
      ======
      In X-Windows the client serves YOU!
  10. AOLserver and tDOM by Col.+Klink+(retired) · · Score: 2, Informative
    I'm just going to guess at what your problem is since you didn't really tell us. I'm assuming that your application needs to load the entire DOM tree 20 times for 20 concurrent requests and that's taking either too much CPU or too much memory.

    The solution would be to load the DOM in the backend and have front-end applications access it.

    You could try using AOLserver as a multi-threaded web server and tDOM as your DOM processor.

    --

    -- Don't Tase me, bro!

  11. XmlTextReader by MrProgrammer · · Score: 2, Informative

    Many have asked about what libraries you are using to get at the XML. Loading up a whole DOM document is indeed quite inefficient.

    On the .NET platform, I would suggest using the XmlTextReader class. This class and its brethren are the parsers underlying Microsoft's DOM implementation, and anything else that needs access to XML. The class is noted for its strong performance advantage over loading a DOM or using XPathNavigator - and it is indeed a very lightweight class. It is certainly not as comfortable to use as the DOM, but neither is it incredibly painful, especially if your documents are relatively simple.

    Give XmlTextReader a shot.
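
For readers outside .NET, a similar pull-style loop can be sketched with Python's standard `xml.dom.pulldom`. This is an analogue of the reader pattern, not XmlTextReader itself, and the `<feed>`/`<title>` document and the `first_titles` name are assumptions for the example.

```python
from xml.dom import pulldom


def first_titles(xml_text, limit):
    """Pull events one at a time and collect the first `limit` <title> texts."""
    titles = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "title":
            # Build a tiny DOM for just this element, nothing else.
            events.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            titles.append(text)
            if len(titles) >= limit:
                break  # the caller decides when to stop pulling
    return titles
```

The caller drives the loop, so the code reads top to bottom like ordinary iteration rather than as a set of callbacks.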

  12. Wrong uses of XML by Randolpho · · Score: 5, Insightful

    This is an example of the wrong way to use XML.

    XML is great because it's extensible and a markup language. It's great for storage, configuration files, and certain forms of data transmission (which is just a sub-set of storage).

    What XML is not good for is performance-critical transmission protocols. It's too verbose and too complex, and both are bad for protocols. That is the mistake made by the author of the article. Go with a structured protocol and skip the XML.

    --
    "Times have not become more violent. They have just become more televised."
    -Marilyn Manson
    1. Re:Wrong uses of XML by __past__ · · Score: 2, Insightful
      It's quite funny that you highlight XML being a markup language (or rather, a toolkit to build markup languages), and don't even include document markup as something it's good for.

      Despite all the hype behind XML, markup somehow doesn't really seem to be any more hip than in the dark SGML ages. Sometimes I really wonder why all the data-heads try reinventing ASN.1 with more bloat and complexity so hard.

  13. Interesting article by f00zbll · · Score: 2, Informative
    There's an interesting article that compares the different types of parsers and their advantages at a fairly low level. Dennis Sosnoski's article on XML performance was included on IBM's site a while back. It's a worthwhile read.

    I'd have to agree with people's assertion that performance-intensive apps should use a custom protocol, preferably binary, or some kind of delayed stream parser that only accesses the XML node when the app calls for it. I believe Sun has an API in the works for XML stream parsing, JSR 173. It's too bad the JSR is still in the public review phase. I've written custom parsers in the past using SAX, and it can definitely improve performance if you convert it to an object model. The question is the trade-off between generality and performance.

    In the case of a webservice that uses schema, it's going to be hard to get around the performance issue. An obvious solution in situations where XML is required is to send as little as possible and only get the nodes you need. In that respect XPP2 and XmlTextReader help, until you actually need and use the entire document.

  14. Explain more by vadim_t · · Score: 2, Insightful

    First, what does your program do? Why are you so sure XML takes so much time to process? And is XML really the best format for your application?

    You could get speed improvements by making things simpler. If XML data takes so long to process on your server then I guess you have two possible problems: either the amount of data is very big, or you're doing something wrong. You don't really have to use every feature of XML in your program.

    Make sure you also understand what XML is for. Sending bitmaps by transferring gigabytes of <pixel r="10" g="100" b="0" /> is definitely not a good use of XML. For some kinds of data perfectly good formats already exist.

    Also, do you really need XML? If it's something time or bandwidth critical, rolling your own could be easier. Especially if you don't need a lot of interoperation with other programs. Binary protocols are quite easy to make extensible, too. For example, you can send everything in a kind of container. Say, a structure with a char or int for a command ID, and a long for a command length. Then put any data inside. That's just 5-8 bytes per header, and should let you add stuff easily.
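
The container idea above, a command ID plus a length followed by the data, might be sketched like this with Python's `struct`. The 1-byte ID and 4-byte big-endian length are one possible choice within the "5-8 bytes" the comment mentions; `encode` and `decode_all` are hypothetical names, and truncated input is not handled.

```python
import struct

# 1-byte command ID + 4-byte payload length, network byte order: 5 bytes.
HEADER = struct.Struct("!BI")


def encode(command_id, payload):
    """Wrap a payload in the fixed-size header."""
    return HEADER.pack(command_id, len(payload)) + payload


def decode_all(buffer):
    """Split a byte buffer into (command_id, payload) pairs."""
    messages, offset = [], 0
    while offset < len(buffer):
        command_id, length = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        messages.append((command_id, buffer[start:start + length]))
        offset = start + length
    return messages
```

Extensibility comes from the length field: a receiver that does not recognize a command ID can still skip exactly `length` bytes and carry on with the next message.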

  15. S-expressions by toomuchPerl · · Score: 2, Informative
    why even bother w/ XML? S-expressions are truly superior, and much easier to parse. You can write an S-expression parser in about a hundred lines of Perl, and there exist decent libraries or bindings for S-expression parsers available for C, Python, Java, Ruby. It's much faster and the overhead is always less.
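
To give a sense of scale, here is a small S-expression reader, sketched in Python rather than Perl. It handles nested lists, bare symbols, integers and double-quoted strings; escape sequences inside strings are matched but not unescaped, and `parse_sexpr` is just an illustrative name.

```python
import re

# Tokens: "(", ")", a double-quoted string, or a run of other characters.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse one S-expression into nested lists, strings and ints."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # consume the ")"
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok.startswith('"'):
            return tok[1:-1]  # strip the quotes
        try:
            return int(tok)
        except ValueError:
            return tok  # a bare symbol

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return result
```

For instance, `(order (id 17) (customer "Acme Corp"))` comes back as a nested list with `17` already converted to an integer.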

    --toomuchPerl

  16. Re:Parsing isn't the issue by Arandir · · Score: 2, Interesting

    So compress the XML. Since it's text, and usually very regular text, it compresses nicely. A simple pretuned huffman filter will do wonders.
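
A quick way to see how well regular XML compresses is Python's standard `zlib`. Note the swap: zlib implements DEFLATE (LZ77 plus Huffman coding), not the pretuned Huffman filter suggested above, and the helper names are invented for this sketch.

```python
import zlib


def compress_xml(xml_bytes, level=6):
    """Compress an XML message before sending it over the wire."""
    return zlib.compress(xml_bytes, level)


def decompress_xml(blob):
    """Recover the original XML bytes on the receiving side."""
    return zlib.decompress(blob)


# Hypothetical, highly repetitive message of the kind XML protocols produce.
SAMPLE = (
    b"<rows>"
    + b'<row id="1" status="ok"><value>42</value></row>' * 200
    + b"</rows>"
)
```

Compression trades bandwidth for CPU on both ends, so it helps when the wire is the bottleneck and can hurt when parsing already is.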

    --
    A Government Is a Body of People, Usually Notably Ungoverned
  17. You picked the wrong tools by Voivod · · Score: 2, Informative

    If you are using C/C++ check out gSOAP. It goes real fast, runs on many platforms, and I've used it to talk to Java, PHP, C# etc without a problem. It does about 3000 transactions per second on my little desktop PC. Obviously 100 parallel clients aren't going to get that speed, but it sounds like it will be much faster than what you're using!

    http://www.cs.fsu.edu/~engelen/soap.html

  18. Proper Parsing by jkichline · · Score: 3, Informative

    I have to agree with many of the comments. The parser you choose is the most important decision. DOM is typically a memory hog and takes time. In my experience the MSXML 4.0 parser is very fast, written in C, etc. DOM is easier to use, but obviously can have some downsides. XML is great for portability and faster development, but performance concerns can arise.

    Find out where the bottleneck lies. If you are running an XSLT processor on the server, that will limit your requests/sec. I've found that streaming XML from the server to a client (such as IE6, gasp) and having the client render to HTML is wicked fast. The XSLT parser in IE renders asynchronously, allowing the results to be displayed before the entire doc is loaded. Of course this is MS-specific stuff I've experienced, etc.

    SAX is faster for grabbing XML events. While writing a web spider, I was parsing HTML using an HTML parser. I switched from that to regex and saw crawl speed increase significantly. It depends on whether you need the whole XML doc or not.

    You may want to try loading the XML DOM once and serialize the binary. You could then ship the binary around town. Macromedia has some tools like this that can send binary objects to a flash client, etc. Limit the parsing.

    Another tip... if you have control over the XML schema, you may want to research how to structure XML for performance. I've heard that attribute heavy XML docs are more efficient than docs with embedded data, etc. Also look into some XML tricks like IDs, etc.

    Good luck in your pursuit. Choose your parser carefully. If testing turns out negative, you may just want to use some binary data. XML is a wonderful technology designed to aid in system integration, and ease of use... but it comes at a price.

  19. More info needed by wickedhobo · · Score: 2

    We need to know more about what you are doing to really be able to understand.

    For example: Are you serializing XML to/from objects in Java or C#? Are you writing custom serializers? Or are you using the built in introspective type serializers for Objects?

    Are you using Document-centric SOAP, in which case you're doing more parsing and logical operations than serialization/deserialization?

    Do you really know that SOAP is your bottleneck? Have you profiled it?

    I'm using SOAP in production with J2EE right now with no problems. We use both Document centric (DOM Element[]) and serialization/deserialization.
    It's fast, without problems. We are using load-balancing/clustering as well, and SOAP does not seem to be the scale bottleneck for us. The writes/txs to the database are a bigger problem. Smart usage of caching solves the problem for us.

    --

    --Stupidity is Self Curing!
  20. Parse it, don't check it by RhettLivingston · · Score: 3, Insightful

    Most of the work in an off-the-shelf XML parser is verifying that the XML is "good" or matches some schema specification. If it's coming from one of your programs and going to one of your programs and you've done reasonable debugging, it's good. You just parse it and use it. Not enough has been done to optimize the "trusted" app communications scenario even though, in reality, that's probably 95%+ of the actual usage of XML. Very few sites are actually publishing XML that is really getting used by programs and pages other than the ones they've written.

    Parsing it is very easy and quick if you're in full control of the encoding. You can optimize your parser greatly by choosing not to handle the general case, but to instead handle only what your specific encoder generates.

    Use the protocol, pick up the buzz word for your app, but leave the pain of the generalities meant to handle some free data exchange world that is 15 years in the future out. When the semantic net comes about and applications can actually use any XML without needing to be written to use that XML schema, then you can worry about the general case.

  21. Fastest all-around full-featured XML support libs by aminorex · · Score: 3, Informative
    If you really do require full XML support, the fastest libraries are the GNOME libxml et al. See the benchmark results if you don't believe me.

    If you can do with basic parsing, the nanoxml and picoxml libraries will put everything else to shame.

    --
    -I like my women like I like my tea: green-
  22. Re:In the MS Smartphone by BlackHawk-666 · · Score: 2, Funny

    Wouldn't it be better for MS to fix the memory leaks? I have a copy of BoundsChecker they can borrow if they're a little strapped for cash ;->

    --
    All those moments will be lost in time, like tears in rain.
  23. Biztalk by badfish2 · · Score: 2, Informative

    We use Biztalk for a lot of enterprise-level XML parsing, and we get up to 200+ documents parsed per second. Of course, there's a lot of hardware being used - 3 2-processor processing boxes handling the workload, for example. But for a system pushing and pulling messages in and out of a SQL Server database it works pretty well. And these are pretty decently sized documents, doing mapping and using all kinds of functoids and whatnot.

    --
    "On the Internet, nobody knows you're a dog!" - a dog