Slashdot Mirror


Using XML in Performance Sensitive Apps?

A Parser's Baggage queries: "For the last couple of years I've been working with XML based protocols and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30 or 40 concurrent requests. I've run benchmarks on both Java and C#, and my results show that on a 2 GHz CPU, the upper boundary for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues, and what kinds of arguments do you use to make the performance concerns known to the execs? I'm in favor of using XML for its flexibility, but for performance sensitive applications, the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a 1 or 2 CPU server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."

97 comments

  1. Sure by jsse · · Score: 5, Funny

    <?xml version="1.0" encoding="UTF-8"?>
    <session session="2003-06-27T17:03:39GMT+08:00" session-serialNumber="06302003b01" encode-version="1.8"><structure id="bzip2"><info cdate="2003-07-12T14:57:07+08:00" expiry-date="" id="OBD12" mdate="2003-07-12T14:57:07+08:00" name="" notes="" organization="Sd7+/OtxQ==" version="1.0"/><content code="H4sIAAAAAAAAAMy9CThW2xc/rpQpYxKJvIakEu88IJkz RKiQIXOSMfOskEqGRJJIpshcyjxLokLGIoRkCplDvP/zVhe/Dv /n+77d5/5+nude92zn7LPWXmt91mevvc+++1Vl5I7AhBB0+/v6 G5rpaNAwCBRiY3SRTkwMQid8wtza1NDO3M3UBAIDLk9C0Ajglz xEB4LEwuEQJAoN0SPcBsFh0Dg48F+yEAwUhUMC/6UCIVyfhuDQ CBRwLS4OoTO1NiH0DCH5x8XO9DwdICEchoPQQX//wNCQn78h1n Q0v1qQGDjuzzYUHIMFtSGxoGexSAQS3IZFgNpQ8A3akKg/2mCA NDBwG/rPZ2EwJO5PmWEwFBz0LByHAN0Hx6FB9yFR0D/1ANrgf+ oLA4wFB7ehQc9i4CjQOzBwDEgPLHAjqA2HBr0XB4eC25BIDKgN jQHpi8NB/3wHHJD6T1ngUAT2T92At4L8BQ4Y+M/3wmFQLBTUhg DZHA5DgcYeDsNCQc8Cwvw5pnA4HA16LzDMoHfAMWD5EFDwOxBw 5J8+DkcAwQBqw8DA/eEwoP6QgHagNiQK9CwSAwM/i0OAxhkFQ4 P6QyFAfg9HoeEgPVBAwP3ZhobiQGOP3sBGaBQK9F7ArUD3YcBY AgfcGfRewByg/jAYkD/DMThQ7MOxMAzoPiwS/CwWUATUhgU/i4 OCsAmACBho/HAosB44DAgTAbcC+R8CGP0/bY6AgrETAcWAxh4B xYFiEAGDQ8FtSMSfY4qAoTF/jh8ChoOCZIHDQPEBeAHuz3hDwM GxjwAGHyQzoAioPwQC/qePIxAoUPwiEBgsaEyRUBCOI5CA94La kDjQGCAxCNA7gNtAbSgYCCeBvAvyA4LIoPeiAID+sw0NA+UKBB oBwjoEGngY1IYFjxUGCAdQGxwNkg+zwRhgMAiQjTA4UCwAaA8F 9YfdwK+waCxIDywO7Pc4GCg3InAI8Pjh0FDws1hQHkRCoSAsRg Ie/acsSCgKlOORgEuC78OBxgoJgyP/lA8JhMef44KEYU">

    Hint: The shorter the header, the faster.

    P.S. This is a joke, for the humor-impaired.

    1. Re:Sure by Anonymous Coward · · Score: 0

      The first B in line 10 should be an 'x'; the B will cause a CRC error.

      Can't you get anything right?

  2. using DOM by mlati · · Score: 5, Informative

    1. I use DOM objects, in this case the MSXML free-threaded model, to handle XML strings and read out the string only at the last point.
    2. I would also suggest using wstring/string from the STL, as you can reserve string buffers in advance in case you have to handle the XML as strings. That's if you're using C++; I don't know much about C#/Java, sorry.

    Using this method I have managed to push it to ~200 concurrent requests.

    mlati

    1. Re:using DOM by macrom · · Score: 2, Informative

      I am not 100% sure, but I believe the System.Xml namespace in C# uses DOM. Which is sad because an article a few months back in Windows Developer Journal cited a test where MSXML was the slowest parser around. I believe it was Xerces that ran the fastest.

      As mentioned above, we use std::wstring as the storage mechanism (which isolates developers from the dreaded BSTR that MSXML uses. Ick.), but beware because that isolates your non-C++ users from the interface. We're looking at moving our business rule-enforcing parser to C# for better compatibility between .NET, COM and pure C++ applications.

    2. Re:using DOM by DukeyToo · · Score: 2, Insightful

      If you break it down, there are two basic methods of parsing XML - DOM-based or Stream-based. DOM requires the whole XML document to be loaded in memory, and so is inherently bad for scalability.

      Stream-based parsing combined with XPath processing is the way to go if you just want to get particular elements from the document. Even if you need to parse the whole document, I would still stay with a stream-based method.
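
      A minimal sketch of the stream-based approach, using Python's standard-library iterparse as a stand-in for whatever streaming API your platform offers. The <price> element and the sample document are made up for illustration, and the filter is a plain tag-name test, not a full XPath expression.

```python
import io
import xml.etree.ElementTree as ET


def stream_prices(xml_bytes):
    """Yield the text of every <price> element as the document streams past."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "price":
            yield elem.text
        # Release the text and children of elements we are done with.
        elem.clear()


# Hypothetical sample document.
doc = (
    b"<orders>"
    b"<order><id>1</id><price>9.50</price></order>"
    b"<order><id>2</id><price>3.25</price></order>"
    b"</orders>"
)
prices = list(stream_prices(doc))
```

      The caller only ever sees one element at a time, so the working set stays small even for large documents.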

      --
      Most writers regard truth as their most valuable possession, and therefore are most economical in its use - Mark Twain
    3. Re:using DOM by __past__ · · Score: 1
      Pull parsers have become a little more popular recently. There is a more thorough overview at xml.com, by the way.

      As to your second paragraph, I don't seem to get what you are talking about. Stream-based APIs and XPath generally don't mix at all - how should an XPath expression like //foo[position()=last()] be handled in, say, a SAX handler?

      There is, however, some kind of middle ground, namely Streaming Transformations for XML, an XSLT-ripoff based on SAX with a limited XPath lookalike. Quite useful, IMHO.

    4. Re:using DOM by DukeyToo · · Score: 1

      I do not know of any implementations, but I do not think there is anything inherent about a stream that prevents a single XPath expression from being evaluated. The stream just has to skip over parts of the document that are not relevant to the XPath expression.

      When I wrote my comment I thought that .NET had such a beast, but further investigation showed it does not. In any case, I do not think it could be based off of something derived from SAX, but it could be derived from an XmlReader (.NET object).

      I get your point about //foo[position()=last()], i.e. the stream would be past where it needs to be before it knew that it should have returned something. However, that is a minor technical hurdle.

      --
      Most writers regard truth as their most valuable possession, and therefore are most economical in its use - Mark Twain
    5. Re:using DOM by Ed+Avis · · Score: 1

      Probably the fastest XML parser possible is FleXML. You feed it the DTD for your format and it generates C code a la lex/yacc.

      --
      -- Ed Avis ed@membled.com
  3. XML is just hard to parse by PD · · Score: 2, Insightful

    It's hard to parse. That takes cycles. You can probably tweak the parsing to make it faster, but that might not get you from 20 concurrent to 2000 concurrent.

    You've got two choices. More processors, which are pretty cheap right now; or a simpler and more specialized language to replace XML.

    1. Re:XML is just hard to parse by archeopterix · · Score: 4, Informative
      "It's hard to parse. That takes cycles. You can probably tweak the parsing to make it faster, but that might not get you from 20 concurrent to 2000 concurrent."
      In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke).

      The problem with perceived XML inefficiency is that many implementations build a whole parse tree in memory - that's slow mostly because of node allocations/deallocations. Removing the intermediary parse tree decreased CPU time per request by a factor of 15 in my application.
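
      A rough sketch of the "recognize tags, match them with a stack" idea, written in Python for brevity. It only checks that tags nest properly and builds no tree. The regexp is an assumption of this sketch and deliberately ignores comments, CDATA sections, DTDs and a '>' inside attribute values, so it is not a conforming XML parser.

```python
import re

# Opening, closing and self-closing tags. Attributes are skipped, not validated.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:\-]*)[^<>]*?(/?)>")


def tags_balanced(text):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # <br/> opens and closes at once
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything left on the stack was never closed
```

      For example, tags_balanced('<a><b/><c x="1">text</c></a>') is True, while tags_balanced("<a><b></a></b>") is False.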

    2. Re:XML is just hard to parse by clintp · · Score: 4, Insightful
      "In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke)."
      SHHH! Don't say that too loudly!

      The XML Police that exist in several communities will come down on you like flies on manure. "You can't parse XML in regexps! That's not really parsing! You need to use the standard-flavor-of-the-month XML libraries for your language (which of course, may need dozens of prerequisite libraries)! What about CDATA? DTDs?! Encodings!? OH THINK OF THE CHILDREN!"

      <stage_whisper>But in my experience, most of the time, you're right</stage_whisper>

      --
      Get off my lawn.
    3. Re:XML is just hard to parse by archeopterix · · Score: 2, Interesting

      Well, I wasn't really advocating writing your own XML parser, although if enough parameters are fixed (encoding, namespaces and such) and the DTD is simple, that might be an option. I was just trying to say that the parser does not have to be slow. Just try to find a SAX-style parser, one that lets you define events associated with tags (parsing on-the-fly) instead of one that slurps an XML file and produces a DOM-tree out of it. While the tree might prove more convenient (you can traverse it in all directions), its construction and destruction might be expensive.
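
      To make "events associated with tags" concrete, here is a small sketch using Python's standard xml.sax module. The TagEvents class, the callbacks table and the <price> element are hypothetical names chosen for this example. No tree is built, and each callback fires as soon as its element closes.

```python
import xml.sax


class TagEvents(xml.sax.ContentHandler):
    """Call a registered callback whenever a matching element closes."""

    def __init__(self, callbacks):
        super().__init__()
        self.callbacks = callbacks  # element name -> function(text)
        self._text = []

    def startElement(self, name, attrs):
        self._text = []  # start collecting text for the new element

    def characters(self, content):
        self._text.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        callback = self.callbacks.get(name)
        if callback is not None:
            callback("".join(self._text).strip())


seen = []
handler = TagEvents({"price": lambda text: seen.append(float(text))})
xml.sax.parseString(
    b"<orders><order><price>9.5</price></order>"
    b"<order><price>3.25</price></order></orders>",
    handler,
)
```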

    4. Re:XML is just hard to parse by Anonymous Coward · · Score: 1, Insightful

      Why not use a simpler, easier to parse, more general language?

      Sexp parsing libraries exist for Lisp (duh), Scheme, Java, C, Perl, Python.

    5. Re:XML is just hard to parse by andrewl6097 · · Score: 2, Insightful

      Even writing your own parser isn't entirely a bad idea. It depends on your message size. A few months ago, in an all-night hacking session, I whipped up a SAX parser that was over 3 times faster than expat for messages under a certain amount (roughly 200 bytes, IIRC). Often parsers will bog down because they have lots of features most people don't need - like namespaces for instance.

    6. Re:XML is just hard to parse by Viol8 · · Score: 3, Insightful

      In a protocol designed for efficiency you shouldn't have to parse anything at all!
      If a binary protocol were used you would, for example, use 1 char to represent the field types,
      another to represent the record types and so forth. If you put all this into a packet that can be DIRECTLY mapped on a C structure you'll
      save god knows how many cycles. I like the way you say you just have to recognise tags. Have you any idea of the amount of
      processing involved in even simple regexp matching?? This is the problem when high level coders try to design low level
      systems, they simply don't have a clue how things really work and assume that the high level procedures/objects that they work with
      are some sort of magic that "just happens" and you can use them everywhere with no performance degradation.
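
      A sketch of the kind of fixed-layout packet described here, using Python's struct module in place of a direct C struct cast. The field layout (record type, field type, account id, amount) is invented for illustration. The "!" prefix selects network byte order with no padding, so the bytes on the wire do not depend on the compiler or CPU.

```python
import struct

# Hypothetical layout: record type (1 byte), field type (1 byte),
# account id (unsigned 32-bit), amount in cents (signed 64-bit).
RECORD = struct.Struct("!BBIq")


def encode(record_type, field_type, account, cents):
    """Pack one record into its fixed 14-byte wire form."""
    return RECORD.pack(record_type, field_type, account, cents)


def decode(packet):
    """Unpack one record, rejecting packets of the wrong size."""
    if len(packet) != RECORD.size:
        raise ValueError("unexpected packet length")
    return RECORD.unpack(packet)


packet = encode(1, 7, 123456, -2500)
fields = decode(packet)
```

      Decoding is a single fixed-size unpack with no tokenizing, which is where the cycle savings come from.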

    7. Re:XML is just hard to parse by jovlinger · · Score: 1
      yes...

      I like that idea. Let's map the input directly to a c struct. For complicated items containing lists with interrelationships, you just map it to an array of such structs. The items just store offsets, so you can just add that offset to the base pointer to get the referred item.

      ... or any other item in your address space.

      This idea is perhaps the most braindead idea I have ever heard. Completely throw out security for a bit of efficiency?

      Of course you could validate your data structure, but since it is no longer described by a language, YOU HAVE TO WRITE YOUR PARSER YOURSELF. At the bit level, no less. No automatic correctness guarantee, no performance enhancements.

      Not only are you addressing the wrong problem (it has already been convincingly suggested that allocation overhead is the likely culprit), you've suggested perhaps the most insecure solution possible, and one whose secure implementation would likely be slower than what it is supposed to fix.

      Don't program anything I'd use. Please.

    8. Re:XML is just hard to parse by Viol8 · · Score: 0, Flamebait

      Throw out security? Wtf are you talking about?? Anything can be encrypted if that's what you want.
      Do you think ssh uses XML?? Don't be a fuckwit, go get a clue and if you manage to find one then get back to me.

    9. Re:XML is just hard to parse by jovlinger · · Score: 1

      I take it you are unfamiliar with the buffer overflow problem.

      SSH and encryption do nothing against attacks against the machine, only messages in transit.

      Another case of wrong problem, wrong solution.

      My request stands.

    10. Re:XML is just hard to parse by Anonymous Coward · · Score: 0

      err, sorry, bloatware has nothing to do w/ buffer overflows. please don't program anything *I* use, XML man.

    11. Re:XML is just hard to parse by moncyb · · Score: 1

      PD wrote: "It's hard to parse. That takes cycles."

      "In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke)."

      The cycles he was talking about were obviously CPU processing cycles. Show me a CPU which has opcodes for regular expressions. Do you even know enough about how processors work to tell which operations will require more processing time? Even a line by line text file is easier to process than XML.

    12. Re:XML is just hard to parse by Hognoxious · · Score: 1
      "I like that idea. Let's map the input directly to a c struct. For complicated items containing lists with interrelationships, you just map it to an array of such structs. The items just store offsets, so you can just add that offset to the base pointer to get the referred item. ... or any other item in your address space."
      Replace 'c struct' with 'cobol record' and you've pretty much got EDI, which has been around since before the people who invented XML were born. And it worked, and still does, for the purposes it was designed for.
      "most insecure solution possible"
      Only since someone (you) introduced pointers into a file, FFS. Or was this just a strawman? In either case, you've proved beyond doubt that you haven't got a clue WTF you're talking about.
      "Don't program anything I'd use. Please."
      I won't. But your salary (if you ever get one) will probably be paid by a mechanism closer to what Viol8 suggested than XML.
      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    13. Re:XML is just hard to parse by jovlinger · · Score: 1

      Ok. I know nothing about COBOL, so I can't discuss that. So I'll stick to XML.

      Unless you are importing completely trivial, flat, records (in which case using xml seems like overkill), you need a nesting structure... let's say lists. Since you don't know the length of the list when writing your program, you need to dynamically allocate it. That sounds like a pointer. If you allocate it on the stack, read on.

      So now you need to make sure that all uses of that pointer (ie, references from one item to another) are well formed, and don't go beyond the end of the list. If they do, and you're on the stack, Bad Things can happen. If it's on the heap, you'll probably just crash.

      Of course, you can avoid this by validating the binary structure before using it. But then you're back to the parsing you wanted to avoid.
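
      As a sketch of what "validating the binary structure" means in practice, here is a made-up format in Python: a 2-byte item count followed by items that each hold a value and the index of another item. Every length and every cross-reference is checked before anything is used. The layout and the names HEADER, ITEM and parse_items are assumptions of this example.

```python
import struct

HEADER = struct.Struct("!H")   # number of items that follow
ITEM = struct.Struct("!iH")    # value, index of the item it refers to


def parse_items(buf):
    """Decode the item list, refusing any reference outside the list."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    (count,) = HEADER.unpack_from(buf, 0)
    if len(buf) != HEADER.size + count * ITEM.size:
        raise ValueError("length does not match item count")
    items = []
    for i in range(count):
        value, ref = ITEM.unpack_from(buf, HEADER.size + i * ITEM.size)
        if ref >= count:
            raise ValueError("item %d refers past the end of the list" % i)
        items.append((value, ref))
    return items


good = HEADER.pack(2) + ITEM.pack(10, 1) + ITEM.pack(20, 0)
items = parse_items(good)
```

      The checks are cheap, but they are still a parse of sorts, which is the point being made above.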

      Additionally, this approach brings in endian ordering issues. I doubt the int representation in a struct is the same across PPC, x86, and big iron. Likely COBOL mandates a byte ordering.

      I'll hazard a guess that the "purposes it was designed for" for Cobol's records and XML are not the same. Notably, the latter is designed for cross-platform data representation and transport. Hence we assume it comes from untrusted sources.

      Note that I have at no point said that binary representations are a bad thing. I just doubt very much my salary will be paid by using unvalidated input from untrusted sources. At least for very long.

      However, I may go and write a binary-format parser generator, if such a thing doesn't already exist. I imagine it does, which is a shame: it would be fun figuring out how to represent the cross linked indices in a flexible way. That way the validator can be built into the parser.

    14. Re:XML is just hard to parse by Arandir · · Score: 2, Interesting

      Maybe it's time someone wrote an intelligent pre-parser. Take a cursory look at the XML and pass it on to an appropriate parser based on encoding, DTD, size, etc. Or run the document through a pipeline, where every single request takes longer to process, but you can have several in the pipe at the same time.

      There's no reason there has to be a single heroic XML parser that does everything.

      --
      A Government Is a Body of People, Usually Notably Ungoverned
    15. Re:XML is just hard to parse by cait56 · · Score: 1

      A binary parsing program? Oh you mean like RPC or CORBA, or any of a thousand existing debugged solutions that are more efficient in terms of processing overhead and network utilization?

      Oh that's right. I forgot. Those were designed by Neanderthals before HTTP existed, so they aren't worth looking at.

    16. Re:XML is just hard to parse by jovlinger · · Score: 1

      Yeah. like those.

      So, then.

      Why isn't RPC-gen used as a container format?

      You've got me thinking, and I'm curious.

    17. Re:XML is just hard to parse by wirde · · Score: 1
      If you need a (complex) binary protocol, why not use ASN.1? Mature, tested, compact (if using Packed Encoding Rules), almost readable (if using Basic Encoding Rules).

      There are many ASN.1 compilers available (most of which have a rather steep licence cost...)

      --
      in GNUin GNUin GNUin GNUin GNUin GNUin GNUin GNUSegmentation fault
    18. Re:XML is just hard to parse by Anonymous Coward · · Score: 0

      Using regexp is a *really* inefficient way to parse XML, because of the look-ahead etc.

      Having said that, I agree that XML is rather heavy-weight for high performance applications not only because of the parse overhead, but also as it tends to increase the bandwidth of the data at least 10-fold.

      While I quite like the idea of DOM, it is just too slow and uses too much memory. With a SAX parser it is easier to control the amount of memory used in an application and hence the speed of parsing.

      But all-in-all, XML is just too FAT!

    19. Re:XML is just hard to parse by Viol8 · · Score: 1

      "Additionally, this approach brings in endian ordering issues. I doubt the int representation
      in a struct is the same accross PPC, x86, and big iron. Likely COBOL mandates a byte ordering"

      When you have time to tear yourself away from your Dummies Guide To Markup Languages I suggest
      you go check out the man pages/help files on the htons(), ntohs(), htonl() and ntohl() C functions.
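
      For readers who have not met these functions: they convert 16-bit and 32-bit integers between the host's byte order and network (big-endian) byte order. A short illustration in Python, whose socket module exposes the same four functions. The port number is arbitrary.

```python
import socket
import struct

port = 8080  # 0x1F90

# Network byte order puts the most significant byte first.
wire = struct.pack("!H", port)
little_endian = struct.pack("<H", port)

# htons()/ntohs() are inverses, whatever the host's native order is.
round_trip = socket.ntohs(socket.htons(port))

# A 32-bit value round-trips the same way through htonl()/ntohl().
round_trip_32 = socket.ntohl(socket.htonl(0x0A000001))
```

      Both ends agree on the wire layout as long as each converts at its own boundary.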

    20. Re:XML is just hard to parse by BlackHawk-666 · · Score: 2, Insightful
      XML is not designed for speed, but for information exchange. Mapping onto a C structure may work well for a single platform and a single compiler but each processor and compiler have their own ideas about ordering of struct members and padding e.g. Intel likes DWORD alignment if available and used to pad as required...not sure about the latest batch of processors and compilers.

      You lose portability between platforms by trying this low level mapping. How well do you think big endian systems will like to share with little endian ones? Portability, readability and exchangeability are the reasons for XML, not flat out speed. That said, we use XSL around here for marking up our web pages and it is lightning fast!

      --
      All those moments will be lost in time, like tears in rain.
    21. Re:XML is just hard to parse by BlackHawk-666 · · Score: 1
      Ah, there's nothing like posting on Slashdot to humiliate yourself amongst your peers. Buffer overflows and stack smashing are the worst security flaws these days. Encryption is there to keep people from reading your cleartext, and is only a sub-set of good security principles. Please, try to post only on topics in which you are knowledgeable.

      P.S. if the other guy returns with that clue I suggest you nab it and try using it for yourself.

      --
      All those moments will be lost in time, like tears in rain.
    22. Re:XML is just hard to parse by BlackHawk-666 · · Score: 1

      Of course those protocols are binary formatted packet streams, they were designed for extremely quick and low cost message passing. You might as well compare XML to GIF files or a JPEG for all the sense that last statement made. Binary formatting is appropriate for low level network transports, DCOM, RPC, CORBA, etc, but is not appropriate for a format that is designed to allow easy interchange of data between completely disparate systems e.g. 32 bit big endian machines and 64 bit little endian machines. Even TCP/IP forces you to convert to and from network endianness before sending out your packets.

      --
      All those moments will be lost in time, like tears in rain.
    23. Re:XML is just hard to parse by Anonymous Coward · · Score: 0

      1) We are _not_ his peers; this man has mistaken this section for the games./.org

      2) He will stalk you.

    24. Re:XML is just hard to parse by ynohoo · · Score: 1

      the big endian/little endian issue only arises if you are passing binary numeric fields - COMP in COBOL, int or integer in C, Pascal etc.

      So the first rule of portability is don't use binary or packed formats - use character based ones. This approach also means you can easily translate ASCII into EBCDIC into Unicode...

    25. Re:XML is just hard to parse by Viol8 · · Score: 1

      Oh for god's sake, as I've pointed out to someone else, go check out the htons(), ntohs(), htonl() and ntohl() functions
      for solving endian issues. Ordering of struct members?? In the C standard it states that all members MUST be laid
      out in memory in the order they're defined in the C code. Otherwise half of the unix networking code would fail!
      Alignment is a non-issue when passing data from one machine to another.

      No doubt you've read about all these terms in some book and think you're being smart but all
      you've done is prove my point about high level coders being clueless.

    26. Re:XML is just hard to parse by BlackHawk-666 · · Score: 1
      I'm already well familiar with the htons() and ntohs() functions, seeing as I've programmed low level network handling code. Those functions aren't really the issue; it's the way some broken (read Microsoft) compilers *don't* follow the standards and lay their structures out differently in memory. You only need 1 compiler on 1 platform to be doing this to end up with a forked code base, because blasting records to the drive or network won't work with these dysfunctional compilers. That is the main issue. Let's not forget different architectures' ideas on the size of an int, so this will need to be handled, possibly with typedefs.

      "No doubt you've read about all these terms in some book and think you're being smart but all you've done is prove my point about high level coders being clueless."

      Your inability to see the value in losing a little performance to gain a lot of compatibility is showing who the real clueless person in this thread is. It's all about the right tool for the right job. XML is not meant for low level networking or for high speed transfers, or for low footprint data storage. What it does, and does well, is allow two disparate systems to speak a rudimentary and easy to parse common language.

      --
      All those moments will be lost in time, like tears in rain.
    27. Re:XML is just hard to parse by Viol8 · · Score: 1

      "it's the way some broken (read Microsoft) compilers *don't* follow the standards and lay their"

      I didn't know about that but it doesn't surprise me. However, just because one company doesn't
      comply with a standard doesn't mean that it shouldn't be used. After all, MS don't follow the
      telnet RFCs to the letter but we still use it.

      "Your inability to see the value in losing a little performace to gain a lot of compatibilty is showing who the real clueless person in this thread is. It's all about the right tool for the right job. XML is not meant for low level networking or for high speed transfers, or for low footprint data storage"

      And I agree with you entirely, but you seem to have forgotten that the original story was
      called "Using XML in Performance Sensitive Apps". In this case the guy was talking about using
      it in a 1000 per sec concurrent request system, which IMO is crazy. Sure, if you want to pass across some large data file in a
      one off transfer or maybe send a few packets a second I would have no issues with using XML, but
      using it for some high throughput quick response system is just folly.

    28. Re:XML is just hard to parse by BlackHawk-666 · · Score: 1
      "And I agree with you entirely, but you seem to have forgotten that the original story was called 'Using XML in Performance Sensitive Apps'. In this case the guy was talking about using it in a 1000 per sec concurrent request system, which IMO is crazy."

      Oh yeah, I forgot about that. In which case the guy is crazy unless he's planning on building a reasonably sized cluster or moving the transforms back onto the client machines. It's easy enough to do 1000 hits/sec, but 1000 page requests/sec is another ballpark altogether. NOTE: even my XBox can handle 850 hits/sec!

      --
      All those moments will be lost in time, like tears in rain.
  4. Is that using SAX or DOM? by KDan · · Score: 4, Insightful

    It might be of some use if you actually told us what libraries you used, what methods, etc, not just "I tried to parse some XML files". Is that result of 20 concurrent requests using a SAX parser or DOM? Are you using the standard Java DOM implementation (slow and bulky), or one of the slicker ones like JDOM, dom4j, etc (there's a bunch you should have a look at)? Another thing you could do to improve performance is to identify the points where you don't really need a DOM (e.g. you're just reading the values once and discarding them) and use a SAX parser instead to fill in a custom class or a hashtable or such.

    Daniel

    --
    Carpe Diem
    1. Re:Is that using SAX or DOM? by jazir1979 · · Score: 1


      I assume the poster has tried a number of methods if they went to the trouble of mentioning pull-parsers.

      I doubt he/she is daft enough to be using a slow DOM implementation in situations where SAX would suffice.

      --
      What's your GCNSEQNO?
    2. Re:Is that using SAX or DOM? by Lechter · · Score: 2, Insightful

      First of all, the people who say that you should simply switch to a structured binary protocol, and get at your messages through casting are right. That'll be a lot faster. But if you're stuck with implementing a WebService then you're stuck with XML.

      As for using DOM, I'd argue that you should never use it in a performance critical application. I understand that you need to refer to different parts of the message concurrently, so an event-based parser alone won't work. But what you ought to consider is using a lighter weight representation of your messages than DOM. After all, DOM gives you access to a lot of information that you really don't need. You might look into XML->object mapping APIs like Castor or maybe Betwixt. Or you could just roll your own. That way you could use a quick push parser like SAX to parse the XML, but still have the ability to access all of the message. You might also want to look into the parameters available in your parser, to try and strip it down... maybe turn off validation, DTDs etc...
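
      A roll-your-own mapping along these lines might look like the sketch below, written in Python rather than Java to keep it short. The Order class, its fields and the element names are hypothetical. A SAX handler fills a plain object in one pass, after which the whole message is available without a DOM.

```python
import xml.sax
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int = 0
    symbol: str = ""
    quantity: int = 0


class OrderBuilder(xml.sax.ContentHandler):
    # element name -> (attribute on Order, converter); assumed for this sketch
    FIELDS = {
        "id": ("order_id", int),
        "symbol": ("symbol", str),
        "qty": ("quantity", int),
    }

    def __init__(self):
        super().__init__()
        self.order = Order()
        self._text = []

    def startElement(self, name, attrs):
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name in self.FIELDS:
            attr, convert = self.FIELDS[name]
            setattr(self.order, attr, convert("".join(self._text).strip()))


def parse_order(xml_bytes):
    """Parse one <order> message straight into an Order object."""
    builder = OrderBuilder()
    xml.sax.parseString(xml_bytes, builder)
    return builder.order


order = parse_order(b"<order><id>42</id><symbol>IBM</symbol><qty>100</qty></order>")
```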

      --
      credo quia absurdum
  5. java and c#? by Anonymous Coward · · Score: 5, Insightful

    well there's your problem.

    With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second, and my code is shitty.

    what do you do with the XML, do you generate HTML from it with XSLT or what?

    another thing to try: intelligently cache your results in shared memory. you can easily double performance or more.
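
    A minimal sketch of result caching, in Python rather than Perl. It uses a plain in-process dict instead of shared memory, and the render callable is a placeholder for whatever expensive parse-and-transform step you have. Both are assumptions of this example. Requests are keyed by a hash of the raw XML, so an identical request is never parsed twice.

```python
import hashlib


class ResponseCache:
    """Cache rendered responses keyed by a digest of the request body."""

    def __init__(self, render):
        self._render = render   # expensive function: bytes -> result
        self._store = {}
        self.hits = 0

    def get(self, xml_bytes):
        key = hashlib.sha256(xml_bytes).digest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._render(xml_bytes)
        return self._store[key]


calls = []


def fake_render(xml_bytes):
    # Stand-in for the real XML parsing and XSLT work.
    calls.append(xml_bytes)
    return xml_bytes.upper()


cache = ResponseCache(fake_render)
first = cache.get(b"<report id='1'/>")
second = cache.get(b"<report id='1'/>")
```

    A real deployment would also need an eviction policy and some way to invalidate stale entries.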

    1. Re:java and c#? by jslag · · Score: 2, Interesting

      "With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second, and my code is shitty."

      Amen. All of my XML processing code for the last year has been written using the above-mentioned tools, and it's been fast enough that I haven't needed to spend time performance tuning.

      See the apache axkit project for more info.

  6. Switch to a custom protocol by setien · · Score: 5, Interesting

    I love XML, and I use it anywhere I can get away with it, but I know from my old job, that switching to a binary protocol that is streamlined for the task at hand can give you performance gains over XML protocols that are just plain ridiculous.
    I think the results we measured were something like 1000 times as many connections on a custom binary protocol over an XML based one.
    That was in C++ mind you. YMMV.

    --
    Give me liberty or give me kill -s 9
    1. Re:Switch to a custom protocol by jo42 · · Score: 0, Redundant


      Here be another voice of reason: don't use XML! XML is a pig. Period.

  7. In the MS Smartphone by samjam · · Score: 1

    The homescreen app in the MS Smartphone has its config specified by XML.

    For speed, and to avoid parser-usage memory leaks that may exist or be introduced by improper usage by other homescreen plugin developers, a separate app loads all the homescreen plugins, feeding them their XML config. This app then streams the plugins out in a binary format (each plugin must support streaming) and then quits, solving any memory leaks.

    Then the homescreen app streams them back in and out again as needed without the xml speed hit or danger of leaks from xml parser.

    1. Re:In the MS Smartphone by BlackHawk-666 · · Score: 2, Funny

      Wouldn't it be better for MS to fix the memory leaks? I have a copy of BoundsChecker they can borrow if they're a little strapped for cash ;->

      --
      All those moments will be lost in time, like tears in rain.
  8. Benchmarks, handmade parser... by Bazzargh · · Score: 4, Informative

    First off, any chance you could post those benchmarks? 20 requests/second seems low, I'm wondering what the rest of the setup was.

    For the first part: we had performance problems on an app where the customer had insisted on xml everywhere. However, in one particularly critical part of the system we were getting hammered by the garbage collection overhead of SAX (it's efficient for text in elements, but not for attribute values or element names).

    Anyway - we knew what was coming into the system as we were also the producers of this xml at an earlier stage. So we wrote a custom SAX parser that only supported ASCII, no DTDs, internal subsets etc; and wrote it to return element/attribute names from a pool (IIRC we used a ternary tree to store this stuff, so we didn't need to create a string to do the lookup).

    It was like night and day. XML parsing dropped from generating 80% of the garbage to about 5% and it just didn't appear on my list of performance issues from then on.

    Java strings do a lot of copying; the point is to get as close as you can to a zero-copy XML parser.
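
    The pooling trick can be sketched as follows. This is Python and uses a simple byte-keyed trie rather than the ternary tree we had, so treat it as an illustration of the idea and not our code. Known names are stored once, and a lookup walks the input buffer in place, returning the shared string without first building a temporary one.

```python
class NamePool:
    """Map a byte range of the input buffer to one shared name string."""

    def __init__(self, names):
        self._root = {}
        for name in names:
            node = self._root
            for byte in name.encode("ascii"):
                node = node.setdefault(byte, {})
            node[None] = name  # terminal marker holds the pooled string

    def lookup(self, buf, start, end):
        """Return the pooled name for buf[start:end], or None if unknown."""
        node = self._root
        for i in range(start, end):
            node = node.get(buf[i])  # indexing bytes yields an int
            if node is None:
                return None
        return node.get(None)


pool = NamePool(["order", "price"])
buf = b"<order><price>"
first = pool.lookup(buf, 1, 6)     # bytes 1..5 spell "order"
again = pool.lookup(buf, 1, 6)
other = pool.lookup(buf, 8, 13)    # bytes 8..12 spell "price"
```

    Every occurrence of the same element name yields the same object, so the parser itself creates no per-element name garbage.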

    You might want to look at switching toolkits entirely as well - GLUE's benchmarks sound a lot better than yours.

    1. Re:Benchmarks, handmade parser... by Twylite · · Score: 2, Interesting

      So what you're saying is that you stopped using XML and used something completely different that has a visual similarity to XML.

      Hint: if it doesn't do unicode, DTDs, CDATA sections and all the other crap, it's not XML.

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    2. Re:Benchmarks, handmade parser... by Anonymous Coward · · Score: 2, Insightful

      What, you mean someone actually does implement all that unicode, DTD, CDATA and other crap into their software? Don't they have anything better to do?

    3. Re:Benchmarks, handmade parser... by Hognoxious · · Score: 1
      unicode
      Unifuckingcode. I hate bastard unicode, I do.
      If summat can't be writ with iso8859-1, it ain't worth the writtin', so help me God.
      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    4. Re:Benchmarks, handmade parser... by Anonymous Coward · · Score: 0


      What's wrong with cp437 if you avoid standards anyway?

  9. profile your application by Bart+van+der+Ouderaa · · Score: 4, Interesting

    Have you profiled your application?
    Do you test on a dedicated test system?

    If you're only getting 20 concurrent users regardless of platform (could be, it really depends on the setup and complexity of the problem), maybe the technology isn't the problem but it could be network etc.

    benchmarking is fine, but if you do it on the whole system you don't know what the problem really is.
    Find out precisely what the problem is (network/xml parser/your app logic /db connection/db speed). Look at your own code with a profiler to see the bottleneck.
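
    As a minimal illustration of what "look at your own code with a profiler" can mean, here is a Python sketch using the standard cProfile module. The functions `parse_xml`, `business_logic` and `handle_request` are made-up stand-ins for a real request path, not anything from the original poster's system.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET


def parse_xml(payload):
    # Stand-in for the XML layer of the application.
    return ET.fromstring(payload)


def business_logic(root):
    # Stand-in for the application's own work on the parsed message.
    return sum(int(item.get("qty")) for item in root.iter("item"))


def handle_request(payload):
    return business_logic(parse_xml(payload))


def profile_requests(payload, repeat=200):
    """Run handle_request under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeat):
        handle_request(payload)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
    return out.getvalue()


PAYLOAD = "<order>" + '<item qty="2"/>' * 500 + "</order>"

if __name__ == "__main__":
    print(profile_requests(PAYLOAD))
```

    The report lists cumulative time per function, which shows whether the parser or the code around it dominates before anyone decides to replace either.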

    If you do end up blaming the parser, change it! (And I don't mean using a different parsing method, as most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than those used as standard (Xerces isn't the fastest Java parser around!). Also look at the most efficient way of using the tree (Java DOM is, as already said, slow in usage), or maybe you can go from SAX directly to your object model without using a tree by writing your own SAX handler.
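
    Going from SAX events straight to an object model, with no intermediate tree, might look like the following Python sketch. The `Order` class and the `<order>`/`<customer>`/`<item>` vocabulary are invented for the example.

```python
import xml.sax
from dataclasses import dataclass, field


@dataclass
class Order:
    customer: str = ""
    items: list = field(default_factory=list)  # (sku, qty) tuples


class OrderHandler(xml.sax.ContentHandler):
    """Fills an Order directly from SAX callbacks; no DOM is ever built."""

    def __init__(self):
        super().__init__()
        self.order = Order()
        self._text = []

    def startElement(self, name, attrs):
        self._text = []  # start collecting text for this element
        if name == "item":
            self.order.items.append((attrs["sku"], int(attrs["qty"])))

    def characters(self, content):
        # May be called several times for one text node.
        self._text.append(content)

    def endElement(self, name):
        if name == "customer":
            self.order.customer = "".join(self._text).strip()


def parse_order(xml_bytes):
    handler = OrderHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.order
```

    The only objects that survive the parse are the ones the application actually wants, which is where the memory and garbage-collection savings over a tree come from.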

    If you can't get a performance gain (which I really doubt), be honest with your client: "If you want to do it that way it's going to cost you" or "it can't be done on one machine". How did they get the idea they could handle 1000s of requests a second anyway? Work on your expectation management (basically, work on making their expectations more realistic). If you promise mountains make sure you can deliver them first. If you can't deliver them make them not want mountains but molehills :-)

  10. So don't use XML. by WasterDave · · Score: 2, Insightful

    I don't understand what the problem is here. You're saying that you like XML, but it's slow. Fine, don't use it. It's not like it's the only tool in existence, is it?

    Dave

    --
    I write a blog now, you should be afraid.
    1. Re:So don't use XML. by Knight2K · · Score: 2, Insightful

      I would guess that using XML is to some degree a political issue that can't be avoided. Which is really symptomatic of the age-old problem of the business and technical sides not really listening to each other.

      --
      ======
      In X-Windows the client serves YOU!
  11. AOLserver and tDOM by Col.+Klink+(retired) · · Score: 2, Informative
    I'm just going to guess at what your problem is since you didn't really tell us. I'm assuming that your application needs to load the entire DOM tree 20 times for 20 concurrent requests and that's taking either too much CPU or too much memory.

    The solution would be to load the DOM in the backend and have front-end applications access it.

    You could try using AOLserver as a multi-threaded web server and tDOM as your DOM processor.

    --

    -- Don't Tase me, bro!

    1. Re:AOLserver and tDOM by BlackHawk-666 · · Score: 1

      With this lousy performance I am starting to wonder if the DOM contains his entire website and he is parsing out the page that he needs to serve. Honestly, XML and XSL are lightning fast on the servers and platform we use, which is just plain old Windows 2000 Server and MSXML 4.0. We're talking hundreds of pages marked up per second, each with maybe a dozen sections that are transformed individually (by our CMS).

      --
      All those moments will be lost in time, like tears in rain.
  12. XmlTextReader by MrProgrammer · · Score: 2, Informative

    Many have asked about what libraries you are using to get at the XML. Loading up a whole DOM document is indeed quite inefficient.

    On the .Net platform, I would suggest using the XmlTextReader class. This class and its brethren are the parsers underlying Microsoft's DOM implementation, and anything else that needs access to XML. The class is noted for its strong performance advantage over loading a DOM or using XPathNavigator - and it is indeed a very lightweight class. It is certainly not as comfortable to use as the DOM, but neither is it incredibly painful, especially if your documents are relatively simple.

    Give XmlTextReader a shot.

    1. Re:XmlTextReader by f00zbll · · Score: 1

      XmlTextReader is a pull parser, right? Therefore it still wouldn't help in situations where the entire message is used by the application. Assuming of course the sender is not including unnecessary data in the message. I believe in those cases, XmlTextReader at best will be equal to DOM. Of course using a different method like .NET Remoting may be an option, if performance is really required and webservices is non-negotiable.

    2. Re:XmlTextReader by MrProgrammer · · Score: 1
      But, unlike the DOM, the XmlTextReader does not have to allocate the entire tree in memory. In fact, it shouldn't have to allocate anything except for the string to hold the text itself. It simply changes a flag to tell what kind of "element" it is looking at (Element, EndElement, CDATA, Comment, etc etc). Even when reading in the entire document, there is less overhead.

      I don't have the time to do it, but I would suggest creating a quick test to compare the relative performance of the DOM and XmlTextReader, or even XPathNavigator. Ultimately, that will provide the most conclusive statement of the ideal parser. However, this document gives a guideline to the appropriate XML api to use. It lists the DOM as 2 to 3 times slower than XmlTextReader, and lists XmlTextReader as the most efficient parser, memory-wise.
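
      For what it's worth, the shape of such a quick test is sketched below in Python, using the standard library's minidom as the tree parser and `ElementTree.iterparse` as a streaming stand-in. These are analogues only, not the .NET DOM or XmlTextReader, so the numbers say nothing about .NET; the document contents are made up.

```python
import io
import timeit
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Synthetic document: 2000 small rows.
DOC = (
    "<rows>"
    + "".join(f'<row id="{i}">value {i}</row>' for i in range(2000))
    + "</rows>"
).encode()


def count_with_dom(data):
    # Builds the whole tree in memory, then queries it.
    dom = xml.dom.minidom.parseString(data)
    return len(dom.getElementsByTagName("row"))


def count_with_pull(data):
    # Streams through the document, discarding each element after use.
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "row":
            count += 1
        elem.clear()
    return count


if __name__ == "__main__":
    for fn in (count_with_dom, count_with_pull):
        seconds = timeit.timeit(lambda: fn(DOC), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s for 20 parses")
```

      Both functions touch every row, which is the "entire message is used" case argued about above, so whatever difference shows up is down to tree construction rather than skipped content.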

      In this MSDNTV episode, the Microsoft developer describes XmlTextReader as "essentially the XML parser for .Net". Thus, the DOM is using XmlTextReader anyway, and any other features provided are unneeded overhead.

      Reading XML with the XmlReader gives some guidelines for using the XmlReader classes.

    3. Re:XmlTextReader by f00zbll · · Score: 1
      I don't have the time to do it, but I would suggest creating a quick test to compare the relative performance of the DOM and XmlTextReader, or even XPathNavigator.

      I've run benchmarks; in situations where the entire document is needed and used, DOM can be better (depends on use case). Of course if you don't need DOM, then don't use it. The challenge from my perspective is this. If you have a consumer which uses objects and your webservice has to reason over that same object model, you'll probably have to use schema to convert the SOAP message to C# objects. Once that is a requirement, your choices are limited. To correctly create an object, you would have to traverse the entire message, instantiate the necessary objects and call the proper method. XmlTextReader is very efficient, right, but a process like schema conversion is heavyweight. One option is to use .NET remoting, if the consumer is a VB client or some other .NET client. Read the article by Sosnoski about different parse techniques. He explains it much better than I can.

  13. Wrong uses of XML by Randolpho · · Score: 5, Insightful

    This is an example of the wrong way to use XML.

    XML is great because it's extensible and a markup language. It's great for storage, configuration files, and certain forms of data transmission (which is just a sub-set of storage).

    What XML is not good for is performance-critical transmission protocols. It's too verbose and too complex, and both are bad for protocols. That is the mistake made by the author of the article. Go with a structured protocol and skip the XML.

    --
    "Times have not become more violent. They have just become more televised."
    -Marilyn Manson
    1. Re:Wrong uses of XML by __past__ · · Score: 2, Insightful
      It's quite funny that you highlight XML being a markup language (or rather, a toolkit to build markup languages), and don't even include document markup as something it's good for.

      Despite all the hype behind XML, markup somehow doesn't really seem to be any more hip than in the dark SGML ages. Sometimes I really wonder why all the data-heads try reinventing ASN.1 with more bloat and complexity so hard.

    2. Re:Wrong uses of XML by Anonymous Coward · · Score: 0

      That's because XML isn't good for markup, since it's a tree-description-language, not a proper markup language. If it were a markup language, I wouldn't need to worry about tree structure.

      XML is too strict for human use, yet badly designed for computer use.

  14. Interesting article by f00zbll · · Score: 2, Informative
    There's an interesting article that compares the different types of parsers and their advantages at a fairly low level. Dennis Sosnoski's article on XML performance was included on IBM's site a while back. It's a worthwhile read.

    I'd have to agree with people's assertion that performance intensive apps should use a custom protocol, preferably binary based, or some kind of delayed stream parser that only accesses the XML node when the app calls for it. I believe Sun has an API in the works for XML stream parsing, JSR 173. It's too bad the JSR is still in public review phase. I've written custom parsers in the past using SAX and it can definitely improve performance if you convert it to an object model. The question is the trade-off between being generalized and performance.

    In the case of a webservice that uses schema, it's going to be hard to get around the performance issue. An obvious solution in situations where XML is required is to send as little as possible and only get the nodes you need. In that respect XPP2 and XmlTextReader help, until you need the entire document and you use the whole document.

  15. Explain more by vadim_t · · Score: 2, Insightful

    First, what does your program do? Why are you so sure XML takes so much time to process? And, is really XML the best format for your application?

    You could get speed improvements by making things simpler. If XML data takes so much to process on your server then I guess you have two possible problems: Either the amount of data is very big, or you're doing something wrong. You don't really have to use every feature of XML in your program.

    Make sure you also understand what XML is for. Sending bitmaps by transferring gigabytes of <pixel r="10" g="100" b="0" /> is definitely not a good use of XML. For some kinds of data perfectly good formats already exist.

    Also, do you really need XML? If it's something time or bandwidth critical, rolling your own could be easier. Especially if you don't need a lot of interoperation with other programs. Binary protocols are quite easy to make extensible, too. For example, you can send everything in a kind of container. Say, a structure with a char or int for a command ID, and a long for a command length. Then put any data inside. That's just 5-8 bytes per header, and should let you add stuff easily.
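
    The container idea can be sketched in a few lines of Python with the struct module. The layout chosen here (1-byte command ID, 4-byte big-endian length, so 5 bytes of header) and the command IDs are assumptions for the example, not a defined protocol.

```python
import struct

# 1-byte command ID followed by a 4-byte payload length, network byte order.
HEADER = struct.Struct("!BI")

CMD_PING, CMD_PUT = 1, 2  # hypothetical command IDs


def pack_frame(command_id, payload):
    return HEADER.pack(command_id, len(payload)) + payload


def iter_frames(buffer):
    """Yield (command_id, payload) for each complete frame in buffer."""
    offset = 0
    while offset + HEADER.size <= len(buffer):
        command_id, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        yield command_id, buffer[offset:offset + length]
        offset += length


def handle(buffer):
    seen = []
    for command_id, payload in iter_frames(buffer):
        if command_id == CMD_PING:
            seen.append(("ping", payload))
        elif command_id == CMD_PUT:
            seen.append(("put", payload))
        # Unknown IDs are skipped using the length field alone.
    return seen
```

    Because every frame carries its own length, an old receiver can step over commands added later, which is the extensibility being described.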

  16. Caching and IO by shadowpuppy · · Score: 1

    In my experience that seems about right. I'm using AxKit with caching shut off and my own Language module and those are about the results I get. For me it's not a big deal since speed isn't that important.

    What little I have looked at the speed issue points to 2 things. First, caching probably helps a lot. Second, it may pay to customize the output code. In the few performance tests I've done, libxml's output code was the main slowdown. Keep in mind that the performance testing was done on C++ code, not Perl code. It may be possible to write something that walks the DOM and spews the result in a much faster fashion.

  17. Event-based parsing...caching...JDOM by The+Mayor · · Score: 1

    Sounds like you make extensive use of DOM-based parsing. Ever look at the memory footprint required to parse a large XML document using a DOM parser? I've seen 1MB XML files that require >100MB of memory to parse to a DOM. Event-based parsing helps a lot here.

    Now, if you need to manipulate the DOM (and thus require DOM-based parsing), I would suggest caching. One thing some of the commercial XML-based databases do is to parse XML files when they are added to the DB, storing the resulting DOM rather than the original XML document. Then, they make heavy use of lazy loading and in-memory caching of the DOM so that the entire DOM doesn't need to be stored in memory (if all you are concerned about is changing a small part of the DOM), and so that frequent accesses to the same nodes can be sped up.

    Finally, you can use one of the non-standard DOM-like parsers, such as JDOM. These things don't require the memory footprint nor the parsing time that traditional DOM parsers require. Good stuff.

    Of course, I would suggest you use a combination of all three. JDOM + stored, post-parsed DOM + in-memory DOM node caching can result in a solution that is as fast as using proprietary binary representations while still retaining all of the flexibility and compatibility of XML.

    --
    --Be human.
  18. S-expressions by toomuchPerl · · Score: 2, Informative
    why even bother w/ XML? S-expressions are truly superior, and much easier to parse. You can write an S-expression parser in about a hundred lines of Perl, and there exist decent libraries or bindings for S-expression parsers available for C, Python, Java, Ruby. It's much faster and the overhead is always less.

    --toomuchPerl

    1. Re:S-expressions by Anonymous Coward · · Score: 0

      But we like the overhead of XML, dammit!

      What about S-exy expressions (sexpressions)?

      (hey (baby) (do you come around (here) much?))

      Rating +1 Not Informative

  19. Parsing isn't the issue by cait56 · · Score: 1

    Even if the parsing could be done in zero execution time, XML is still consuming excessive network bandwidth.

    XML is very flexible, and an excellent solution when flexibility is truly required in what the next data element is.

    However, doubling (or worse) the network bandwidth used in downloading a table in order to allow each record to have a different set of fields is just plain stupid.

    A realistic compromise is to use XML to describe different "row formats" that will be used. And then to deliver each row in a compact tagged binary format.
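
    One way to picture that compromise, as a Python sketch: the row format is described once in XML and compiled into a fixed binary layout, and the rows themselves travel as packed bytes. The `<rowformat>` vocabulary and the type names are invented for the example, and a real version would also prefix each row (or batch) with a tag naming its format.

```python
import struct
import xml.etree.ElementTree as ET

# Assumed mapping from schema type names to struct codes.
TYPE_CODES = {"int32": "i", "float64": "d", "char8": "8s"}

FORMAT_XML = """
<rowformat name="quote">
  <field name="symbol" type="char8"/>
  <field name="price" type="float64"/>
  <field name="volume" type="int32"/>
</rowformat>
"""


def compile_row_format(xml_text):
    """Turn the XML description into field names plus a struct layout."""
    fields = list(ET.fromstring(xml_text).iter("field"))
    names = [f.get("name") for f in fields]
    layout = "!" + "".join(TYPE_CODES[f.get("type")] for f in fields)
    return names, struct.Struct(layout)


def encode_rows(row_struct, rows):
    return b"".join(row_struct.pack(*row) for row in rows)


def decode_rows(names, row_struct, data):
    return [dict(zip(names, values)) for values in row_struct.iter_unpack(data)]
```

    The XML is parsed once per format rather than once per row, so the per-row cost is a fixed-size unpack.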

    Save the full flexibility for transfers that need it, a text document is a great example. There are too many options for what might apply to the next paragraph for a fixed format to be useful.

    1. Re:Parsing isn't the issue by Arandir · · Score: 2, Interesting

      So compress the XML. Since it's text, and usually very regular text, it compresses nicely. A simple pretuned huffman filter will do wonders.

      --
      A Government Is a Body of People, Usually Notably Ungoverned
    2. Re:Parsing isn't the issue by cait56 · · Score: 1

      OK, if you have an extraordinary amount of CPU time to waste, then compressed XML would work just fine.

      You also need to be able to postpone processing until large chunks are received, because compression doesn't work well over small data transfers.

      In the real world however, XML wastes bandwidth and at least some processing power. Dynamic compression would reduce the bandwidth waste (perhaps even eliminate it) but only by increasing the processing power being wasted to a truly bothersome level.

      I have no problems with file transfers delivering XML, just with using XML for what a binary protocol or an adaptation layer (RPC, RMI, CORBA, etc.) should have solved.

      The "benefit" usually cited for using XML is that it looks like HTTP traffic. Bypassing firewall controls is not a benefit.

    3. Re:Parsing isn't the issue by Arandir · · Score: 1

      That's why I said to use a pretuned Huffman tree. It doesn't stress the CPU, and you can use it on your first byte. It won't work for just any random XML stream, but you can tune it for the stuff you have control over. In fact, if you are considering a roll-your-own binary format, you already have control over both ends.
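
      A close cousin of this that is easy to try is zlib's preset dictionary, which swaps the pretuned Huffman tree for a shared sample of the expected markup. It is a different technique from the one described above, but it has the same property of helping from the first byte of a small message. The dictionary contents and message shape below are made up.

```python
import zlib

# Shared in advance by both ends: the markup that every message repeats.
PRESET = b'<order><customer></customer><item sku="" qty=""/></order>'


def compress(message, preset=PRESET):
    compressor = zlib.compressobj(level=6, zdict=preset)
    return compressor.compress(message) + compressor.flush()


def decompress(blob, preset=PRESET):
    decompressor = zlib.decompressobj(zdict=preset)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    msg = b'<order><customer>42</customer><item sku="A1" qty="3"/></order>'
    print(len(msg), len(zlib.compress(msg)), len(compress(msg)))
```

      Both sides must hold exactly the same preset bytes, which fits the "control over both ends" condition.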

      --
      A Government Is a Body of People, Usually Notably Ungoverned
    4. Re:Parsing isn't the issue by BlackHawk-666 · · Score: 1
      A better solution is Gigabit Ethernet. Honestly, you read the crap that is SlashDot and yet complain about wasting bandwidth ;->

      Do you use a text only browser to do all your web browsing or are you downloading some of those nasty bug graphics too? Isn't CSS and tables to layout web pages a waste of bandwidth? Why not get rid of all the formatting tags that bloat web pages too and just have plain black and white web pages with (perhaps) the H1-H6 tags. Bandwidth is cheap and plentiful, especially inside a data processing centre where this is all going to take place.

      --
      All those moments will be lost in time, like tears in rain.
  20. You picked the wrong tools by Voivod · · Score: 2, Informative

    If you are using C/C++ check out gSOAP. It goes real fast, runs on many platforms, and I've used it to talk to Java, PHP, C# etc without a problem. It does about 3000 transactions per second on my little desktop PC. Obviously 100 parallel clients aren't going to get that speed, but it sounds like it will be much faster than what you're using!

    http://www.cs.fsu.edu/~engelen/soap.html

    1. Re:You picked the wrong tools by Anonymous Coward · · Score: 0
      It does about 3000 transactions per second on my little desktop PC

      That's pretty impressive. What kind of system was that on? I've done some benchmarks on a 2GHz AMD system using .NET 1.0, IIS 5.0 and a webservice to open an XML file locally and send it as the response without manipulating the document at all. The numbers I get are 50-65 req/sec with 5 concurrent clients. As the number of concurrent requests drops, the req/sec drops as expected. If I understand the XmlDocument API correctly, it uses XmlTextReader as someone else mentioned. Since I don't manipulate the document or access any nodes, the overhead should be fairly minimal compared to doing something heavy like converting it to a custom object model. I've looked at gSOAP in the past, but I've never used it and haven't benchmarked it. From my limited knowledge of gSOAP from articles about SOAP performance on the internet, gSOAP is still slower than binary protocols like RMI. 3000 requests/sec is pretty good considering it's handling SOAP.

      If you can elaborate more, was the message converted to some object model or used in RPC fashion?

    2. Re:You picked the wrong tools by r4lv3k · · Score: 1

      Also, if you're trying to pass larger chunks of data, gSOAP supports streaming DIME attachments.

      gSOAP is just awesome!

  21. My memory hurts, and like my bladder it leaks. by ratfynk · · Score: 0

    So you are the guys who write .processor code so I have to keep adding ram all the time. .processor languages like xml are really the invention of an evil computer marketing genius. Whoever the SOB is, it is getting silly.

    --
    OH THE SHAME I fell off the wagon and use sigs again!
  22. Proper Parsing by jkichline · · Score: 3, Informative

    I have to agree with many of the comments. The parser you choose is the most important decision. DOM is typically a memory hog and takes time. In my experience the MSXML 4.0 parser is very fast, written in C, etc. DOM is easier to use, but obviously can have some downsides. XML is great for portability and faster development, but performance concerns can arise.

    Find out where the bottleneck lies. If you are running an XSLT processor on the server, that will limit your requests/sec. I've found that streaming XML from the server to a client (such as IE6, gasp) and having the client render to HTML is wicked fast. The XSLT parser in IE renders asynchronously, allowing the results to be displayed before the entire doc is loaded. Of course this is MS specific stuff I've experienced, etc.

    SAX is faster for grabbing XML events. While writing a web spider, I was parsing HTML using an HTML parser. I switched from that to regex and saw crawl speed increase significantly. It depends on whether you need the whole XML doc or not.

    You may want to try loading the XML DOM once and serialize the binary. You could then ship the binary around town. Macromedia has some tools like this that can send binary objects to a flash client, etc. Limit the parsing.

    Another tip... if you have control over the XML schema, you may want to research how to structure XML for performance. I've heard that attribute heavy XML docs are more efficient than docs with embedded data, etc. Also look into some XML tricks like IDs, etc.

    Good luck in your pursuit. Choose your parser carefully. If testing turns out negative, you may just want to use some binary data. XML is a wonderful technology designed to aid in system integration, and ease of use... but it comes at a price.

  23. Would CORBA be useful? by Anonymous Coward · · Score: 0

    From what I've heard, you can get much higher overall data transfer rates using CORBA than XML. It uses a binary protocol, but with a customisable API. Depending on your application, this may be a viable solution - plenty of open-source ORBs are available, with omniorb quite well known as being one of the fastest.

  24. More info needed by wickedhobo · · Score: 2

    We need to know more about what you are doing to really be able to understand.

    For example: Are you serializing XML to/from objects in Java or C#? Are you writing custom serializers? Or are you using the built in introspective type serializers for Objects?

    Are you using Document centric SOAP, in which case you're doing more parsing and logical operations than serialization/deserialization?

    Do you really know that SOAP is your bottleneck? Have you profiled it?

    I'm using SOAP in production with J2EE right now with no problems. We use both Document centric (DOM Element[]) and serialization/deserialization.
    It's fast, without problems. We are using load-balancing/clustering as well, and SOAP does not seem to be the scale-bottleneck for us. The writes/transactions to the database are a bigger problem. Smart usage of caching solves the problem for us.

    --

    --Stupidity is Self Curing!
  25. Parse it, don't check it by RhettLivingston · · Score: 3, Insightful

    Most of the work in an off the shelf XML parser is verifying that the XML is "good" or matches some schema specification. If it's coming from one of your programs and going to one of your programs and you've done reasonable debugging, it's good. You just parse it and use it. Not enough has been done to optimize the "trusted" app communications scenario even though in reality, that's probably 95%+ of the actual usage of XML. Very few sites are actually publishing XML that is really getting used by programs and pages other than the ones they've written.

    Parsing it is very easy and quick if you're in full control of the encoding. You can optimize your parser greatly by choosing not to handle the general case, but to instead handle only what your specific encoder generates.
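
    To make "handle only what your specific encoder generates" concrete, here is a deliberately narrow tokenizer sketched in Python. It assumes the encoder emits only plain elements, double-quoted attributes and text with no entities, CDATA, comments or namespaces; anything outside that is silently skipped, so it is only suitable for input you fully control and, as noted elsewhere in this thread, it is not a conforming XML parser.

```python
import re

# Matches a start/end/empty tag with double-quoted attributes, or a text run.
TOKEN = re.compile(
    r'<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*(/?)>|([^<]+)'
)
ATTR = re.compile(r'([\w.-]+)="([^"]*)"')


def events(text):
    """Yield ("start", name, attrs), ("end", name) and ("text", chars)."""
    for match in TOKEN.finditer(text):
        closing, name, attrs, self_closing, chars = match.groups()
        if chars is not None:
            if chars.strip():  # ignore whitespace-only runs
                yield ("text", chars)
        elif closing:
            yield ("end", name)
        else:
            yield ("start", name, dict(ATTR.findall(attrs)))
            if self_closing:
                yield ("end", name)
```

    There is no well-formedness or schema checking at all here, which is exactly the work being traded away.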

    Use the protocol, pick up the buzz word for your app, but leave the pain of the generalities meant to handle some free data exchange world that is 15 years in the future out. When the semantic net comes about and applications can actually use any XML without needing to be written to use that XML schema, then you can worry about the general case.

  26. You mean like this? by Anonymous Coward · · Score: 0

    > This is the problem when high level coders try to
    > design low level systems, they...assume that the
    > high level procedures/objects that they work with
    > are some sort of magic that "just happens"

    You mean like this RFC:
    http://www.ietf.org/rfc/rfc3252.txt

    which basically turns IPv4 packets into XML? Just imagine sending XML on these networks!;-)

  27. ASN.1 by Steven+Reddie · · Score: 1

    This is somewhat simplified, but code like:
    if (strcmp(tag, "surname") == 0)
        ; // handle surname
    else if (strcmp(tag, "firstname") == 0)
        ; // handle firstname
    is obviously a whole lot slower than code like:
    if (tagByte == TAG_SURNAME)
        ; // handle surname
    else if (tagByte == TAG_FIRSTNAME)
        ; // handle firstname

    The problem with XML is that it is a general-purpose textual encoding, and as with most textual encodings it requires more bytes than a dedicated binary encoding does. The result is that it requires many more cycles to process.

    If speed is of prime importance then don't use XML.

    ASN.1 with its various encodings (BER/DER/PER), as used by PKI standards to encode things like keys and X.509 certificates (to name a very small fraction of what it is used for) can be very compact. It takes quite a bit more effort to understand, but it does result in efficient encodings of data. This is one of the reasons why ASN.1 has been an international standard for many years and is used for protocols in mobile phone networks.
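
    To give a feel for the compactness, here is a tiny hand-rolled DER encoder in Python covering just three universal types (INTEGER 0x02, UTF8String 0x0C, SEQUENCE 0x30). It is a sketch for size comparison only: negative integers and everything else in ASN.1 are left out, and the sample record is invented.

```python
def der_length(n):
    """DER length octets: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag, content):
    # Every DER value is tag, length, then content.
    return bytes([tag]) + der_length(len(content)) + content


def der_integer(value):
    if value < 0:
        raise ValueError("sketch handles non-negative integers only")
    # The extra bit of headroom keeps the sign bit clear.
    return der_tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def der_utf8(text):
    return der_tlv(0x0C, text.encode("utf-8"))


def der_sequence(*items):
    return der_tlv(0x30, b"".join(items))


person = der_sequence(der_utf8("Smith"), der_utf8("Alice"), der_integer(30))
xml_form = (
    b"<person><surname>Smith</surname>"
    b"<firstname>Alice</firstname><age>30</age></person>"
)

if __name__ == "__main__":
    print(len(person), "bytes as DER;", len(xml_form), "bytes as XML")
```

    Each field costs two bytes of overhead (tag plus length) instead of a pair of named tags, and the receiver dispatches on a single byte, as in the comparison above.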

    1. Re:ASN.1 by plummis · · Score: 1

      I wholeheartedly agree. You can run into performance issues using XML in large distributed systems. ASN or TIRPC is the way to go.

  28. Fastest all-around full-featured XML support libs by aminorex · · Score: 3, Informative
    If you really do require full XML support, the fastest libraries are the GNOME libxml et al. See the benchmark results if you don't believe me.

    If you can do with basic parsing, the nanoxml and picoxml libraries will put everything else to shame.

    --
    -I like my women like I like my tea: green-
  29. Yes, I see. So you'll be going with TCP/IP then? by j.e.hahn · · Score: 1

    You do realize that, to some degree, what he described is precisely how TCP/IP's wire protocol was meant to work, right?

    But of course, we all know how awful an idea TCP/IP was.

  30. Biztalk by badfish2 · · Score: 2, Informative

    We use Biztalk for a lot of enterprise-level XML parsing, and we get up to 200+ documents parsed per second. Of course, there's a lot of hardware being used - 3 2-processor processing boxes handling the workload, for example. But for a system pushing and pulling messages in and out of a SQL Server database it works pretty well. And these are pretty decently sized documents, doing mapping and using all kinds of functoids and whatnot.

    --
    "On the Internet, nobody knows you're a dog!" - a dog
  31. Cache by vudmaska · · Score: 1

    Get a 64 bit processor with lots of memory and cache EVERYTHING. I'm using an app that caches the entire database (currently about 10 MB). It's lightning fast for hundreds of users. I suppose thousands could be handled.

    --

    my other sig sucks less

  32. Application design? by r4lv3k · · Score: 1

    Can you be more specific about why your SOAP messages are so large? What do they contain? What platform is giving you such poor performance? BTW it makes great sense to have the ability to use XML to communicate across heterogenous application boundaries, but why not use a framework that abstracts the wire format? For example, you could leverage binary-format remoting on .NET or a similar technology on Java (RMI?) that has the capability to communicate via XML, but can also use its own efficient wire format for the app domain boundary. I agree that you should have the capability to communicate in XML for interoperability, but do not always limit your middleware to the lowest common denominator. You might also want to re-examine your system design to minimize the amount of data passing over the SOAP application boundary. But I can't really tell what you're trying to achieve, in terms of application, processing or bandwidth. r4lv3k

  33. dom4j vs. xerces-j by jbrayton · · Score: 1

    If you do end up blaming the parser, change it! (and i don't mean using a different parsing method as most use a sax parser to generate the tree anyway) there are parsers that are 50% faster than those used as standard (xerces isn't the fastest java parser around!).

    I got enormous performance gains by switching from xerces-j to dom4j in one application. I also found its API much more straightforward.

    On the other hand, I have run into a few bugs in dom4j -- but it was simple enough to fix them and submit patches.

  34. We need more information by davidm25 · · Score: 1

    If you have 20 connections and each one is doing 20mB/s I don't think parser issues are your problem. Similarly it sounds likely that your problem is with the XML implementation (poor memory allocators) rather than XML per se. Or is it that XML is bloating your file format? I haven't had any problems with XML performance on clients running on 33 MHz 68 processors.

  35. Taking a couple of the better ideas by horza · · Score: 1

    One person mentioned using a C struct that you can whack directly into memory, and several others suggested caching the DOM somehow. Of course these combine perfectly together. Get the sender to put a unique id or md5 in the header. If you don't have a file with that name in the cache dir then parse and dump the parsed structure to disc. If the file exists then pull the file into memory and send the rest of the incoming byte stream to /dev/null. Of course caching won't help you if all your incoming XML files are unique.
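
    A minimal in-memory version of that scheme, sketched in Python: the cache is keyed by an id supplied by the sender, falling back to an MD5 of the body when none is given. `ParsedCache` is an illustrative name, and the dict stands in for the on-disc cache directory described above.

```python
import hashlib
import xml.etree.ElementTree as ET


class ParsedCache:
    """Keeps parsed trees keyed by sender-supplied id or MD5 of the body."""

    def __init__(self):
        self._trees = {}
        self.misses = 0  # number of times a real parse was needed

    def get(self, xml_bytes, message_id=None):
        key = message_id or hashlib.md5(xml_bytes).hexdigest()
        tree = self._trees.get(key)
        if tree is None:
            tree = ET.fromstring(xml_bytes)
            self._trees[key] = tree
            self.misses += 1
        return tree
```

    Callers share the cached tree, so they should treat it as read-only. With a sender-supplied id the body does not even need to be read on a hit; with the MD5 fallback it still has to be hashed, which is cheaper than parsing but not free.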

    Phillip.

  36. TIRPC & XDR by Anonymous Coward · · Score: 0

    Take a look at TIRPC and the XDR RFCS. Nothing beats a streamlined binary. Patrick

  37. XML hardware accelerators by Anonymous Coward · · Score: 0

    There are quite a few XML hardware transformation engines on the market now. Has anyone spent any time looking at them? Many claim to process XSLT at wirespeed...

  38. Don't forget CS101 basics when dealing with XML by tangi · · Score: 1
    XML can be very verbose and this could be a problem, especially with parsers doing a lot of copying and languages allocating memory slowly.
    Java strings do a lot of copying, the point is to get yourself as close as possible to a zero-copy xml parser as you can.
    This is true for C++ as well. The std::string(const char *) constructor copies the string. This could be a performance bottleneck with C++ wrapper to C parsing API. I therefore use a patched version of Arabica (C++ SAX2 wrapper to Expat) relying on a ConstString class. It just rocks (used on a top 5 French search engine).

    But that's an evil optimization unless you've already designed your DTD to limit memory allocations. You wouldn't put detailed client information in every order item record when using a database, would you?

    I once attended an international conference where the speaker "proved" XML/XSLT had poor performance... with an example doing a simple lookup in an XML file. The XML data was shamefully unstructured and the lookup algorithm was O(n^2)!

    Design your DTD with the care you naturally take to databases, design your code to avoid multiple passes over the XML and everything should be OK. Never forget things usually go pretty well with databases only thanks to SQL optimizers: most tables and requests are badly designed.

    • Avoid redundant data: factorize them by using a relational-like XML structure, use entities for constants
    • Get rid of any data you could retrieve another way: just put ids in the XML and store detailed persistent data in a hashtable.
    • Don't misuse XML: avoid over-nested structure, prefer attribute to sub-element for singletons, use the ID/IDREF mechanism.
    With these common sense rules, you can achieve several hundreds of requests per second per cpu. Go to XML Patterns for further reading.
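
    The "ids in the XML, details in a hashtable" rule and the O(n^2) lookup trap can be illustrated together with a short Python sketch. The document layout and names are invented: clients are declared once, orders refer to them by id, and the join is done through a dictionary built in a single pass instead of rescanning the client list for every order.

```python
import xml.etree.ElementTree as ET

DOC = """
<orders>
  <clients>
    <client id="c1" name="Acme"/>
    <client id="c2" name="Globex"/>
  </clients>
  <order client="c2" sku="A1"/>
  <order client="c1" sku="B2"/>
  <order client="c2" sku="C3"/>
</orders>
"""


def orders_with_client_names(xml_text):
    root = ET.fromstring(xml_text)
    # One pass to index clients by id: O(n) build, O(1) per lookup.
    clients = {c.get("id"): c.get("name") for c in root.iter("client")}
    # One pass over the orders, resolving each reference via the index.
    return [(o.get("sku"), clients[o.get("client")]) for o in root.iter("order")]
```

    Client details appear once no matter how many orders reference them, and resolving every reference costs a hash lookup rather than a scan.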