Uhm, correct me if I am wrong, but this would only work for one socket since you have only one parser instance which you feed intermingled fragments of XML from different sockets.
Also, there's a problem in your design in that the events generated by the SAX parser have no association with a session (or their originating socket), which means you'll just have a bunch of events that cannot be explicitly associated and you have to rely on the contents of the events for guessing (which is bad if you are writing a server).
At the very least you need one parser instance per connection.
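To make the point concrete, here is a minimal sketch of one parser instance per connection. It uses Python's standard `xml.parsers.expat` module rather than a Java SAX parser; the `Connection` class and its field names are hypothetical, made up for illustration.

```python
import xml.parsers.expat


class Connection:
    """One incremental parser per connection (hypothetical helper class)."""

    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.events = []  # (conn_id, element name) pairs seen so far
        self.parser = xml.parsers.expat.ParserCreate()
        # The bound method carries `self`, so every event knows its session.
        self.parser.StartElementHandler = self._start

    def _start(self, name, attrs):
        self.events.append((self.conn_id, name))

    def feed(self, data):
        # isfinal=False: more data may still arrive on this connection.
        self.parser.Parse(data, False)


a, b = Connection("a"), Connection("b")
# Fragments from two sockets arrive interleaved, split mid-tag.
a.feed(b"<stream><mess")
b.feed(b"<stream><iq ty")
a.feed(b"age to='x'>")
b.feed(b"pe='get'>")
```

Because each connection owns its parser, the interleaving never mixes the two streams, and each event is tagged with its originating connection.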
The main problem is that you really have to parse everything. If your only business is to route messages (as would be the case in a server), that is bad because the majority of your work will be parsing information you do not need. With framing, XML is still verbose, but you could parse the message much more shallowly.
For instance you could terminate parsing once you've determined the destination. Then the rest is trivial IO. In the above example you can't, and in general you can't. You have to analyze every remaining character to determine when a message ends.
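As a rough sketch of that shallow parsing, assume the transport has already handed over one complete frame as bytes. The function below reads the `to` attribute of the root element and then abandons the parse. The name `destination` and the `_Found` exception are hypothetical; the parser is Python's standard `xml.parsers.expat`.

```python
import xml.parsers.expat


class _Found(Exception):
    """Raised to abandon parsing once the destination is known."""


def destination(frame: bytes):
    """Return the `to` attribute of the frame's root element, or None."""
    result = {}

    def start(name, attrs):
        result["to"] = attrs.get("to")
        raise _Found  # stop here; the handler is not called again

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    try:
        parser.Parse(frame, True)  # the frame is complete, so isfinal=True
    except _Found:
        pass
    return result.get("to")


frame = b"<message to='bob@example.org'><body>" + b"x" * 1000 + b"</body></message>"
target = destination(frame)
```

The router can then forward the frame's bytes unchanged. Without framing this shortcut is unavailable, because the remaining content still has to be parsed just to find where the stanza ends.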
The point isn't really the difference in processing overhead: the point is making the IO subsystem as simple as possible. If you have a trivial way of determining boundaries this will lead to more elegant designs.
Compare the implementation complexity of SMTP versus XMPP for instance. (Only the protocol parts -- how mail is actually stored is a whole other debate).
In the real world, XML is a very verbose protocol, and in most cases it is trivial to store the incoming data in a less space consuming format. Using a SAX parser that is reasonably efficient, the only state you will need to keep track of is namespace declarations and open tags - that is highly unlikely to be much data, and certainly unlikely to get anywhere close to closing the gap between the maximum size of a well parsed data set and the maximum allowed size of a message. As a consequence, a well written server should REDUCE state by parsing as you go, not increase it, and only a complete moron would keep trying to parse the message over and over again from the beginning each time.
Just out of curiosity, do you have some code which demonstrates this? Just something that shows that, for instance, a SAX parser in Java ends up consuming less memory than it would take to buffer the actual XML, when tokenizing a reasonably long stream of Jabber messages (say 100 MB) fed to your XML parser instance in buffer-fulls of random lengths, with messages spread across buffer boundaries. (To emulate network traffic).
(And not with one thread per context. I'd like to see an example of this that can be applied to multiplexing IO)
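The traffic emulation asked for here is easy to sketch. The generator below cuts a byte stream into chunks of random length so that messages straddle buffer boundaries. It is only the test-harness half of such a demonstration, not a memory measurement, and the name `random_chunks` and its parameters are hypothetical.

```python
import random


def random_chunks(data: bytes, max_len: int = 64, seed: int = 0):
    """Yield `data` in pieces of 1..max_len bytes, like reads from a socket."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    pos = 0
    while pos < len(data):
        n = rng.randint(1, max_len)
        yield data[pos:pos + n]
        pos += n


stream = b"<message to='a'><body>hi</body></message>" * 50
chunks = list(random_chunks(stream, max_len=17, seed=1))
```

Each chunk would then be handed to the per-connection tokenizer from a single IO loop, with no thread per connection.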
Indeed, why implement IRC or the Jabber protocol when it has been done before? (Yet people still keep doing it so from this one might conclude they have good reason to do so).
But you are sidestepping the issue. The issue is not why one would reimplement something that already exists, the issue is how the choice affects software quality. To understand how it affects software quality one has to consider what it takes to write correct software.
The idea that you need the complete message before you start doing work on it is flawed - it implies that during sudden bursts of activity, your system will sit mostly idle until complete messages have been received, and then suddenly be swamped with processing, instead of spreading the processing cost over the whole time it takes to receive a message, which could potentially be a "long time" for many clients on slow connections.
Hmm, I can't say that I've seen that happen a lot in practice. I've written a fair share of software that needs to multiplex a large number of simultaneous connections (on the order of 1,000-10,000 per server) and what is usually the killer is "useless context switches" -- either in the OS or locally within an execution context. I've never seen your theoretical scenario crop up in anything but degenerate cases (i.e. tests).
I'm a bit curious: have you observed that this should be a problem in practical systems? If so, what did they do?
Well, IRC employs "the other efficient means for framing": trivial boundary separators.
An IRC message cannot contain a CRLF sequence because this is the sequence used to delimit messages.
I am not saying that this is a fantastic way to do things, but it is significantly easier to implement correctly. (Note that easy to implement correctly is what we are after -- not simply easy to implement).
Consider the differences in implementation complexity between a correct parser that can:
- Recognize a CRLF sequence
- Parse a significant subset of well-formed XML documents.
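For scale, the first of those two parsers fits in a handful of lines. This is a minimal sketch of a CRLF tokenizer for multiplexed IO, with one instance per connection; the class name `LineFramer` is hypothetical and a real server would also cap the line length.

```python
class LineFramer:
    """Splits an incoming byte stream into CRLF-terminated messages."""

    def __init__(self):
        self._buf = b""  # bytes received since the last complete message

    def feed(self, data: bytes):
        """Add one read's worth of bytes; return the messages now complete."""
        self._buf += data
        # Everything before the last CRLF is complete; the tail is kept.
        *lines, self._buf = self._buf.split(b"\r\n")
        return lines


f = LineFramer()
r1 = f.feed(b"NICK al")
r2 = f.feed(b"ice\r\nUSER a")
r3 = f.feed(b" b c\r")          # CR and LF arrive in different reads
r4 = f.feed(b"\nPING x\r\n")
```

The IO layer needs no knowledge of what a message contains, only of the two delimiter bytes.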
I have written a *lot* of protocol implementations for various protocols, usually with scalability in mind, and believe me: it is significantly harder to write an XMPP stanza tokenizer usable for multiplexed IO than the equivalent for IRC.
Yes, you have to parse the content in any case, but lack of proper framing makes the IO system depend to a greater degree on understanding what it is transporting because it has to understand when it has a complete message.
The IO code will have to use the parser to figure out if it has 0, 1 or more stanzas that can be consumed.
This results in inelegant, inefficient and ultimately fragile code.
(Which is OK for most people since most people are going to be writing simple applications that only concern themselves with one or just a handful of connections, but it represents a pain in the neck if you want to create a clean and scalable design)
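A minimal sketch of what that coupling looks like, using Python's standard `xml.parsers.expat`: the IO code has to run a real XML parser and track element depth just to learn how many stanzas a read completed. The class name `StanzaCounter` is hypothetical, and the stream root is simplified to a plain `<stream>` element.

```python
import xml.parsers.expat


class StanzaCounter:
    """Counts complete stanzas, i.e. closed direct children of the root."""

    def __init__(self):
        self.depth = 0
        self.complete = 0
        p = xml.parsers.expat.ParserCreate()
        p.StartElementHandler = self._start
        p.EndElementHandler = self._end
        # Assumption: newer Python/expat builds may delay events for partial
        # tokens and expose this switch; older builds simply lack the method.
        if hasattr(p, "SetReparseDeferralEnabled"):
            p.SetReparseDeferralEnabled(False)
        self._parser = p

    def _start(self, name, attrs):
        self.depth += 1

    def _end(self, name):
        self.depth -= 1
        if self.depth == 1:  # back at the level just below the stream root
            self.complete += 1

    def feed(self, data: bytes) -> int:
        """Parse one read's worth of bytes; return stanzas completed by it."""
        before = self.complete
        self._parser.Parse(data, False)
        return self.complete - before


c = StanzaCounter()
n1 = c.feed(b"<stream><message><bo")
n2 = c.feed(b"dy>hi</body></message><iq/><presence>")
n3 = c.feed(b"</presence>")
```

Every connection now carries a live parser as part of its IO state, which is exactly the dependency on content that proper framing would avoid.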
First off: the only interesting transport for XMPP is the one actually being widely used, so using UDP, HTTP or something else which does provide framing (albeit at the cost of their own set of new problems) is somewhat academic.
Second: just because something is possible or doable doesn't mean it is as simple as it should be. Parsing content for discovering stanza boundaries is inelegant and it is more error prone than it should be and it is more work.
The whole point of using XML in the first place is to make it easier to write software -- not to complicate matters even more.
In this case it ends up making everything more complicated.
Exactly, what you ended up doing is what I've seen a lot of other people do (for XMPP and similar protocols) and what most people suggest when presented with the problem.
This is not a problem that is likely to be of much concern in trivial or naive implementations, but it becomes a major pain in the neck if you try to juggle with more than a few connections.
The problem with the solution you had to resort to is:
- It is a complex solution. The point of using XML is to not have to write a parser, yet writing a parser is what you end up doing.
- It is a fragile solution. XMPP is already on thin ice with regard to XML compliance (problems shining through...), and having multiple ways of finding message boundaries (with different parsers and parser instances) is really, really bad.
It is not likely that you would consider the entire XML spec (or whatever subset of it XMPP professes to use) when you write your protocol tokenizer, so the likelihood of making a mistake is pretty high.
And of course, it is ugly.
In proper protocols you can tokenize frames simply by counting bytes. Then when you have a frame you parse it and it is either correct or wrong, but at least you know you're done with that frame.
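Counting bytes can be sketched as follows. The wire format here is an assumption chosen purely for illustration (a 4-byte big-endian length followed by the payload), not anything XMPP defines, and `encode`, `FrameReader` and `max_frame` are hypothetical names.

```python
import struct

HEADER = struct.Struct("!I")  # assumed format: 4-byte big-endian payload length


def encode(payload: bytes) -> bytes:
    """Prefix a payload with its length."""
    return HEADER.pack(len(payload)) + payload


class FrameReader:
    """Reassembles length-prefixed frames from arbitrary read sizes."""

    def __init__(self, max_frame: int = 65536):
        self._buf = bytearray()
        self._max = max_frame

    def feed(self, data: bytes):
        """Add one read's worth of bytes; return the frames now complete."""
        self._buf += data
        frames = []
        while len(self._buf) >= HEADER.size:
            (n,) = HEADER.unpack_from(self._buf)
            if n > self._max:
                raise ValueError("frame too large")
            if len(self._buf) < HEADER.size + n:
                break  # payload not fully received yet
            frames.append(bytes(self._buf[HEADER.size:HEADER.size + n]))
            del self._buf[:HEADER.size + n]
        return frames


wire = encode(b"<a/>") + encode(b"") + encode(b"<b>x</b>")
reader = FrameReader()
received = []
for i in range(0, len(wire), 3):  # deliver three bytes at a time
    received.extend(reader.feed(wire[i:i + 3]))
```

Whether a frame's content turns out to be valid XML is then a separate question, decided only after the boundary is already known.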
Free is pointless if it is not good as well, and I am not convinced that from a technical point of view, Jabber is quite what everybody was/is hoping for.
Indeed, it is my observation as well that Jabber is somewhat of a bastard protocol in that it uses XML, then doesn't want to be compliant with the spec, and then fails to add the mechanisms needed to make implementation easier.
As mentioned earlier, I looked at what is today known as XMPP a few years ago, and decided it wasn't really worth the hassle because the protocol didn't really solve the problem at hand well.
Jabber has some good design decisions; the sloppy use of XML and its failure to identify important aspects of protocol design are not among them.
I think a revised spec is called for. Jabber has not built up sufficient industrial momentum for it to have reached a point of no return and if they don't fix the issues now they will never be fixed.
A more easily implementable protocol will help adoption, it will encourage more and better implementations and it will make the threshold for inventing your own protocol higher.
I just threw a cursory glance at the XMPP specification and I still can't see any fixes for the framing problem.
I had a look at Jabber years ago, but what put me off what is now known as XMPP was that it didn't solve the problem of framing stanzas. The only way to determine the borders of a stanza, and thus when you have read enough to successfully parse it, was by parsing the content.
When you write a high-performance multiplexing server (for any protocol) you wish to minimize the state associated with each session or connection. I am not sure this is necessarily easy for Jabber. Its lack of proper framing dictates that you need to do some serious thinking about how to avoid wasting a lot of memory and CPU. Not really important if your server has ~100 clients, but when you want to accommodate millions of clients (as must be the goal for any large ISP when choosing an IM architecture), these things translate into dollars.
As someone else pointed out: BEEP solves the framing problem, as does HTTP.
How do you solve the framing problem in XMPP? How would you go about designing a multiplexing implementation that can handle, say, 1000 connections on an 800 MHz P3 without burning a lot of CPU?
(The figure was chosen because I've observed a hub IRC server handle 700-800 client connections and 4 servers on IRCNet while only consuming about 10% CPU in steady state)
This should take next to no CPU at all since it is just IO and some GUI stuff. How about someone who knows the Gaim code does some profiling and finds out where the excessive CPU usage is?
Yeah, I've noticed that too: Gaim seems to be extremely CPU-hungry. Which is odd, since it doesn't really *do* anything. Is there some busy-waiting going on somewhere?
IE has not kept up with development and all the other browsers are bloated or bloating. For some reason people have a really hard time understanding that a browser should be a browser. It doesn't matter that all the extra features don't really enlarge its footprint (which is mostly rather irrelevant) -- what matters is that they take focus away from the work that really needs to be done.
Besides, if you want extra gadgetry in your browser, Firefox has a lot of nice extensions and they are extremely easy to install(1).
--------
1) Except for the fact that the response times from the extension download are horribly slow. Do something about it!
The biggest consideration when choosing one of these libraries is how well you can pick it up and understand it.
Indeed. Important observation.
In addition most libraries impose a way of doing things that might not fit your application. In the C/C++-world this situation often arises. For instance you might want to use two libraries together, that both need to be in charge of thread- or signal handling etc, so you end up kludging together the bits you need and it never feels quite right.
Then, of course, there is the eternal C versus C++ choice. A lot of software platforms (like Apache) are in C, and thus integrating with C++ is a major pain. Not to mention the pain of absorbing libraries into other programming languages like Perl, Python, PHP or Java.
On another note, it certainly would be nice to get more of a standardized set of cross platform libraries on the scale of the Java API. There's no reason why this can't be done. Most of the pieces are already out there.
I don't think it can be done. At least not without an influx of new, bolder and smarter people than the current custodians of C++.
If you look at Java, you have a rather well-defined runtime environment, the process by which additions to the Java standard library are made is slow and deliberate. There is a *lot* of focus on doing things correctly and making sure it is efficient. Some really brilliant people work on Java, and most importantly: because they are so thorough and their stuff works before they declare it a standard part of Java you can build new things on the shoulders of others.
Now look at C++. Not even the first row of building blocks, such as a proper collections framework, is properly in place. STL isn't really it, since for non-trivial uses it quickly gets ugly and even seasoned C++ hackers make beginner mistakes. Also the static code generation paradigm is inelegant, inflexible and, given enough abstractions stacked on top of each other, really hard to work with when something goes wrong.
Without even having a good foundation to build upon, there is little hope that C++ will ever have higher level APIs as part of the standard APIs. You won't see things like networking, concurrent processing, filesystem interfaces, proper internationalization, encryption, XML parsing as a standard part of C++. Or even C for that matter. But you will see a lot of initiatives like APR. And glib. And ACE. And mixing and matching the parts you like will be a serious pain in the neck.
In fact, most people would say that these have no business being part of a language, but I think that Java has proven quite well that they need to be, or else you will severely limit the development and inclusion of further useful abstractions.
It is all in the foundations and the given environment's definition of what is part of it and what is not.
Why does Linux need this? How many people have a connection which is so bad they really benefit from this?
Sure it is always nice to have faster downloads. But is it worth the extra work involved in setting this up both at the distribution point and on the client side?
How fitting: the Mono logo is a profile shot of the retarded cellmate of D.R and Quinch in "The Complete D.R. & Quinch". What was his name again? Pulger or something?
-Bjørn
Miguel's crusade to badly copy where Microsoft has gone before isn't really that productive and it has produced rather a lot of sloppy, unfinished, unpolished software that has more promise than usefulness.
I desperately want this not to be so, but it is.
Microsoft have an important ally in Miguel. It is not necessary to announce vaporware for Linux to frighten off the competition since everyone is already waiting for applications like Evolution to stop sucking so badly.
I've read about 540 pages of Quicksilver now and I have to agree that for the first 300 pages it was a pretty slow read for the most part. The parts with Newton and Waterhouse were very entertaining, but when Stephenson goes off putting things in a bigger historic perspective (or whatever he tries to do), things get a bit boring.
Almost all of book two, where Shaftoe makes an entry, is really good so far. I like Stephenson's way of telling a story. He is good at describing the dynamics of inter-personal relationships and he uses a geeky sort of language that is really funny.
When there's a story to be told, Neal Stephenson is a great writer; when not, you just want to kick him real hard. (Still he is not as bad as le Carré, who has a nasty habit of drowning good plots in the kind of drawn out, mediocre, masturbatory, adjective-slinging twaddle that my teachers were so fond of.)
Still, Quicksilver seems worth reading now that I'm a bit over half way through, and I have already ordered "The Confusion".
I just hope that the Baroque Cycle has an ending, so that it doesn't, like "The young lady's primer", just come to a screeching halt like a bad B-movie that has run out of money.
Michael, why not devote your time to maintaining a personal blog instead of this Slashdot nonsense? I would hate to think that Slashdot was taking time out of your mission to inform the masses of all the worthwhile news that is out there on the web. This is obviously much more important than the Chinese putting a man into orbit.
You do see the difference between developing software and using it, right?
It has to be good as well.
Right now: I would not use XMPP.
I am not being rhetorical. I am just wondering.
...yes, but this is Slashdot and you are Michael, so indeed the "news" on this site is getting less and less newsworthy.
I wrote something about this in my blog a while ago. I think putting your trust in the public is exactly what needs to be done. Don't you?
Here's an idea that me and a friend thought a bit about last year.