Does the World Need Binary XML?
sebFlyte writes "One of XML's founders says 'If I were world dictator, I'd put a kibosh on binary XML' in this interesting look at what can be done to make XML better, faster and stronger."
For starters, keep Microsoft out of it.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Then what happens, do you base64 the binary xml and wrap it in an ascii xml document?
"Piter, too, is dead."
This will kill XML
Use the Z-modem protocol between Information Superhighway routers to compress the plaintext.
Binary XML = zip file.xml > file.xml.zip
That's all you need. XML compresses great.
On the face of it, compressing XML documents by using a different file format may seem like a reasonable way to address sluggish performance. But the very idea has many people -- including an XML pioneer within Sun -- worried that incompatible versions of XML will result.
I agree with his point.
What's wrong with just compressing the XML as it is with an open and easy-to-implement algorithm like gzip or bzip2?
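For anyone who wants to try the "just compress it" route, here is a minimal sketch using Python's standard gzip and bz2 modules. The sample document is made up for illustration (a repetitive list, which is the favourable case); real ratios depend entirely on your data.

```python
import bz2
import gzip

# Hypothetical sample: a repetitive XML document with 1000 small entries.
rows = "".join(f"<vegetable><name>item{i}</name></vegetable>" for i in range(1000))
xml_bytes = f"<vegetables>{rows}</vegetables>".encode("utf-8")

# Compress the same bytes with both algorithms.
gz = gzip.compress(xml_bytes)
bz = bz2.compress(xml_bytes)

# Sizes in bytes: original, gzip, bzip2.
print(len(xml_bytes), len(gz), len(bz))

# Round trip: the receiver gets back exactly the same text.
assert gzip.decompress(gz) == xml_bytes
assert bz2.decompress(bz) == xml_bytes
```

Note this only addresses size on the wire. The receiver still has to decompress and then parse the text.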
I don't need no instructions to know how to rock!!!!
But make it an open source one...
I guess this is another itch to scratch by the community...
Check out CWXML/BXML. Especially significant though perhaps unintuitive is the savings in compression time from the source data being more compact.
looks like the developer in question is a little too close to his prize development. speeding up xml by removing all the bloat, however that would be accomplished, be it compiling xml into some sort of byte code or whatnot, seems like a much better idea from the client and server point of view. why transfer 100kb of text data when you can send 10kb of binary data for the same message?
Now we can have competing formats of Binary XML. Fuck that human readability bullshit, what we need is to make it so that Apple's Binary XML implementation differs from SUN's Implementation and nothing works with Microsoft's, not even their own files!
Binary XML is nothing new, as I wager that many people here are already using it, albeit unknowingly.
One of the earliest projects that has tried to make a binary XML (as far as I'm aware) was the EBML (Extensible Binary Meta-Language) which is used in the Matroska media container.
...but web servers and browsers can use gzip to reduce the size of the HTML going back and forth, why not have something similar where a web service gzips the XML and the consumer decompresses it?
FTFA "The goal of the Fast Infoset project is to generate interest among developers and eventually create a standardized binary format."
I'm not sure why they think that one has to come before the other.
Frankly, make it a standard so I can write proper code to handle it, and you'll have me (joe random developer) interested.
Binary XML would destroy what makes XML powerful: being able to use vi or emacs to understand its content, no fuss, no adobe reader like software, no nothing.
Somebody fill me in ...
... it's called zipping, most webservers have it as an option to zip the data up as it streams to the client browser
i fail to see the need to have a "binary xml" file format when there are already facilities in place to compress text streams
What would Homer Simpson do if he found out about this news in Springfield? Be creative! Best answer gets 2+ mod points. Good Luck!!!!
For all those going to say this? Read this.
Is binary xml not just a stupid idea that clashes with ASN.1?
ASN.1 is already a standard, used heavily in the smartcard/GSM sim industry.
BER encoded ASN.1 data is just this - a tree structure of values w/ external definitions of data types and structures...
http://www.insidiae.org/~mike/code/asn1dec1-00.00.01.zip
Programs written in assembly can run faster than programs written in C, but it's easier for someone to open a .c file and figure out what's going on.
I'm sure when C came out, the argument was similar that the performance hit doesn't make up for the readability or cross compatibility. But as computers and network connections became faster, C becomes a more viable alternative.
Text compresses quite well, especially redundant text like the tags. So why not just leave XML alone and compress it at the transportation level with protocols like sending it as a zip, let v.92 modems do it automatically, or whatever. No need to touch XML itself at all.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Would you want to store a .bmp as a series of words like pixel(253,8764) = Black?
Some things are better left in binary form, and if XML is going to be used for data transportation between programs then it needs to support binary data.
welcome our binary XML overlords.
But secondly, no, you don't need Binary XML, all you need to do is Gzip it on the wire. It gets as small as Binary XML.
One of the easiest ways to shrink your XML by about 90% is to use short tag names instead of long ones. You can use a transformation to use the short names or long names on the wire.
XML, as implemented today, is often little more than a thin wrapper for huge gobs of proprietary-format data. Thus, any given XML parser can identify the contents as "a huge gob of proprietary data", but can't do a damned thing with it.
Too many developers have "embraced" XML by simply dumping their data into a handful of CDATA blocks. Other programmers don't want to reveal their data structure, and abuse CDATA in the same way. Thus, a perfectly good data format has been bastardized by legions of lazy/overprotective coders.
The slew of publications that exist for the sole purpose of "clarifying" XML serves as testament to the abuse of XML.
Obliteracy: Words with explosions
Why not simply zip it ?
...
As far as I know, there are programs/libraries for that format on every platform
Intelligence shared is intelligence squared.
DIME attackments.
You are not a beautiful or unique snowflake -- but you could be if you got off your ass.
Don't we already have ASN.1?
A huff transform will give you entropy +1 compression. Not suitable for larger data sets (dictionary based compression is even better for this). 7z compression (or is it z7?) will give you a neat storage format.
u itcake
Lets talk about where this verbose talk of verbosity is stemming from:
apple
orange
pineapple
this is a data set. No one knows what it is.
Here it is again with some pseudo xml style tags
I am listing vegetables here
this is a list of vegetables
vegetables are listed on their own without any children or parent tags, there can be one or more of them, this is version 1 of the document
here now follows a vegetable
tomato
that was a vegetable
here now follows a vegetable
leek
that was a vegetable
here now follows a vegetable
potato
that was a vegetable
here now follows a vegetable
haddock
that was a vegetable
as you can see, this is an (albeit slightly weird looking) list of items called 'vegetables'.
The beauty of XML is twofold: the description of the document format (DTD and schemas) and the ability to verify a document is valid, for any specified format.
XML is a human readable file specification language, and file format, all in one, written in itself!
A binary format of XML would be nice, you can make it yourself though.
veg:http://slashdot.org/veg.xml
v:tomato
v:fr
v:lemongrass
v:cat
this is a minimal way to represent the same xml like structure, in a less verbose way.
This is undeniable complexity. A binary format is just a way of saying: introduce a standard lossless compression format for XML, without changing what XML is.
I say anything that gets the W3C stamp of 'this is official' gets my vote. After all, 1 bad standard is better than 11 good proprietary solutions in a world of millions of interconnected systems.
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
Given XML's predictable syntax and well-formed requirement it should be relatively easy to create a compression scheme taking advantage of XML then combining that with something like gz or bz2, rather than just compressing XML with gz or bz2. It would be like the difference between compressing a wav file with ZIP and with FLAC. Though with XML the difference would likely only be significant with very large files.
Of course anything like this should be endorsed by W3C before being put into wide use.
1) Isn't the greatest benefit of XML that it can be opened in a text editor, and made sense of?
2) Can't webservers and browsers (well, maybe not IE, but then it's not a browser... it's an OS component, haha) transparently compress XML with gzip or some other?
3) Making it binary won't compress it all that much, using a proper compression algo will.
4) Doesn't something like XML, that makes use of latin characters and a few punctuation marks, compress with insane ratios even in lame compression algo's?
5) In a world moving ever closer to ubiquitous broadband, is a difference between a 10kb html file and a 17kb XML file all that fatal? Surely bittorrent and spam does more to suck up all available bandwidth than XML does (what little is out there).
I've had to work with binary XML for formatting WAP push messages and it is the ghastliest thing ever. Yes, I can see that it has low-bandwidth applications but my opinion is that I'd much rather have less bandwidth than have to deal with binary XML :-)
I would suggest that people seeking fast, standard ways to deliver binary data look at SMPTE KLV (key, length, value) coding. It is SMPTE 336M, and is the standard for metadata coding in television, video, and digital cinema.
I totally drank the XML kool-aid, so don't interpret this as saying that I hate XML or anything. I really love it. However, you don't really get an appreciation of just how slow and bloaty XML is until you see it used in real life a few times. I sometimes wonder if these guys have ever built a system on something that wasn't a top-notch research bed.
I'm not seeing in the article where he submits a solution to the problem, he just said as computers and networks get faster, the bloat won't be slow anymore. There's a very good chance I'll be using the same infrastructure in 3 years, so that is a non-solution for me, and I suspect many other people too.
It's pretty clear to me he's out of touch. Everyone is clamoring for problems they have right now, and he wants everyone to wait for universal gigabit ethernet and 10Ghz CPUs.
The XML guys are funny. First make a text version of binary protocols to make them easy to sell to the mass of "31137 HTML PRogrammers" who feel comfortable "programming" in dreamweaver; and then make a binary version to make it work.
When the XML is in text you still need to parse it. Sounds like an easy job if you're just doing it on your home computer. But a server handling thousands of simultaneous transactions can get bogged down parsing text down to binary when it can just get sent in binary to begin with.
MUCH faster. And you don't have the overhead of compression. Sure, gzip/bzip2 will cut down on network overhead, but what about processor overhead?
Roy Fielding, who is developing the Waka protocol, which is binary, argued at ApacheCon 2000 that as long as the protocol is still understood, binary utilities could be made to decode things for debugging. But the other 99.9% of requests would be more important and benefit more from being in binary.
XML transfer protocol.
Ok, we got a name. Now all we need is one fart smella to design it.
Research shows that 67% of those who use the term "research shows", are just making shit up.
Of course not! That's not XML!
<file=xmlbinary> <baseencoding=64> <byte bits=8> <bit1>0 </bit><bit2>1 </bit><bit3>1 </bit><bit4>0 </bit><bit5>1 </bit><bit6>0 </bit><bit7>0 </bit><bit8>1 </bit> </byte>
<boredcomment>(Umm, I'm gonna skip a bit if y'all don't mind)</boredcomment>
</baseencoding> </file>
Now it's XML!
What the world needs now, is binary XML?
Nope, sorry, those lyrics suck. We're gonna stick with Mr. Bacharach's version.
Another improvement the lisp guys noticed decades ago is instead of redundantly putting the name of the tag in the closing tag, you don't need it.
<Name><FirstName>John</FirstName><LastName>Doe</LastName></Name>
vs
<Name><FirstName>John</><LastName>Doe</></>
or better
(Name (FirstName John) (LastName Doe))
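To make the comparison concrete, here is a small sketch that turns the XML form into the s-expression form using Python's xml.etree.ElementTree. The helper name to_sexpr is made up for this example, and it deliberately ignores attributes, namespaces and mixed content, so treat it as an illustration of the idea rather than a real converter.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an element as (Tag text child...), ignoring attributes."""
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        parts.append(text)
    # Children are rendered recursively; no closing tag name is repeated.
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"


xml_text = "<Name><FirstName>John</FirstName><LastName>Doe</LastName></Name>"
sexpr = to_sexpr(ET.fromstring(xml_text))
print(sexpr)  # (Name (FirstName John) (LastName Doe))
print(len(xml_text), len(sexpr))
```

The saving comes purely from not repeating the tag name in the closer; the tree is the same.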
If you gzip the stream, you save bandwidth, but gunzip on the receiver makes the problem worse. However, bandwidth is usually not a concern within clusters. You want to do something with the data you received, right? This takes CPU cycles as well.
What we need is a combination of XML and binary, fixed data streams.
So instead of JPEGs they use something like this? WTF!?
That's what you get when somebody forgets to choose "BIN" in their FTP client and dumps a bunch of XML to a directory, right?
However, if anything, XML has shown us the power of well-structured information. XML has given the possibility of universal interoperability. Developments in XML-based technologies have led us to the point where we know enough now to create a standard for structured information that will last for several decades.
It's time that we had a new ASCII. That standard should be binary XML.
When I think of the time that has been wasted by every developer in the history of Computer Science, writing and rewriting basic parsing code, I shudder. Binary XML would produce a standard such that an efficient, universal data structure language would allow significant advances in what is technically possible with our data. For example: why is what we put on disk any different from what's in memory? Binary XML could erase this distinction.
A binary XML standard needs to become ubiquitous, so that just as Notepad can open any ASCII file today, SuperNotepad could open any file in existence, or look at any portion of your computer's memory, in an informative, structured manner. What's more, we have the technology to do this now.
them most assembly programers can right.
1010
There are better ways to compress XML.
A little understanding about what a particular XML file is supposed to represent can go a long way.
While I do like Bzip2 and Gzip better, zip is open. There are numerous open source compression/decompression libraries for it.
This is all about different companies trying to get THEIR binary format to be the "standard" with XML.
From the article: Images are already binary data. They really don't compress much more (if you've chosen the right format). That means that they will take the same amount of time to download, binary XML format or not.
Yeah, right ! XML binary images... So needed...
of "I told you so!" coming over. Between all the people who jumped on the web services bandwagon without any clue how to handle distributed systems efficiently and the "OMG! It's human readable!" crowd, the architecture du jour has become a bloated PITA. Why this wasn't built into the spec in the first place eludes me. If we can use tools like ethereal to read those binary IP datagrams, why wouldn't the same concept be used for this standard? A standardized, compressed data format with a standardized API for outputting plaintext (XML) would have allowed this system to be much more efficient.
Didn't anyone remember that text processing was bulky and expensive? Sometimes the tech community seems to share the same uncritical mind as people who order get-rich-quick schemes off late night infomercials. I doubt XML would have gotten out of the gate as is, had the community demanded these kinds of features from the get-go.
Arrogance is Confidence which lacks integrity. -- me
just gzip, and proceed as before. it would require only minimal changes in the worst case and none at all in the best case. isn't this how OpenOffice works?
Your CPU is not doing anything else, at least do something.
http://news.com.com/5208-7345-0.html?forumID=1&threadID=4163&messageID=23888&start=-1
The design goals for XML are:
1. XML shall be straightforwardly usable over the Internet.
Grade: A
2. XML shall support a wide variety of applications.
Grade: B
3. XML shall be compatible with SGML.
Grade: don't know / don't care
4. It shall be easy to write programs which process XML documents.
Grade: F
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
Grade: F
6. XML documents should be human-legible and reasonably clear.
Grade: F
7. The XML design should be prepared quickly.
Grade: F
8. The design of XML shall be formal and concise.
Grade: C
9. XML documents shall be easy to create.
Grade: C
10. Terseness in XML markup is of minimal importance.
Grade: A+
SOAP is an excellent technology but it's SLOW. Servers get bogged down doing string processing, and when you are handling thousands of requests per minute it's a big problem. Adding a gzip/gunzip into the mix would make it slower still.
As it happens, most soap requests are NOT human readable. Sure i can sit and figure one out, but unless it's a trivial example, trying to decipher it isn't easy.
A standard binary xml format would allow a standard binary soap variant. Debuggers could handle bsoap->soap translation and everything would get quite a speed boost.
My argument would be that if it's not standardized then people will develop non-standards-compliant implementations, which is definitely a bad thing.
Clearly these fundamental tenets have escaped the People In Charge in many places, who are now discovering that their brilliant ideas to represent images, large databases, etc, as XML were, in fact, fucking stupid ideas.
Enter binary XML. This lets the People In Charge save face by saying that the system still uses XML, so they must have been right when they designed/required it to use XML in the first place. And now it's 50% faster!
The article has an excellent point, there will be compatibility problems and we'll "degrade" to different binary XML formats - each best suited to a particular niche. That's exactly where the world was before XML came along - data formats designed for (and reasonably appropriate for) particular applications. Those formats are invariably more efficient than XML, and are often simpler and easier to parse than XML. Binary XML attempts to combine those old-fashioned file formats with XML, resulting in a system that's more bloated (and slower) than the old way but not quite as bad as XML in its current form. So now we've come full circle, except that we've added an extra layer of bloat to something that worked well enough to begin with. Congratulations, Mr. Binary XML person! You fail it!
While I'm at it: If network bandwidth is really the bottleneck, use zlib. XML's best feature is that it compresses really well.
One approach might be to provide an application layer PEP, to transcode the text into binary. Then the impoverished clients can have their binary and the rest of the world can have their text. It could be at the edge of the wireless network, or at the server.
In theory, theory and practice are the same. In practice, they're not.
step one in dealing with the speed issue is to jettison the various slow parsers like SAX - you can get competitive with native serialization and retain the text advantages of xml. see frex http://sourceforge.net/projects/javadata/
i like the comments about the binary->xml->binary full circle. reminds me of how the original ethernet evolved from a coax bus to a point to point switched network.. ether in name only.
There must be 50 posts already saying basically:
"Just compress the XML, duh!"
Compression is already in use on most servers, assuming the clients send the appropriate accept headers. The perceived sluggishness of XML is partly caused by the fact that the XML must be generated by the sender and then parsed by the receiver. Numeric values have to be converted from binary to ASCII-coded-decimal and back, strings have to be embedded and extracted, etc. I think this is the type of inefficiency that these people are trying to prevent.
That said, I think binary XML is a terrible idea. Keep using gzip on HTTP transfers and the technology will catch up shortly.
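The numeric conversion cost mentioned above is easy to see in a few lines. This is only a sketch with one made-up value: the text route formats a double as decimal digits and parses it back, while the binary route uses Python's struct module to write the same value as 8 bytes in an explicitly big-endian layout. The explicit byte order is also why a binary format need not depend on the CPU that wrote it.

```python
import struct

value = 3.141592653589793

# Text route, as in XML: format to decimal digits, then parse them back.
as_text = repr(value).encode("ascii")
from_text = float(as_text.decode("ascii"))

# Binary route: an 8-byte IEEE 754 double in big-endian ("network") order.
as_binary = struct.pack(">d", value)
(from_binary,) = struct.unpack(">d", as_binary)

# Size in bytes of each representation.
print(len(as_text), len(as_binary))
```

Both routes give back the same value here; the difference is the work done per number, which only matters once you have a lot of numbers.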
<?xml version="1.0"?>
<flamebait audience="geeks">
The people complaining about XML being slow are probably using Java anyway. They should be used to their software being sluggish.
</flamebait>
What about using YAML
Stop trying to use XML for inappropriate situations (where large data volumes, good performance are requirements).
DUH!
I think that's where the true problem lies. HTTP.
We need to look towards http 2.0. What I would want:
- pipelining that works, so that it could be enabled for use on any server that supports http 2.0
- gzip and 7zip support.
- All data is compressed by default (a few excludes such as .gz files, .zip files etc. since that would be pointless).
- Option to initiate persistent connection (remove the stateless protocol concept), via a http header on connect. This would allow for a whole new level for web applications via SOAP/XML.
There are tons of other things that could be enhanced for today's uses.
HTTP is the problem. Not XML
"Binary XML would destroy what makes xmal powerful: being able to use vi or emacs to understand its content, no fuss, no adobe reader like software, no nothing."
Do you know of any format that doesn't require a piece of software as an intermediate between the user and the machine?
As mentioned elsewhere in this thread, it is already possible to use zip compression at transportation.
But there are reasons why XML-specific encoding has chances to be far more efficient. Consider this:
<hello></hello>
For anyone familiar with XML, it translates into:
<hello/>
The '<', '>', and '/' represent the "empty element" aspect of the XML code, and that seems like an overkill. Think of way to represent the notion of "empty element". I'm sure that if all notions of XML were listed, you wouldn't need a lot of bits to uniquely code each of them.
Already, without any statistical compression, we've saved many bytes in my example.
Another advantage of being language-specific is that, knowing the weaknesses of the language, the binary format can make smart use of redundancy. (Such as: I'd rather lose comments than useful code -- may the comments be coded in the binary-XML)
It's a markup language, it's not supposed to be ideal for general purpose data transfer.
People should stop trying to optimize it for a task it wasn't designed for. Focus on making XML better for markup, and for pity's sake come up with something else that's concise and simple and efficient for general purpose use.
Whence? Hence. Whither? Thither.
So I suppose to call this 'FXML' ?
[self dealloc];
Binary XML would provide a much better transport for binary data when compared to Base64 or even something like QuotedPrintable. Using extra layers on top of XML just to transport binary data is a waste of resources. What we need are fewer but more powerful standards. Binary XML will do JUST that.
Q: Does the World Need Binary XML?
A: 0
One man's Funny is another man's Offtopic.
form Re: Lisp syntax, what about resynchronization?
Attributes in XML are inherited from SGML, whose designers were thinking of markup for textual documents. When you want to represent data, whether something is an attribute or not is completely irrelevant.
Deep explanation: From:The horror that is XML
Dyslexics have more fnu.
don't screw up XML because people architect their applications poorly. i've worked on a few applications that use web services only because they *can* not because they should, then people complain about performance, even though we said "using web services will give you a 40% performance hit".
Didn't we just get done talking about the problem with assuming these things will clear up with faster tech? I was surprised to read this from Bray.
Is something burning?
Oh, it's my karma.
Aside from the mistakes pointed out by others, you also forgot to reference the xmlbinary namespace, the xmlbyte namespace, and the xmlboredcommentinparentheses namespace, and to qualify all attributes accordingly. You also didn't include anything in or any magic words like CDATA, and you didn't define any entities. You also failed to supply a DTD and an XSL schema.
This is therefore still not _true_ XML. It simply doesn't have enough inefficiency. Please add crap to it
Whence? Hence. Whither? Thither.
So, does an XML SOAP message encrypted using WSSEC constitute binary XML? If the answer is "Yes", then how would a world w/out binary XML enable encryption? If the answer is "no", then what constitutes binary XML? What about XML wrapped in SSL?
Opaque doesn't always mean proprietary.
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
I thought one of the major points of XML was to keep it ASCII so that it is platform neutral? If you go to binary, then you have to perform byte reversal of the binary XML message if different types of CPU's are involved. We would be right back where we were.
I know the other args: common format, blah, blah. However XML, as Microsoft has proved, can be made very proprietary in the blink of an eye.
I thought about this when XML was first gathering steam. Like everything else it's all about marketing and not about thinking.
XML has some things going for it -- as a markup language for primarily text data (eg web pages) it works fairly well.
At a high level, XML is *CONCEPTUALLY* a great idea. I like the DOM programming model -- it is very expressive, and yet even complicated data tends to be understandable when represented as a DOM tree. Unfortunately, the basic text-XML representation that everybody uses is a terrible wire format from an efficiency and ease-of-programming perspective.
The real problem with XML is the massive inefficiency at the lower levels. XML is easily 2x-5x less efficient than comparable wire formats. For example, I once worked on a project for an Instant Messaging server which used XML to communicate. I abstracted out the very lowest-level protocol layers so that they used simple XML token compression and attribute-name compression... the result was a full 400% increase in throughput through the server! This is primarily because the processor has so much less data to process (fewer string comparisons, string copies, etc.) and therefore the memory bandwidth requirements are significantly lower.
Complexity of parsing is an issue as well. Writing a complete XML parser is full of subtleties and surprisingly difficult -- don't jump in and say otherwise unless you've actually done it. If you don't believe me, go look at how complicated something like the expat source or the dom4j source is.
A primary XML design goal (go read the XML designer's notes) was ease of human reading: this comes at the expense of efficient machine reading. Because data is not length-prefixed, and is of arbitrary length, there are massive inefficiencies in buffering, which leads to a lot of copying as you parse. There are never any "hints" in the protocol about what is coming up, and so parsers are forced to buffer things for an arbitrary amount of time (looking for that closing /> for example) and end up using a lot of memory and doing a lot of buffer expanding, or complex buffer-chaining stuff.
Additionally, a text-based representation like XML is extremely inefficient for binary data. Having to parse through all of your data and escape/unescape special characters is yet another big performance hit.
A standardized and fully-supported binary XML representation would have a huge impact on the performance of things we use every day -- and it could all happen at the low levels without even touching app-level code.
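The length-prefix point can be sketched in a few lines. This is not any proposed binary XML format, just a generic framing scheme assumed for illustration (a 4-byte big-endian length followed by the payload), with made-up helper names write_frame and read_frame. The reader knows up front how many bytes to read, so it never scans for a terminator, and the payload needs no escaping.

```python
import io
import struct


def write_frame(stream, payload: bytes) -> None:
    """Write a 4-byte big-endian length, then the raw payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_frame(stream) -> bytes:
    """Read one frame; the header says exactly how many bytes follow."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("no more frames")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated frame")
    return payload


buf = io.BytesIO()
write_frame(buf, b"tomato")
write_frame(buf, b"leek <not> & a tag")  # no escaping of < > & needed
buf.seek(0)
first = read_frame(buf)
second = read_frame(buf)
print(first, second)
```

The trade-off is the one argued all over this thread: you can no longer read the stream in vi without a decoder.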
Here's why:
1. As noted in the article, there are other ways of solving the problem:
a. XML parsing by ASICs in dedicated XML processing hardware.
b. Moore's Law.
2. XML is successful specifically because it's text based and a standard. Just as compiled languages are slower than assembly, and managed code is slower than compiled code, the benefits of text based information are worth the cost.
3. I'm not sure the problem even exists. I've spent the last 3 years specializing in SOAP Web Services, and you know what? None of my (very big) clients actually has a problem with too much XML on the network. They just anticipate having this problem in the future; see point 1.
4. This one's a stretch, and I'm not sure I'm comfortable with it yet, but... If a system is self-contained, even if distributed, then I don't see the value in using XML for communicating between processes. You might as well use the native RPC mechanism, such as RMI for Java apps. If a system is not self-contained, then XML should be used for just the interfaces exposed to the outside world. Internal communication should remain native. In other words, a lot of XML on the network is completely unnecessary.
If a binary XML file is semantically equivalent to its text counterpart and you have good tools to convert between the two, binary XML would be much like lossless (possibly minus beautification) compression for XML. Yet, if done right, it could speed up how long it takes to process XML files instead of slow it down as something like gzip would do.
The real problem with XML is that it adds the extra verbosity of the metadata text tag for EACH INSTANCE of a piece of data even in cases where that metadata is identical for row after row of data. In the case of table data, that is really stupid. There should be some sort of XML means to handle a table of values better. A way to say "Column 1 has the following XML properties: name, etc", then "Column 2 has the following XML properties: name, etc".... and then after that section, a way to syntactically list just the values up until the end of the loop.
This is what made us balk at using XML for storing NMR spectroscopy data, even though it is already in a textual form to begin with. The current textual form is whitespace-separated, little short numbers less than 5 digits long, for hundreds of thousands of rows. That isn't really that big in ascii form. But turn it into XML, and a 1 meg ascii file turns into a 150 meg XML file because of the extra repetitive tag stuff.
In another bit of irony, we can't find an in-memory representation of the data as a table which is more compact than the ascii file is. The original ascii file is even more compact than a 2-D array in RAM. (because it takes 4 bytes to store an int even when that int is typically just one digit and is only larger on rare occasions.)
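Here is a rough sketch of the three representations being compared, on a made-up table of small integers. The column names intensity and shift and the row layout are hypothetical, not taken from any real NMR format, and the sizes it prints apply only to this toy data.

```python
import struct

# Hypothetical table: 1000 rows of two small integers.
rows = [(i % 7, i % 13) for i in range(1000)]

# XML style: one pair of tags around every single value, on every row.
tagged = "".join(
    f"<row><intensity>{a}</intensity><shift>{b}</shift></row>" for a, b in rows
)

# Whitespace-separated text with the column names stated once at the top.
plain = "intensity shift\n" + "\n".join(f"{a} {b}" for a, b in rows)

# Fixed-width binary: two 4-byte little-endian ints per row.
packed = b"".join(struct.pack("<2i", a, b) for a, b in rows)

# Sizes of the tagged, plain-text and packed forms.
print(len(tagged), len(plain), len(packed))
```

With values this small, the plain text beats the 4-byte ints, which matches the observation above; a variable-length integer encoding would change that picture.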
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
I think that Binary XML misses the mark on bringing any real benefits other than transmission compression. XML can be a huge benefit from a human and coding perspective, but it also has drawbacks in transmission (due to size) and in processing (again due to size). A lot of XML data goes thru many different processing systems that never need descriptive tags and the overhead of document size can bog down some very large computers.
I know, try to do an XSLT on a 60 meg file.
One approach that could potentially benefit everyone is to have interchangable namespaces. By that I mean have a human readable namespace that also had a machine friendly name space.
In the Human version you could have those wonderful long tags like [FirstNameOfMyGrandmothersThirdCousin] and have a transform that would make that [ID1001] for machine processing.
You can save a ton of space by swapping out all of the Element and Attribute names while holding structure, allowing machines to process more efficiently, and then if a human or UI needs descriptive information, you could go grab the friendly Namespace and be back to your large XML file.
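A minimal sketch of that swap, using Python's xml.etree.ElementTree. The mapping table and the Family wrapper element are invented for the example, and only element names are handled (attribute names and real XML namespaces are left out), so this shows the shape of the idea rather than a full transform.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping between human-readable and machine-friendly names.
LONG_TO_SHORT = {
    "Family": "ID1000",
    "FirstNameOfMyGrandmothersThirdCousin": "ID1001",
}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def rename_tags(xml_text: str, mapping: dict) -> str:
    """Return the document with element names replaced via the mapping."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Names missing from the mapping are left unchanged.
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


long_doc = (
    "<Family><FirstNameOfMyGrandmothersThirdCousin>Ann"
    "</FirstNameOfMyGrandmothersThirdCousin></Family>"
)
short_doc = rename_tags(long_doc, LONG_TO_SHORT)
restored = rename_tags(short_doc, SHORT_TO_LONG)
print(short_doc)
```

Both ends need the same mapping, which is the part that would have to be standardized or shipped alongside the data.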
The problem is that not everything in a typical XML message is text, so there can be a lot of translation going on between XML text and the binary format that an application needs (e.g., double). In our tests we've found XML to be 100x - 250x SLOWER than other approaches (e.g., JMS MapMessage). (FWIW, the 100x is using the MS parser, the 250x is with Xerces/Xalan). For high-volume, high-performance apps that's simply intolerable. Note that this has nothing to do with size on the wire, which is another consideration entirely.
Binary XML is problematic in that there will be competing standards and, in the end, the end user/client will need multiple XML decompressors in order to read the various XML formats that come down the pipe.
I think what needs to happen is the XML (or html or any data for that matter) needs to be compressed as part of the TCP standard. (I can't believe this isn't happening already). XML as viewed by the server and client is uncompressed (and can be edited by any text editor). XML as viewed by the Internet is tightly compressed.
Does the world need XML when there are s-expressions?
See www.json.org. I use this everywhere. (Although I use a slightly different version, 'ron', Ruby object notation.)
It doesn't tell us what the specific performance problems are with XML. Does it take too long to transmit? Does it take too long to validate? Does it take too long to parse? Does it take too long to format? What's the real problem here?
From experience, I can state that using XML in any high performance situation is easy to screw up. But once you get past the basic mistakes at that level, what other inherent problems are there?
Oh, and just stating "well, the format is obviously wasteful" just because it's human readable (one of its primary, most useful, features) is NOT an answer.
I get the feeling that this perception of XML is being perpetuated by vendors who do not really want to open up their data formats. Allowing them to successfully propagate this impression would be a very real step backwards for all IT professionals.
Please mod this post only if you think others should/n't read this. I have enough ego^H^H^Hkarma. Thanks!
The most relevant thing about xml is that it is a standard for representing structured information. A problem with the current representation is that it requires a relatively large amount of cpu horse power and bandwidth to process and transport xml. This is the price we pay for something we do not really need in cases where xml is used for communication between two programs (i.e. most cases).
Both problems are easy to solve. WBXML, for example, is a binary XML industry standard that is used in the mobile domain. It's simply a more efficient way of storing the same information. It uses small binary tokens for tags and attribute names and doesn't include comments. This makes the parsing process simpler and faster. In addition, WBXML wastes fewer bytes, especially when combined with compression. Parsers simply parse the data to the usual internal data structures with the usual APIs (DOM, SAX, etc.), so at the application level there is no difference.
The main problem with WBXML is non-technical: it is not a widely adopted standard. What is needed is a W3C-endorsed standard similar to WBXML, with support in the form of DOM and SAX parsers and conversion tools for all major platforms, and a MIME type like binary/xml.
Then stuff like soap, rss etc. can be served up in binary form to applications that can handle binary/xml and text/xml to other applications. The performance gains for soap heavy applications are probably considerable.
The data itself does not change, it's just being transmitted more efficiently so there are no consequences for applications that use the dom/sax apis or higher level apis based on those. Also there's no reason for stuff like xsl processors, schema validators etc to stop working just because the xml data is handed to them as binary/xml instead of text/xml. Of course all applications that spit out hand coded xml need to feed their output through some conversion filter (if binary/xml is required).
IMHO it would also be a great way to serve up XHTML to browsers.
Jilles
I've been working on a free RTS game (0 A.D.) that uses XML for storing most of its data, with Xerces to load those files. It seemed a little slow, so I made a simple binary XML format which eliminates the parsing step and just loads the data directly into memory.
Loading a simple XML file (a couple of hundred bytes of data, plus another few hundred for a DTD), Xerces took about 10ms. The binary format took about 1ms. For a larger file, Xerces took 160ms while Xeromyces (the binary version) took 80ms (of which most was spent in other bits of code, handling the data that it read).
When there are hundreds or thousands of such files that need to be loaded before the game can start, speed is critical, and XML by itself is just too slow. Our implementation of the binary format retains the advantages of XML (such as... erm... I'm sure there must be some), since you can always just edit the original XML, and it updates the cached binary version whenever it needs to; but it greatly reduces the performance problems that are inherent in parsing a text file. So, if you're loading lots of fairly static data files, binary XML is definitely a worthwhile thing to implement.
Dude! Wake up! How often do you open an XML-RPC packet trace with your morning coffee and think 'Gosh, how cool, it's in a readable bloated text format and I don't need to parse it with Ethereal!'
:-/
Seriously, the only time readability is needed is when you edit an XML web page with Notepad. Otherwise it's a brain-dead technology that first got popular among scripting developers, who are notoriously afraid of anything binary, and then it got pushed into areas where it didn't belong.
Unfortunately, the majority of XML zealots are plain ignorant. Had they taken the time to learn what byte ordering and TLV encoding mean, we probably would not have this XML craze now.
Don't get me wrong, XML has its place. But it is next to HTML, and not next to RPC or databases!
3.243F6A8885A308D313
Yes, obviously binary == proprietary, sort of. But that isn't the problem with binary XML. A binary data stream of any sort has one glaring usability problem: it is not transparent, but rather opaque, I guess. What if I have data lying around from one of your apps that I have to recover? I could use your app, a heavyweight tool, to find the data I want, or if I don't have it I might have to reverse engineer the format. If you'd used a text format from the start I could have used my favorite editor or grep.
:)
Why would you say that text is a sloppy way to "define something"? I find that statement ridiculous, especially when you say that it is only OK for memory hogs. If you ask me, the transparency of a text stream far outweighs any cost in performance. But the truth is, there isn't much cost anyway. If you can develop a good parser (not that hard), the cost difference is negligible, if any. Now, this isn't true for all cases. For example, it would probably be silly to use a text format for a large, high-traffic database like Postgres or MySQL. But for most anything else, there isn't any reason to use binary data formats, unless you want to keep something from your users, or at least most of them.
Just get a faster computer.
So is BZIP2:
ASN.1
Remember that text is human readable only because you have a text editor. Any binary format can be human readable if you have a definition file and an editor that can show the data using it. If such a tool is made available on all platforms, all binary formats will become as easy to read and edit in raw form as text is.
They are not easy-to-process chunks.
For any given XML file it isn't clear that you can process it easily in a small amount of memory. It is unclear it can be processed quickly (easily) at all.
XML is a very poor format. You are forced to read large chunks of it character by character.
Binary XML, if done correctly might help with that. But really, for any kind of speed, you need a file format with well-defined chunk lengths so you can read in entire areas of the file at once, not read a character at a time looking for CRs and close tags.
<answer type="emphatic">No</answer>
I suggest IFF. I've been working on an open spec for 3d content based on XML as a hobby project since VRML dropped the ball, and at the same time working on a parallel IFF chunk hierarchy based spec.
This lets me have text XML when I need human readable, and lets me have quickly parsed binary data when I need that - best of both worlds, and trivial to translate between. From experimenting with all the ways to do this, I've found XML/IFF chunking to be a clean map...
An older version of the spec is at http://www.vscape.com/vml/index.html, if anyone wants an example of how this could work.
Three ideas, in order of increasing significance and increasing difficulty:
Stop using bad DTDs. There seems to be a DTD style in which you avoid using attributes and instead add a whole lot of tags containing text. Any element with a content type of CDATA should be an attribute on its parent, which improves the readability of documents and lets you use ID/IDREF to automatically check stuff. Once you get rid of the complete cruft, it's not nearly so bad.
Now that everything other than HTML is generally valid XML, it's possible to get rid of a lot of the verbosity of XML, too. A new XML could make all close tags "</", since the name of the element you're closing is predetermined and there's nothing permitted after a slash other than a >. The > could be dropped from empty tags, too. If you know that your DTD will be available and not change during the life of the document, you could use numeric references in open tags to refer to the indexed child element type of the type of the element you're in, and numeric references for the indexed attribute of the element it's on. If you then drop the spaces after close quotes, you've basically removed all of the superfluous size of XML without using a binary format, as well as making string comparisons unnecessary in the parser.
Of course, you could document it as if it were binary. An open tag is indicated with an 0x3C, followed by the index of the element type plus 0x30 (for indices under 0xA). A close tag is (big-endian) 0x3C2F. A non-close tag is an open tag if it ends with an 0x3E and an empty tag if it ends with an 0x2F. Attribute indices are followed with an 0x3D. And so forth.
How about compressing xml with zip?
I use XML quite extensively and though I love it for my purposes, I do have to admit it slows things down. Some time ago, this became very apparent when I was making an XSLT stylesheet which included about 8 other XSLT sheets that were sent to the user's browser, which converted my custom XML schema to some decent XHTML and XUL code. It became _slow_, to say the least.
So, if you ask me, I'm all for a binary XML *standard* which is then supported by browsers, life, the universe and everything. One thing I do ask of them is to make it possible to easily convert text XML files to binary XML files and vice versa (thus not losing the original tag names and such). As long as they stick to that, I'm all for it!
Let's try using intelligent compression ... just a thought, but why not use a dictionary compression system for compressing tags as they occur in the output, so that for transmissions with fewer than 255 tags there would be single-byte tagging in the document.
Building such a decompression scheme into a SAX parser seems mind-numbingly simple as well, and even faster if the parser were run-time optimizable.
Actual content (between tags) could be compressed using any system of course, with a proper marker at the beginning to specify which method was used (or none).
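A rough sketch of such a tag dictionary, assuming a made-up wire layout: the first time a tag name appears it is sent as an escape byte 0xFF, a length byte and the name; every later occurrence is a single index byte. The function names and layout are illustrative, not any existing format.

```python
def encode_tags(names):
    """Encode a sequence of tag names; repeats cost one byte each."""
    table = {}
    out = bytearray()
    for name in names:
        if name in table:
            out.append(table[name])          # known tag: one index byte
            continue
        raw = name.encode("utf-8")
        if len(table) >= 255 or len(raw) > 255:
            raise ValueError("too many distinct tags or tag name too long")
        table[name] = len(table)             # indices 0..254; 0xFF is the escape
        out += bytes([0xFF, len(raw)]) + raw
    return bytes(out)

def decode_tags(data):
    """Rebuild the tag sequence, learning the dictionary as it goes."""
    table, names, pos = [], [], 0
    while pos < len(data):
        if data[pos] == 0xFF:                # new tag definition
            length = data[pos + 1]
            name = data[pos + 2:pos + 2 + length].decode("utf-8")
            table.append(name)
            names.append(name)
            pos += 2 + length
        else:                                # reference to an earlier tag
            names.append(table[data[pos]])
            pos += 1
    return names

tags = ["item", "name", "name", "item", "price", "item"]
encoded = encode_tags(tags)
```

The decoder needs no table up front, which is what would make it easy to bolt onto a streaming parser: both sides build the same dictionary in the same order.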
- Michael T. Babcock (Yes, I blog)
What really makes sense, IMO, is something I wrote a couple articles about several years back. Namely, rather than define a custom binary format that every tool needs to understand, simply perform a reversible transformation on XML (i.e. compression) for the storage and transmission steps. So the XML writing application and the XML reading application have no need to know anything about the compression. That's all pipelined, in a way invisible to the ends.
Of course, standard tools like 'gzip' can do exactly this. But it's also possible to take advantage of the inherent structure and redundancy of XML to get far better compression ratios. But the concept isn't really much different. See these (and also the longer articles for Intel that they link to, but the basics are in the below):
http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
http://www-106.ibm.com/developerworks/xml/library/x-matters19/
Actually, you can find better formatted versions of the Intel versions at:
http://gnosis.cx/publish/tech_index_ids.html
In any case, the concept is the same... (losslessly) futzing with structure can help out standard compression like gzip. But there's no need to build binary or different semantics into the heart of XML.
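A minimal sketch of the plain gzip version of that pipeline, assuming Python's standard gzip module and an invented sample document. The writer and reader only ever see text; pack and unpack are hypothetical names for the transport-side steps.

```python
import gzip

def pack(xml_text: str) -> bytes:
    """Compress XML text for storage or transmission."""
    return gzip.compress(xml_text.encode("utf-8"))

def unpack(blob: bytes) -> str:
    """Recover exactly the XML text that was packed."""
    return gzip.decompress(blob).decode("utf-8")

# Invented, highly repetitive sample document.
doc = ("<orders>"
       + "<order><id>1</id><status>shipped</status></order>" * 200
       + "</orders>")
blob = pack(doc)
print(len(doc), "bytes as text,", len(blob), "bytes compressed")
```

The structure-aware tricks described in the articles would sit in front of the compression step; the round trip stays lossless either way.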
Buy Text Processing in Python
Well, yeah. My point was that the original poster was wrong to think that ZIP wasn't.
"If I were world dictator, I'd put a kibosh on binary XML"
a 'kibosh'? Is that like a death sentence? A reward? a what?
What language is he talking?
In the free world the media isn't government run; the government is media run.
In the end it's all about the application. If you're using XML to describe an entire website, for example, then you can compress it in whatever way you want (remember, no matter how long tag names are or how much they are repeated, a good compression system will see this redundancy), and if it's done right, you can even process it while it's compressed! (I'm looking at you, Huffman!) Yes, I did RTFA. The point is that XML isn't about this layer; it's about the overall way of storing data in its most natural form, which isn't going to be the smallest way. XML is supposed to be big and wasteful of memory. It's like maths: it doesn't care about the logistics.
Obviously people using it do care about the logistics, and there are going to be cases where you don't know in advance if something can handle a particular compression or binary format. Hence you need a way to tell the other system what you are trying to send or what you can send, e.g. an XML exchange format (in _raw_ XML) which basically says 'this stream of bytes is an XML file compressed in gz format', and a way for two machines to negotiate: the first says 'I understand these formats' and the second sends it in the best understood format/compression scheme. That's almost certainly been done already; in fact I know SVG browsers are meant to be able to accept gzipped SVG, for example.
Technically, any given XML stream is already in a binary format: you have to know how each character is stored before you can read it...
This comment does not represent the views or opinions of the user.
If you ask me, the transparency of a text stream far outweighs any cost in performance.
It far outweighs it, huh? I guess you have never heard of a large segment of the computing world referred to as embedded systems.
If you can develop a good parser (not that hard), the cost difference is negligible, if any.
This is simply untrue. Development of a good parser is easy, but it's added bloat that isn't negligible for many computing devices outside of the PC/server realm. Not to mention the added network traffic that uncompressed text yields (embedded devices don't always have the fastest I/O). Some say that the solution to reducing the network overhead of XML is compression. Compression takes CPU power, another thing lacking in many embedded devices.
My point is that there are actually a lot of applications where XML is just not well suited.
ZapThink, a research firm specializing in XML and Web services, echoed concerns over binary XML, notably the possibility of proprietary implementations. ZapThink analysts also noted that an XML message can touch several different pieces of software and hardware, such as security systems, all of which would have to support any binary XML standard.
I think this is totally unfounded for two reasons:
1. Proprietary binary versions of XML will be created anyway if needed, you really can't get around that.
2. The need for binary versions of xml is in the need for faster transmission. On the receiving end you could translate to text format and then pass this text version to your other applications, so no you would not have to have binary XML support in every application that supports XML.
But this brings up a valid point, we already have compression formats that we can use for transmission over pretty much any format, do we need to incorporate binary transmission of data directly into web services? Or should those that are in need of better performance simply wrap up their large datasets inside XML payloads and use the current format?
Not to start a flame war or something, but when I was looking into SOAP and XML-RPC, I came across this newsgroup post by Michi Henning (co-author of Advanced CORBA Programming in C++) that makes me really, really think about using XML as an RPC mechanism.
I like using XML and all, but reverting back to a "binary" XML format for RPC is like going back to CORBA and COM. It just does not make sense! XML has its uses and I really do not think RPC is one of them, IMHO.
Coderz 4 Life
Ehh...... are they encoding images in xml? Is the reporter just typically wrong in the techno babble translation, or are people really doing that?
WWJD? JWRTFA!
The main problem with XML is:
* verbosity. Those tags take up space, and for small amounts of data the tag volume is larger than the actual data. The verbosity also causes problems on smaller devices with less available memory and bandwidth.
* parsing. String parsing is expensive compared to binary parsing. It's easier to parse through a TIFF file than it is to parse through a small XML document.
The human-readable aspect is nice, but with a good editor you don't need human-readable tags. You need well-defined tags.
For well-defined DTDs why use text at all? Substitute binary for the tags, and provide a binary->text mapping. Suddenly editors will appear that automatically display text tags, but save as binary tags.
Human readability is nice, but as someone else has asked, how often do you really read XML? When I sniff packets, my sniffer decodes everything for me. I could decode the packet headers myself, but why*? That tedious stuff is what software is for.
BinaryXML as an alternate representation of XML would be welcome. It'd complicate matters for existing parsers, though.
You could also unofficially do it by sticking a textXML->binaryXML translator on both ends of your pipe. That would take care of the small device problem, sort of.
* note: I tend to end up decoding the packet payload anyway, but that's because I'm too lazy to write a plugin to decode it for me.
From a communications standpoint, this seems to me to be shortsighted. Bandwidth is getting cheaper even faster than storage, and storage is getting cheaper faster than processing. Compression is a solution to an old problem; one that is rapidly going away.
The real problems (IMHO) are the lack of fine grained security, and the hierarchical (tree) structure that is usually imposed on relational (networked meanings) data. The value of data is often proportional to how many connections it has, and how well we can protect it.
Let's look at it in terms of old technology. There once were slow modems. Let's not go too slow, but let's say 2400bps. These were dialup modems. When we used to connect to the old-world BBS, we would use download protocols to transfer files. One such protocol was Z-modem. Now, the last time I checked, Z-modem could compress data over that 2400bps modem, resulting in speeds often double that of 2400bps. Move ahead to Ethernet: no compression at the network layer. Simple solution: add compression protocols to web servers and other servers that support XML. I know that web servers and clients already zip some stuff up for communications. Why not all of it? Add the compression at the client/server level and be done with it. In the old days, if you wanted to use Z-modem to make things faster, it had to be set up on both ends. Same thing today, just with a different protocol.
Regards,
Ryan Pritchard
Fun Extends All Basic Life Expectancies
"Please remember that not all XML data is transmitted by HTTP however (thank god)."
I believe that Jabber is an example of that.
What is slightly annoying about IFF though is that it is based on the 68000 chip so you're supposed to align stuff to 16 bits and put the bytes around backwards. Naturally Microsoft ignored those parts of the spec when they wrote .wav files.
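For readers who haven't met the format, here is a small sketch of the two quirks mentioned: chunk sizes are stored big-endian, and odd-sized chunks are padded to a 16-bit boundary. It assumes the usual IFF layout of a 4-byte ID followed by a 4-byte length; the chunk names and helper functions are invented for the example.

```python
import struct

def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one chunk: 4-byte ID, big-endian 32-bit size, payload, pad byte."""
    pad = b"\x00" if len(payload) % 2 else b""
    return struct.pack(">4sI", chunk_id, len(payload)) + payload + pad

def iter_chunks(data: bytes):
    """Yield (chunk_id, payload) pairs from a flat run of chunks."""
    pos = 0
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from(">4sI", data, pos)
        start = pos + 8
        yield chunk_id, data[start:start + size]
        pos = start + size + (size & 1)  # odd-sized chunks carry one pad byte

data = make_chunk(b"NAME", b"abc") + make_chunk(b"BODY", b"wxyz")
chunks = list(iter_chunks(data))
```

A reader that only wants BODY never has to look inside NAME; it reads the 8-byte header and jumps.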
At the very least, the XML closing tag name should be nixed - it serves no purpose whatsoever and only wastes bytes. "</>" is every bit as expressive as "</tag>". Instant 30% reduction in XML file size.
Had data to be delivered to a client, dumped from a database. As flat files they were ~20 MB in size. That bloated to ~120 MB after conversion to XML.
Client attempted to open in a DOM based application which I suspect used recursion to parse the data (easy to code, recursion). Needless to say it brought their server to its knees.
We switched to flat files shortly thereafter.
In my problem domain, where 20 MB is a small data set, XML is useless. XML does not seem to scale well at all (though using a SAX parser helps at times).
YMMV.
putting the 'B' in LGBTQ+
I think the point of this is, you'd like to have a random access version of XML. Right now there's no way to say, seek to the next sibling node without reading all the intervening characters. DOM and other higher-level API's hide that fact, but it's still there.
"I compare [open source vs. non-open source] to science vs. witchcraft." linus
And, of course, decompressing a 10 meg file into a 150 meg file and processing that first into an enormous tree with a bloated XML-parser, then accessing that through a complex object-oriented interface, will cause no performance hit over just processing a 1 meg file directly.
XML sucks.
It seems to me that the problem isn't with XML, it's with what people are using it for. I read some complaints here from people saying "I tried to use XML for BLAH and it was too slow." However, if they'd thought about it, BLAH would have been better served by some binary format in the first place. The article also discusses the fact that mobile devices need something less cumbersome for transferring pictures/media. Why are they using XML for that at all? One of the benefits of XML is that it's human readable, but in those applications you don't need that benefit, so don't use XML. Instead of coming up with a binary XML standard, come up with a generic binary standard that does exactly what you want. Too many people have been given the hammer of XML and now everything looks like a nail.
First off: ASN.1 (X.680) is not a fringe technology and it is alive and kicking. ASN.1 is dead == BSD is dead. In fact ASN.1 and the binary wire presentations (CER/DER/PER/XER) are at the core of many important services we use daily including but not limited to:
PKIX / X.509 / PKCS (Public Key Cryptography)
Kerberos authentication
SNMP / CMIP
X.500 LDAP / DAP directory services
X.400 messaging
Voice over IP: H.323 T.38
The 3GPP specifications (GSM / UMTS mobile phones)
OSI layer 7 protocols (FTAM.. etc.)
RFID
In comparison to XML, ASN.1 is a huge bandwidth saver, in fact the PER (Packed Encoding Rules) were designed for saving bandwidth. There is even a way for encoding data in XML using the XER (XML Encoding Rules) specification.
Last but not least there is finally a worthwhile opensource ASN1 to C compiler out there: Get ASN1C here.
New to ASN.1?? Visit this site and be sure to pick up the excellent free book on ASN.1!
Well it depends if you are using signed or unsigned.
10 Unsigned is 2 Dec
10 Signed is -2 Dec (Assuming you are using 2 bit numbers)
1010 Unsigned is 10
1010 Signed is -6 (assuming that you are using 4 bit numbers)
Now to solve this problem you just give a leading 0
so There are 010 Types of People.
(What is 10 in binary?)
01010
This way solves any confusion.
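The conversions above can be checked with a short sketch, assuming two's complement and that the string length is the bit width (as_signed is a name made up for the example):

```python
def as_signed(bits: str) -> int:
    """Interpret a bit string as a two's-complement number of that width."""
    value = int(bits, 2)
    if bits[0] == "1":            # top bit set: the value is negative
        value -= 1 << len(bits)
    return value

for bits in ("10", "1010", "010", "01010"):
    print(bits, "unsigned:", int(bits, 2), "signed:", as_signed(bits))
```

With a leading 0 the signed and unsigned readings agree, which is the point of the joke.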
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
You had me until then; no self-respecting engineer would ever use those terms.
Dewey, what part of this looks like authorities should be involved?
A good binary XML specification could be an extremely good fit for us.
And don't suggest that we just compress XML and send that. Here's why: first we have to expand all that digitized data into some sort of ASCII encoding, which is then compressed. End result: no gain and a possible loss of precision in the data.
A real, live, useful binary XML spec could help us immensely. I say BRING IT ON!!!!
BTW, wasn't DIME supposed to address these problems? What happened to DIME, anyway?
In the course of every project, it will become necessary to shoot the scientists and begin production.
<noun val="point"/>
<prep val="of"/>
<noun val="XML"/>
<verb val="is"/>
<contraction val="it's"/>
<adjective val="easy"/>
<infinitive val="to"/>
<verb val="read"/>
<period val="."/>
Wbxml is very compact, easy to parse and it's standardized too. Have a look at http://www.w3.org/TR/wbxml/ .
Take Microsoft's XML formats as an example: Word, or the MSN messages format... they're _NOT_ XML. They're proprietary formats DISGUISED as XML.
If Microsoft doesn't respect text-only XML, what do you think will happen when^H^H^H^Hif binary XML is out?
to make inaccurate interpretations of the data and not using proper and accurate specifications.
:(.
Many people claim that XML is so great because you can "just read and understand it" without having to use cumbersome and hard-to-understand specifications. This is exactly what makes XML nice for typesetting purposes like HTML, and maybe as an alternative for simple configuration files etc., but NOT for RPC and databases, as you write. I couldn't agree more.
I have seen so much time and money lost due to intuitive but false interpretations of XML schemas. People think that because it's human readable with "meaningful" tag names, they don't need a proper spec any more. Well, I guess it fits in nicely with today's "cut and paste" programmers who don't really know what they're doing.
When I said MS Word format, I meant "MS Word HTML output format".
If your boss insists on XML, write a documented binary interface and then a converter that reads the documented binary interface and outputs XML. Most of the time that would violate "you ain't gonna need it", but very few projects ever really spend much time doing good design.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
I've been saying this since the beginning: why send ten times the bytes for the same amount of data? Sure, it's human readable and writeable, but how many times do humans actually read or write XML (I'm talking about web users here, not us /.ers ;))? It seems to me that if something is primarily machine read and written, a binary format makes much more sense: it's more compact and can be interpreted by the machine much faster.
Advances in networking and processing power go a long way in addressing performance concerns, though perhaps not on battery-constrained mobile phones, he said.
And that quote exemplifies the reason why we have much faster machines but still feel bogged down doing the same things. The speed advantage is largely negated by inefficient coding and data storage formats such as XML. You cannot always assume the next round of hardware will make things fast enough. I'll be glad when we reach the limits of silicon and Moore's Law is put to rest, because it will force people to stop thinking of fast hardware as an excuse for sloppy coding and bloat.
I've discovered a remarkable proof, but this margin is too small to contain it...
a) Use an internal representation of the DOM tree.
b) Publish the specs
c) DON'T call it XML. Try "Extensible Tree Based Binary Format" or something. Just because XML is a standard people want XML to devour everything as some kind of spec blob.
In other words:
Don't like XML? DON'T USE IT!
The answer to your question posted on slashdot on January 14 2005 written in plain English, containing 31 letters and no binary data or images and entitled "Does the World Need Binary XML?" is given to you after long deliberations and consultations with higher authorities and spelled in plain English as "Yes"!
You seem to have a marked lack of appreciation for the intelligence, knowledge, and experience of the folks behind XML. May you be cursed to use ASN.1 until your appreciation improves.
I don't get it. If you really want to hide a binary blob in XML, why not just call it UTF-8 and decode it as you wish?
You can put e.g. JPEG data bytes in and call it UTF-8 if your parser knows what to expect.
XML does support 8 and 16-bit encodings, right?
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
What is text? It's a binary code that a computer translates into graphical glyphs. Is it proprietary? Not any more. Your computer is what turns that binary code into something that means something to you. It doesn't even mean something to everyone (in fact iirc the first line in an xml file identifies a code page for using the intended symbols). So firstly, opaqueness even on "text" is not quite black and white. Second, what is transparent to YOU may be completely opaque to software, I'll elaborate on this later.
So what could binary XML be? It's a binary code that translates into XML syntax. Except it's easier to deal with for software, there's no processing. Let me present this example, which I will endeavor to use over and over. . I could write this in binary as 0001010203, obviously to do that i'd have to store the strings "mytag", "is", "simple" in a string table elsewhere, but this is just a simple example. I made "0x00" mean "a tag", the first "0x01" mean 1 attribute, and the rest are string references. Reading this tag would be very simple, fread(buffer,10,1,file) (i picked the two middle numbers out of the air since we have not really defined this format).
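A minimal sketch of how that record could be decoded, assuming the tag in question is the mytag is='simple' element from the poster's follow-up, that string index 0 is unused, and that each attribute is a pair of one-byte string references. None of this layout is a real format; the poster says as much.

```python
# Hypothetical string table stored elsewhere in the file; index 0 is unused.
STRINGS = ["", "mytag", "is", "simple"]

def decode_tag(record: bytes, strings):
    """Decode: 0x00 marker, attribute count, tag name ref, then name/value refs."""
    if record[0] != 0x00:
        raise ValueError("not a tag record")
    attr_count = record[1]
    name = strings[record[2]]
    attrs = {}
    pos = 3
    for _ in range(attr_count):
        attrs[strings[record[pos]]] = strings[record[pos + 1]]
        pos += 2
    return name, attrs

record = bytes.fromhex("0001010203")
name, attrs = decode_tag(record, STRINGS)
```

There is no scanning for angle brackets or quotes here, only indexing, which is the contrast with text parsing the comment is drawing.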
Saying that binary is proprietary makes absolutely no sense. Proprietary means property of an owner (usually a business). A file can't be proprietary. Its contents, the format of its contents, certainly. But a binary file is atomic, it's like the sky. It just is what it is. Binary XML COULD become proprietary, but it will not NECESSARILY happen. Nothing is inherently proprietary about a binary file. If the binary XML format satisfies the constraints of a standard in my first post, it will absolutely not be proprietary by construction, or so I think. I don't work in standards groups; people more experienced with their goings-on may point out additional refinements.
Your next point, recovering data. What do you use to read XML files? A text editor usually. What does that do? It reads a binary file (uh a text file!), applies some understanding of what the ASCII code (as an example) means, and displays it to you. Most of it is usually character data, but not always, there's a bunch of special characters that text editors often respond to for formatting or other things. Unix and PCs can't agree even on how to terminate a line. The point being even right now you can't totally say plain text XML is transparent, magic happens for you to just see it. Nothing about its presentation is defined (nor should be, imho). So what could binary XML be viewed with? Only slightly more overhead. You could "textify" as a preprocessing step to be viewed in a text editor. bin2txt myfile.bin.xml > myfile.txt.xml as an example. Or you could write your xml in plain text, and do the opposite. It's one to one, no loss. XML is just a syntax.
Now as for processing, I'll admit to waving my hands and skipping a few pieces. The XML syntax is defined clearly, there's no ambiguity (that I know of). However, the step of choosing text-like strings to declare the syntax elements is where it gets hairy. Your first step in writing a parser is to grab the syntax elements out of their native text string. This is disgusting, as compiler writers, language developers, etc. understand. You have to make lex/yacc scripts or workalikes to generate code, or worse, write your own (no one should do this, but that's purely my opinion and not defensible). There's a complicated state machine, some funny thing called LRM, and some other gotchas. All this just to take and break it into its constituent elements. Usually then you have a tree structure or some hierarchy that a computer can understand.
Take a look at some common XML libraries: xerces, libxml, a few others I can't remember. They're pretty damn big. Mostly, I argue, due to the text nature of their data. A lot of work goes into making text files usable by a program. A lot (but not all) of cruft can be cut by adopting a format that is simpler for software to understand.
Sure, people who write MS Word (i.e.
I am currently writing a XUL client/server application using XMLHttpRequest. Instead of processing XML data, which is slow, especially when you need to parse a data set several times a second, I started sending JavaScript code and data structures. In addition, the server code is written in Perl, so for storing status and configuration information I used serialized Perl data structures. Processing requirements fell dramatically. I still have the clear-text editing and inspection capabilities without the speed and space issues.
It seems like serialized script code, such as perl, python, java provides the benefits of xml without the headaches.
gah i got mungified
(angle) mytag is='simple'(angle)
Go find another standard to pollute, you asshats. We get to keep this one, and use it to reform HTML into something near its original goal: interop. DRM, compression, db, platform specific RPC, endian and everything else were all left alone so jackasses could fuck those standards up all they want. Throw us a fricken' bone and read the 10 guiding principles of XML again, douchebags.
DNS is binary; does that make it proprietary? Not at all. It is a published open standard in RFC 883 and later documents. Other examples include ASN.1/BER as used in SNMP. It's not whether it is binary or text that matters; it's whether it is openly documented and unencumbered by intellectual property claims (a separate issue some of XML has).
The decision of binary vs. text for a format should be the result of specific needs. XML is verbose. XML can be compressed for transmission purposes, but it still has to be uncompressed to its verbose form for parsing. If speed in parsing is necessary (it might be, as I have noticed quite a few XML-based programs are rather slow), a binary format can have things like length prefixes and continuation tags, instead of having to detect and verify collections of characters whose position is unknown. A parser that does not recognize a given tag, or does not need to process it, can in a binary format simply skip it by jumping the specified number of bytes. A binary format is well suited to machine processing.
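To make the length-prefix point concrete, here is a minimal sketch in Python. The record layout (1-byte tag id, 4-byte big-endian length, then payload) and the tag ids are made up for illustration; this is not any actual binary XML proposal.

```python
import struct

# Hypothetical record layout (an assumption, not a real binary XML format):
#   1-byte tag id, 4-byte big-endian payload length, then the payload bytes.
HEADER = struct.Struct(">BI")

def encode(records):
    """Pack (tag_id, payload_bytes) pairs into one byte string."""
    out = bytearray()
    for tag_id, payload in records:
        out += HEADER.pack(tag_id, len(payload))
        out += payload
    return bytes(out)

def decode(data, wanted):
    """Return (tag_id, payload) for tag ids in `wanted`, skipping the rest."""
    found = []
    pos = 0
    while pos < len(data):
        tag_id, length = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        if tag_id in wanted:
            found.append((tag_id, data[pos:pos + length]))
        # Wanted or not, the length prefix says where the next record starts,
        # so an unknown tag is skipped without inspecting its contents.
        pos += length
    return found

# Illustrative tag ids; BLOB stands for a tag this reader does not care about.
TITLE, BLOB, AUTHOR = 1, 2, 3
message = encode([(TITLE, b"Binary XML"), (BLOB, b"\x00" * 1000), (AUTHOR, b"anon")])
print(decode(message, {TITLE, AUTHOR}))
```

The 1000-byte blob is never scanned; the reader just adds its length to the position. A text parser would have to read every character of it looking for the closing tag.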
The usual argument for a text format spans the range of permitting humans to create the content for most things directly in an editor like vi or emacs (no wars here, I listed my favorite last), or reading that content directly, such as to diagnose the real cause of misunderstood errors. XML is too utterly complex for human creation or interpretation to be effective on a direct basis. There may be some argument that it can still be effective for diagnostic purposes (I have in fact needed to do so many times). Given that it is the powerful tools of XML that are used as the basis for the benefit of XML and promoting it, then what does it really matter what format is underneath as long as it is open and unencumbered?
A binary format for XML will absolutely not kill XML. DNS is obviously not dead (and you'll love it even more when IPv6 rolls into your network). What a binary format might do is weed out some of the weaker programmers who are sticking their fingers a bit too deep into the inner workings of some applications and tools.
now we need to go OSS in diesel cars
What you're describing is not a data format. You are describing a prepended index. Very different animals.
:)
Nevermind that binary formats are soooooooo easy to algorithmically validate for correctness. Oh, look! A valid pointer! I sure hope it points to something useful... Oops! Unexpected NULL value. Better fix that... Okay! Now it works!
user: could you add this feature?
Hmmm... Gotta add it to the data structure... Okay, I've got to make sure the client and server protocols match by version. Damn. Gotta rework that validation code because my offsets have changed. (Etc. etc. etc.)
Pop quiz: what is the binary representation of the string "my pretty little lamb"? How does it differ from the "text" representation of the same? How do you mark hierarchy? Do those markers use up less space than the one-byte '<' character? How do you allow for optional values as well as allow for modification and future expandability with a binary format? How much more efficient is binary parsing with validity checks for structure and data correctness when compared with text (XML) parsing?
And finally, for fifty points, how expensive is your time as a developer as compared to hardware processing time as a dollar value?
If your time writing, parsing, validating and debugging a binary format is cheaper over the course of a year than the same amount of money used to purchase server hardware, then you have made the right choice with a binary format.
Oh! And don't forget to comment your code and document your binary format. Those really suck for future code maintainers to reverse engineer.
Have a nice day!
- I don't need to go outside, my CRT tan'll do me just fine.
I don't know that I care about or for "binary XML". I don't terribly worry about the efficiency that might be gained by converting a textual integer like 3,000,000,000 into a 32 bit binary integer.
However, I might be interested in a "Pointer XML" - in an XML that allows me to use lseek-like operations to efficiently move around a document.
XPaths conceptually require parsing lots of the document. It's hard to skip over pieces - you have to process all of the bytes from the start of the document to the first place where the XPath matches.
Most of the "optimized XML" formats create a hash table from XPath to file location or binary. But this is still at least O(length of XPath string).
If there was a way of providing the link as a textual integer, and then lseeking to this, it's O(lg NbytesInXmlDoc). That might be a saving.
(Adage: don't worry about constants like 2X or 4X. Do worry about changing the O() efficiency.)
There would be no reason that such a "Pointer XML" could not remain entirely textual. It might simply be an extra syntax or modifier to an XPath:
Instead of linking to xpath
/a/b/c
Link to
/a/b/c || byte_position=5454786
The lseek positions would have to be in bytes, not characters, and would get confused if the coding system were changed. But they would at least be useful and usable if the coding system were not changed.
The hard part would be ensuring consistency. E.g. in the example above, you would want to ensure that the element at byte_position=5454786 really was the xpath /a/b/c. It would be bad if it was not. But I think that sort of thing could be checked in the same way that we check DTDs.
Also, some minor annotations, such as placing anchors at the lseek-ed to byte position, might help in maintaining such consistency.
Moreover, I would never advocate abandoning XPaths - I would just be suggesting including the lseekable byte positions as a performance hint. It should also be correct to ignore the byte positions and just use the XPath links.
By adding padding (blanks, whatever) you could avoid the need to change all of the lseekable byte position hints whenever you changed an element value.
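A rough sketch of the "byte position as a hint" idea, in Python. It makes big simplifying assumptions: a bare tag name stands in for a real XPath, elements have no attributes, and same-named elements are not nested. The function names are invented for illustration.

```python
import io

def element_at(stream, offset, tag):
    """Seek to `offset`; return the element text if <tag> really starts there, else None."""
    open_tag = b"<" + tag + b">"
    close_tag = b"</" + tag + b">"
    stream.seek(offset)
    if stream.read(len(open_tag)) != open_tag:
        return None  # stale hint: the bytes at this position are something else
    rest = stream.read()
    end = rest.find(close_tag)
    return rest[:end].decode("utf-8") if end != -1 else None

def lookup(stream, tag, byte_hint=None):
    """Try the byte-position hint first; fall back to scanning from the start."""
    if byte_hint is not None:
        text = element_at(stream, byte_hint, tag)
        if text is not None:
            return text, "hint"
    # Slow path, standing in for a full XPath evaluation.
    stream.seek(0)
    offset = stream.read().find(b"<" + tag + b">")
    if offset == -1:
        return None, "missing"
    return element_at(stream, offset, tag), "scan"

doc = b"<a><b><c>hello</c></b></a>"
stream = io.BytesIO(doc)
good_hint = doc.index(b"<c>")

print(lookup(stream, b"c", good_hint))      # hint is valid, no scan needed
print(lookup(stream, b"c", good_hint + 1))  # hint is stale, falls back to scanning
```

The hint is checked before it is trusted, and a wrong hint costs one failed read before the normal lookup runs, so ignoring the byte positions stays correct.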
...the world doesn't need XML let alone binary XML.
USE='-xml -xml2'
XML is ugly and totally not needed except by those 'dot-com all day long' fucks.
I never said they couldn't use it afterwards. Only that they should be kept away from the design process so that they cannot warp it to their own ends.
I would hope they would use it afterwards, along with everyone else in the very same way. Their track record with standards is, however, dismal by anyone's definition.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Why not use s-expressions if you need a sleeker representation than what XML gives you ? Admittedly, s-expressions are poorly suited to document representation (ie, openoffice documents), but for places like wire protocols, they seem ideal.
XML is plain stupid as an idea. Before you flame me, here are the reasons:
1) XML is almost unreadable by humans, especially if the XML document is complex and long.
2) XML is unreadable by computers. Computers need to parse XML, then convert it to binary data, then back to XML.
3) XML is not object-oriented. We made such a fuss the previous decade about object-orientation, and now our data are not object-oriented! XML applications must know a priori what to do with an XML document.
4) XML is easily parseable by computers, they say. So what? binary data are just as parseable, even more than text, and they have been so from the beginning of the computer era.
5) XML is editable with a text file, they say. Yet nobody uses a simple text editor to make XML files...we all use GUI apps.
What we need is a binary data format, that is structured and treated just like XML. It would be best if the format was object-oriented, i.e. a computer could ask some form of code to accompany the data (maybe p-code). Nowadays all computers are at least 32-bit...from the smallest handheld or cellphone, to the mightiest mainframe, all computers can easily handle 32-bit data. There is no excuse for the lame XML format.
it is called CORBA
I thought this was what ASN.1 was for, but the XML folks didn't like it because it wasn't human readable.
What can binary XML do that isn't already solved by ASN.1?
--it's how you use it. I know, I hear it all the time...
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."
Huh? It should be trivial to devise a more compact memory structure for that data. Given only a few minutes of thought I came up with this:
This should be about 1/4 the memory requirement of a simple 2D array of integers (assuming 32-bit integers), so long as most of the values in data set can be represented in 1 byte.
Other, more exotic, data structures could be used to get even better memory efficiency: they are called sparse matrices and have been well understood for decades. Do a google search, go to the library or ask a computer scientist for advice.
(Thank-you, thank-you. I'll be here all week. I do weddings and bar-mitzvahs and am available for hire)
I cannot believe that your naive post was modded up to a 5. FWIW the answer to all of your above questions is a resounding "Yes!", although some deserve a stronger "Yes!" than others. Let me state for the record that, from your newbie questions, you are XML-ignorant. And you apparently did not take compiler theory, where you would have learned how computationally expensive parsing was. But you are hardly alone; the industry is full of dumbasses who don't understand what's happening. I, on the other hand, predicted these problems four years ago and have yet to receive my Nobel Prize.
XML is a cluster fuck for the following reasons. Any message must be:
Note that at every step XML requires more CPU, more memory and more bandwidth. This is true for every component of the network! There is no way around these problems other than sheer computing power and throughput. So, one might say, the problem will disappear if we merely wait a few years. Unfortunately other factors are loading the Internet even more than XML, sapping Moore's Law.
And that's without considering the problems of the W3C's various XML committees! But don't get me started.
Yessssss! Finally someone who understands XML. (Well, nearly.)
It's for storing text for publishing. Remember publishing? Books? Articles? Reports? Text documents? Sure, you can stuff it with spreadsheet data and send it over the wire. Sure, you can generate it from a database with element names 25 yards long and containing 1 byte of data. You can also try to open a sardine can with a banana.
But if you edit XML with Notepad, you a) haven't understood XML and b) deserve everything you get. It's by no means brain-dead, and it had nothing to do with scripting developers (a fine red-herring, that). It certainly has been pushed into areas where it didn't belong. It's a tribute to XML that it has actually performed well in those areas in circumstances where the document type has been carefully designed, but they are rare.
what about:
<column name='column 1' etc='some other stuff'>123 456 789 135 458 432 <!-- ... --></column>
<column name='column 2' etc='some other stuff'>789 135 458 432 <!-- ... --></column>
<column name='column 3' etc='some other stuff'>123 135 458 432 <!-- ... --></column>
XMLSchema has support for list types.
see http://w3.org/TR/xmlschema-0/#ListDt
Now, DOM/SAX/etc have no useful/performant way to separate this data, but that's your fault for using DOM/SAX as an API.
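For what it's worth, here is roughly what "separating this data" looks like on the application side, as a small Python sketch using the standard library's ElementTree. The `<table>` wrapper element and the `read_columns` helper are invented for the example; the point is only that the API hands back one string per column and the caller splits it.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document around the column elements.
doc = """<table>
  <column name='column 1' etc='some other stuff'>123 456 789 135 458 432</column>
  <column name='column 2' etc='some other stuff'>789 135 458 432</column>
</table>"""

def read_columns(xml_text):
    """Map each column name to its whitespace-separated integers."""
    root = ET.fromstring(xml_text)
    columns = {}
    for col in root.iter("column"):
        # The parser returns the whole list as one text node;
        # splitting it into items is left to the application.
        columns[col.get("name")] = [int(tok) for tok in (col.text or "").split()]
    return columns

print(read_columns(doc))
```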
"I guess you have never heard of a large segment of the computing world refered to as embeded systems."
Since we are discussing XML, I never considered embedded systems. I was simply responding to his statement that text was a sloppy way to store data. I wasn't referring to XML, just that statement.
I personally think that xml is well suited for very few applications. I've never found a good reason to use it over other, more conventional formats.
I completely agree with you, but that wasn't my point. I guess I should be marked off topic.
You are correct that your scheme would work to fix the problem I described. I made the mistake of describing the problem as being a lot simpler than it really is, however.
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
that's your fault for using DOM/SAX as an API
The biggest reason for considering using XML is that it "bought" us access to some useful standard libraries and tools that others could use to look at our data. Get rid of that, and there's also no reason to bother with XML anymore and we might as well go with something we invent on our own.
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
Regular compression (like gzip) helps the file size issue, but it does not allow for random access of the XML.
Wait a minute, XML is text and you cannot randomly access it anyway. Well that's the point of binary XML. The focus on XML compression seems to be missing the key advantage of binary XML. That is, a binary XML format could allow indexes of elements and attributes for fast access of complex pointer-rich data structures.
Random access of a text format simply cannot be done in a sensible way. gzipping XML doesn't help give XML random access.
You left out so much there it's not funny. Let's try for some *real XML, shall we?
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE binaryxml [
<!ELEMENT binaryxml (encoding?, bytes)>
<!ELEMENT encoding EMPTY>
<!ATTLIST encoding base CDATA #REQUIRED>
<!ELEMENT bytes (byte*)>
<!ELEMENT byte (bit+)>
<!ATTLIST byte bits CDATA "8">
<!ELEMENT bit (#PCDATA)>
<!ATTLIST bit seq CDATA #REQUIRED>
]>
<binaryxml>
<encoding base="64"/>
<bytes>
<byte bits="8">
<bit seq="0">0</bit>
<bit seq="1">1</bit>
<bit seq="2">1</bit>
<bit seq="3">0</bit>
<bit seq="4">1</bit>
<bit seq="5">0</bit>
<bit seq="6">0</bit>
<bit seq="7">1</bit>
</byte>
<!-- snip -->
<byte bits="8">
<bit seq="0">0</bit>
<bit seq="1">0</bit>
<bit seq="2">0</bit>
<bit seq="3">0</bit>
<bit seq="4">0</bit>
<bit seq="5">0</bit>
<bit seq="6">0</bit>
<bit seq="7">0</bit>
</byte>
</bytes>
</binaryxml>
Your question should be "Why would they want to store images in XML?"
And the answer is that, much more often than not, you need to store meta-information or additional information related to the image; and that is typically in an extensible, self-defining, hierarchical, tag-value format.
Note, by the way, that the question applies equally well to audio.
There are several completely different formats in which to store a picture or audio along with additional information. It's been often noted that this information is exactly the kind of information that XML holds quite nicely. The actual pixel or audio data, however, does not fit in XML well. These are typically stored in binary, because they tend to vary from large to gigantic; for this reason compression is quite common. Size is very much a concern for these data. To store in XML, these need to be converted to character, adding an often-unacceptable increase in size. XML is not an option.
There is a lot of attractiveness in having a single format that could be used for binary data with many of the benefits that XML has for text. Experience with existing applications points to viability and benefits for future uses, even if existing applications remain with the current standards.
Note: I've heard the suggestion that the additional or meta-content could be in XML, and have the XML reference the binary content. Although that works for the relationship of HTML and images on the web, for this purpose it's a non-starter. These formats are designed to contain the data, not just describe it; in this model, you would still need to define some format to hold the data. (And, then, why not just stick with the current?) Managing the two parts separately introduces possible mismatch risks not possible when they are a single unit; they could get separated, become different versions, etc. An absolute location reference mechanism can prevent copying the data from place-to-place, and even a relative reference mechanism complicates using an instance. Just not going to fly.
What's wrong with just compressing (bzip2/gzip) the files to speed downloads... and save space? I doubt all the string handling is that big a bottleneck on modern processors. It's not like they are using XML to render fur or anything.
Can anyone here set me straight?
#6495ED - cornflower blue
I can think of one simple thing that could help a bit: make the closing tag name optional. In other words, instead of <tag>data</tag> you could simply use <tag>data</>.
I've taught basic XML to a number of people, and in almost every case, they have the same reaction: "Why does the field name have to be duplicated like that?". I think it's a question that's deserving of some serious consideration.
Except now people think you're talking in octal.
010 is eight.
It seems that a solution to both problems would be to create a standard xml-stream or xml transport protocol that could be utilized when you need to extract information from a database yet still be able to render it in xml form on the receiving end for maximum flexibility.
This would offset a large portion of the parsing and DOM work to the client side, which would be ideal for web services that are currently overburdened by having to generate the markup required by XML.
The transport protocol could minimize the redundant information by first defining the document's structure and then transferring the data in a more compact form. The receiving side could then either recompile it into standard xml or if it already knew the final destination of the information (such as a database) it could bypass the extra parsing and directly access the required data.
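A toy version of that idea in Python: the sender states the structure once, then ships bare values, and the receiver can either rebuild ordinary XML or use the rows directly. Using JSON as the compact carrier, and the `pack` / `unpack_to_xml` names and the `<rows>` wrapper, are all assumptions made for the sketch.

```python
import json
from xml.sax.saxutils import escape

def pack(element_name, fields, rows):
    """Sender: declare the structure once, then send only the values."""
    return json.dumps({"element": element_name, "fields": fields, "rows": rows})

def unpack_to_xml(payload):
    """Receiver: expand the compact message back into plain XML text."""
    msg = json.loads(payload)
    name = msg["element"]
    parts = []
    for row in msg["rows"]:
        cells = "".join(
            f"<{field}>{escape(str(value))}</{field}>"
            for field, value in zip(msg["fields"], row)
        )
        parts.append(f"<{name}>{cells}</{name}>")
    return "<rows>" + "".join(parts) + "</rows>"

payload = pack("user", ["id", "name"], [[1, "alice"], [2, "b&b"]])
print(payload)
print(unpack_to_xml(payload))
```

A receiver that already knows where the data is going (say, a database insert) could read `msg["rows"]` directly and never build the XML at all.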
I don't think so. Marked-up binary similar to what's in EBML has been around in the telecommunications industry for a while. There's ASN.1 (complete with standardised XML encoding), and also WBXML (oh, and this *is* published by the W3C, as a Note). Still, their design is at odds with many of the principles behind XML, but they're extensible and contain tag-like metadata.
Eh, you stuck this on the wrong post, I pointed out ASN.1 is not dead. Or have you never heard of GSM?
...but I thought that the strategic goal of XML is to sell more hardware.
We should rejoice, buy more CPUs, and move the problem from XML, to languages with poor concurrency support.
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
Universal Binary Format has been around for a few years now, and it includes everything binary XML would have, but in a cleaner, more well-thought out form, in addition to having an extra higher-level protocol for inter-machine transport and security issues.
In the great CONS chain of life, you can either be the CAR or be in the CDR.
Gzip the XML document and it will be even smaller than the original notation. Gzip removes the extra verbosity of the tags and of your data, and you gain standard data representation for free (accounting only storage space).
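Easy enough to try. A quick sketch with Python's standard gzip module on a made-up, deliberately repetitive document (real documents will compress less dramatically, and the exact sizes depend on the data):

```python
import gzip

# Synthetic, highly repetitive XML: the tag names dominate the byte count.
rows = "".join(
    f"<record><id>{i}</id><status>ok</status></record>" for i in range(1000)
)
xml_bytes = f"<records>{rows}</records>".encode("utf-8")

packed = gzip.compress(xml_bytes)
print(len(xml_bytes), "bytes as text,", len(packed), "bytes gzipped")

# Lossless: the receiver gets the exact original document back.
assert gzip.decompress(packed) == xml_bytes
```

This only addresses storage and transfer size. The receiver still has to decompress and then parse the full text.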
Rethinking email
To complicate matters further, there are ASCII text files, UTF-8 text files and Unicode text files (and of course EBCDIC on IBM mainframes). If you only have an ASCII text reader, you won't be able to read Unicode and you may not be able to read a UTF-8 file if it has characters in it that take more than one byte. So "plain text" doesn't really mean one thing anymore.
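A small Python illustration of the ASCII-reader problem. The sample string and the `try_decode` helper are just for the demo.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

text = "na\u00efve caf\u00e9"          # two non-ASCII characters
utf8_bytes = text.encode("utf-8")

print(try_decode(b"plain ascii", "ascii"))  # pure ASCII reads fine
print(try_decode(utf8_bytes, "ascii"))      # multi-byte characters break an ASCII-only reader
print(try_decode(utf8_bytes, "utf-8"))      # a UTF-8 reader gets the text back
print(len(text), "characters,", len(utf8_bytes), "bytes")
```

Note the character count and byte count differ, which is also why byte offsets into a "text" file depend on the encoding.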
My thoughts on this.
The problem is that high performance microprocessors are simply too complex. How many assembly hackers actually understand out-of-order execution (which pretty much all desktop processors do, with the notable exception of Transmeta) ? How many are aware that branches can sometimes severely degrade performance and thus do tricks like predication with conditional moves or loop unrolling ? How many perform loop pipelining, to eliminate data stalls/waits ?
And yeah, a good compiler does all this.
The Raven
I fail to understand why parent is modded down, he is absolutely right.