Google Open Sources Its Data Interchange Format

An order of magnitude over XML? by Anonymous Coward · 2008-07-08 08:10 · Score: 5, Funny

So is, well, just about anything.

Re:An order of magnitude over XML? by dedazo · 2008-07-08 08:34 · Score: 5, Interesting

Looks like Google just invented the IIOP wire protocol, which is also platform agnostic and an open standard.
I guess the main difference here is that their "compiler" can generate the actual language-domain classes off of the descriptor files, which is a definite advantage over "classic" IDL.
"Google Protocol Buffers" is cooler than the OMG terminology, but this kind of thing has been around for 20 years.

--
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
Re:An order of magnitude over XML? by alexgieg · 2008-07-08 08:45 · Score: 4, Funny

An order of magnitude over XML? So is, well, just about anything.
Well, let's also not forget that the meaning of the expression "an order of magnitude" depends strongly on the numeric base you're using.

--
Conservatism: (n.) love of the existing evils. Liberalism: (n.) desire to substitute new evils for the existing ones.
Re:An order of magnitude over XML? by jellomizer · 2008-07-08 08:57 · Score: 1

But the Slashdot ad above the message says XML combined with Java is fast, and the slow part is the database server. Could I be mistaken?

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Re:An order of magnitude over XML? by kriston · 2008-07-08 09:22 · Score: 3, Funny

Oh, I'm a little ashamed that I recognize this message as CORBA flamebait.

--
Kriston
Re:An order of magnitude over XML? by jd · 2008-07-08 09:53 · Score: 2

Given how evil Google can be at times, we can assume they are working in base 13.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:An order of magnitude over XML? by sconeu · 2008-07-08 09:59 · Score: 4, Funny

Nobody makes jokes in Base 13!

--
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Re:An order of magnitude over XML? by jd · 2008-07-08 10:12 · Score: 4, Informative

Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.
There have been all kinds of attempts to produce this sort of stuff. RPC, DCE, CORBA, DCOM, etc, are programmatic interfaces and handle function calls, synchronization, etc. OPeNDAP is probably the closest to Google's architecture in that it is ONLY data. It's more sophisticated, as it handles much more complex data types than mere structures, but it has its own overhead issues. It isn't designed to scale to terabyte databases, although it DOES scale extremely well and is definitely the preferred method of delivering high-volume structured scientific data - at least when compared to the RPC family of methods, or indeed the XML family. I wouldn't use it for the kind of volume of data Google handles, though; you'd kill the servers.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:An order of magnitude over XML? by mikecarrmikecarr · 2008-07-08 10:13 · Score: 1

You wouldn't believe how fast my NOOP protocol is... There's (almost) no I/O wait at all... :)

--
ID-10-T is a way of life
Re:An order of magnitude over XML? by DarkOx · 2008-07-08 10:16 · Score: 1

Yes, I have already looked over the code and made some modifications, mostly to the comments. It's now 1^9999 times faster than Google's original, honest.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:An order of magnitude over XML? by Penguin+Programmer · 2008-07-08 10:29 · Score: 1

XML is an order of magnitude faster than XML.
In base-1.
Re:An order of magnitude over XML? by vrmlguy · 2008-07-08 11:23 · Score: 5, Insightful

Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.
Actually, XDR (used for Sun's RPC) is very lightweight, arguably lighter than PB. (Yes, I foresee a Java implementation called PB&J.) XDR is potentially more compact, since it doesn't encode field identifiers, but it's also big-endian, which made it less attractive as little-endian computer architectures took over the world. Also, while XDR demands a fixed ordering of fields, field order in PB *isn't* specified; the field identifiers allow you to order the fields any way that you like.
Overall, I like it. It's obvious that the developers were familiar with the flaws of older protocols, and found ways to fix most of them. The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.
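To make the XDR vs. PB trade-off concrete, here is a rough Python sketch. It assumes only the documented basics: XDR integers are 4-byte big-endian with meaning given by position, while a PB varint field is preceded by a key of (field_number << 3) | wire_type. The function names (xdr_pack, pb_pack, pb_unpack) are made up for illustration, and only wire type 0 (varint) is handled.

```python
import struct
from typing import Dict, Tuple


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, least-significant group first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def xdr_pack(user_id: int, age: int) -> bytes:
    # XDR style: no identifiers on the wire, so the order is fixed.
    return struct.pack(">ii", user_id, age)


def pb_pack(fields: Dict[int, int]) -> bytes:
    # PB style: each value carries its field number, so order is free.
    out = bytearray()
    for field_number, value in fields.items():
        key = (field_number << 3) | 0  # wire type 0 = varint
        out += encode_varint(key) + encode_varint(value)
    return bytes(out)


def pb_unpack(data: bytes) -> Dict[int, int]:
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        if key & 0x07 != 0:
            raise ValueError("this sketch only understands varint fields")
        fields[key >> 3], pos = decode_varint(data, pos)
    return fields
```

With small values the tagged form is shorter (5 bytes vs. 8 for two fields here), but a value near 2^31 costs one key byte plus five varint bytes against XDR's flat four, which is where the "potentially more compact" point comes from.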

--
Nothing for 6-digit uids?
Re:An order of magnitude over XML? by jd · 2008-07-08 13:48 · Score: 1

The Mayans might have. (ref: Same article as parent)

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:An order of magnitude over XML? by vrmlguy · 2008-07-08 15:25 · Score: 3, Informative

The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.
A-ha! I found it! "Thus, the classes in this file allow protocol type definitions to be communicated efficiently between processes."
Why do you need this? Well, you may not. "Most users will not care about descriptors, because they will write code specific to certain protocol types and will simply use the classes generated by the protocol compiler directly. Advanced users who want to operate on arbitrary types (not known at compile time) may want to read descriptors in order to learn about the contents of a message."

--
Nothing for 6-digit uids?
Re:An order of magnitude over XML? by CedgeS · 2008-07-09 02:51 · Score: 1

Try this on for size. Additionally when making this I noticed a few quirks in the protocol: VARINT available at wire level, not in definitions, ENUMs capped at 2^31-1, VARINT at wire level, Field numbers capped at 2048? VARINT at wire level. These seem like implementation limitations. The uint64s below should probably be a new VARINT type.
FUCK THE LAMENESS FILTER:
message Proto {
  optional string package = 1;
  repeated string import = 2;
  repeated Proto imported = 3;
  repeated Message message = 5;
  repeated Extension extension = 6;
  repeated Enumerator enumerator = 7;
  repeated Option option = 8;
  repeated Service service = 9;
}

message Message {
  required string name = 1;
  repeated ScalarField scalar_field = 2;
  repeated EnumeratedField enumerated_field = 3;
  repeated MessageField message_field = 4;
  repeated Message message = 5;
  repeated Extension extension = 6;
  repeated Enumerator enumerator = 7;
  optional ExtensionRange extension_range = 16;
}

enum ScalarType {
  DOUBLE = 0;
  FLOAT = 1;
  INT32 = 2;
  INT64 = 3;
  UINT32 = 4;
  UINT64 = 5;
  SINT32 = 6;
  SINT64 = 7;
  FIXED32 = 8;
  FIXED64 = 9;
  SFIXED32 = 10;
  SFIXED64 = 11;
  BOOL = 12;
  STRING = 13;
  BYTES = 14;
}

enum FieldRule {
  REQUIRED = 0;
  OPTIONAL = 1;
  REPEATED = 2;
}

message ScalarField {
  required string name = 1;
  required ScalarType type = 2;
  optional FieldRule field_rule = 3;
  optional uint64 field_number = 4;
  optional string default = 5;
}

message EnumeratedField {
  required string name = 1;
  required string enumerator = 2;
  optional FieldRule field_rule = 3;
  optional uint64 field_number = 4;
  optional string default = 5;
}

message Enumerator {
  required string name = 1;
  repeated EnumeratorConstant constant = 2;
}

message EnumeratorConstant {
  required string name = 1;
  required uint64 value = 2;
}

message MessageField {
  required string name = 1;
  required string message = 2;
  optional FieldRule field_rule = 3;
  optional uint64 field_number = 4;
  optional string default = 5;
}

message ExtensionRange {
  required uint64 min = 1;
  optional uint64 max = 2;
}

message Extension {
  required Message extension = 1;
}

message Service {
  required string name = 1;
  repeated RPC rpc = 2;
}

message RPC {
  required string name = 1;
  optional string takes = 2;
  optional string returns = 3;
}

message Option {
  required string name = 1;
  optional string value = 2;
}
Re:An order of magnitude over XML? by Sentry21 · 2008-07-09 03:12 · Score: 1

Looks like Google just invented the IIOP wire protocol, which is also platform agnostic and an open standard.
For a second, I read that as 'the IHOP wire protocol', which sounded hopelessly delicious. Imagine my disappointment. :(
Re:An order of magnitude over XML? by Nynaeve · 2008-07-09 04:01 · Score: 2, Informative

I got a 404 on your link. Try this one.
Re:An order of magnitude over XML? by iwein · 2008-07-09 06:55 · Score: 1

Are people still using base 1? Otherwise it would be at least twice as fast, which is nice, but not impressive in comparison to XML.

--
Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.

Why another encoding scheme? by gladish · 2008-07-08 08:13 · Score: 1

Isn't xdr compact enough?

Re:Why another encoding scheme? by MightyMartian · 2008-07-08 08:48 · Score: 2, Informative

It's not hard because XML has to be the most bloated (and yet still, ironically, nowhere near human-readable) format ever invented. That it has not only not been discarded, but is now being used to store binary blobs by guys like Microsoft and OO.org is testimony to the sheer overwhelming stupidity of a lot of developers.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:Why another encoding scheme? by QuoteMstr · 2008-07-08 08:50 · Score: 4, Insightful

This is just yet another way in which Google demonstrates that it is suffering from NIH syndrome. Instead of improving existing tools, they have to go off and re-invent all the bad mistakes of the past, including non-relational databases, clunky binary encodings, and a bizarre non-POSIX filesystem.
Just imagine how far ahead we would be today if Google had put the same effort into creating tools the rest of the SQL-writing, open(2)-using world could use.
Re:Why another encoding scheme? by QuoteMstr · 2008-07-08 09:15 · Score: 1

I'm not trolling. I genuinely believe what I've written above.
Re:Why another encoding scheme? by FunkyELF · 2008-07-08 09:19 · Score: 1

As long as they're open sourcing their software, who cares if it does the same thing as another open source piece of software?
Should we flip a coin and have all people who work on MySQL or PostgreSQL switch to just one of them?
How would anything ever improve?
As far as their filesystem goes, any big company that deals with huge amounts of data does the same thing to fit their needs. I think DreamWorks, Pixar, and Facebook have their own filesystems.
Companies can't wait for the maintainers of software to accept their code.

It doesn't seem like Google is suffering from NIH syndrome to me. Their GWT uses Java and Eclipse...they created a new framework when there was nothing else like it out there and they used existing "Not invented at Google" technologies like Java and Eclipse. They didn't create their own language or IDE.

And look at Android: it's a framework that they invented, but it uses Java, which they didn't.

It could be worse like Apple which goes and uses a language which only they use (Obj-C), and proprietary IDEs which only they use.
Re:Why another encoding scheme? by Abcd1234 · 2008-07-08 09:33 · Score: 4, Informative

You think? Take BigTable. Wikipedia describes it as: '"a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases'. Sounds, to me, like a specialized solution to a very specialized problem, a problem that, I presume, didn't fit with any existing solution. Same goes with GFS. After all, do you really think they didn't evaluate existing solutions before embarking on building an entirely new distributed filesystem? Do you really think they're that stupid?
As for Protocol Buffers, given the existing solutions out there (such as ASN.1 and CORBA) are generally ugly and/or over-engineered, it sounds to me like they're simply addressing a gap in the industry... after all, XML and SOAP aren't the end-all and be-all of generic object-passing protocols.
Re:Why another encoding scheme? by miffo.swe · 2008-07-08 09:41 · Score: 4, Insightful

I don't think it's NIH syndrome. They no doubt tested other solutions before doing their own thing.
Don't forget this code is in widespread use and works very well. Google's server farm ain't exactly small and the load they see is probably second to none.
A couple of percent better efficiency for Google probably means millions in saved costs. Spending a couple of months of development on something like this is money well spent.
I guess if all you have is SQL everything is a SQL SELECT no matter what you want to achieve.

--
HTTP/1.1 400
Re:Why another encoding scheme? by hattig · 2008-07-08 10:15 · Score: 2, Interesting

Looking at the ProtoBuf documentation (lightly) it looks like stuff that any lazy programmer has implemented to make their life easier. For instance I have written code that will take a description file* (like a .proto file) and generate (a) the Java class file, (b) the SQL schema, and (c) the DAO code in-between. It did the camel-case conversion just like this .proto thing, etc. I'm sure the Google thing is far more polished and proven, of course, but hey ...
Adding on custom binary serialisation probably wouldn't take that long, although if I was to do it I would probably mimic ProtoBuf 'cos why reinvent the wheel (Google, take note). On the other hand, generating XML from an object is as simple as appending to a StringBuilder with some utility methods to do tags and attributes, and SAXParsers aren't the most inefficient things either.
However it clearly solves a problem for Google, and it looks simple to use.
(* well, actually I used Java 5 annotations on a barebones class object rather than having to parse a text file)
Re:Why another encoding scheme? by CoughDropAddict · 2008-07-08 11:46 · Score: 4, Insightful

You think it's a "mistake of the past" that Google wrote things like GFS and BigTable that run on commodity hardware, scale basically horizontally (eg. you can just throw machines at the problem) and survive machine failures without human intervention?
You don't "improve" on an existing tool like a relational database by adding a "feature" like fault tolerance. You have to redesign from the base up with those assumptions.
Re:Why another encoding scheme? by Bill,+Shooter+of+Bul · 2008-07-08 12:17 · Score: 1

I agree with you on all counts except Protocol Buffers. From the comments to the announcement, it seems as if they didn't really evaluate many popular solutions before inventing their own. It's clear they needed an alternative to XML. It's not really clear if they could have just used any of the other existing solutions to the problem. It does sound like they have been using this for a long time. So they may have invented it before many of the alternatives were released.

--
Well.. maybe. Or Maybe not. But Definitely not sort of.
Re:Why another encoding scheme? by joelwyland · 2008-07-08 12:24 · Score: 5, Interesting

Just imagine how far ahead we would be today if Google had put the same effort into creating tools the rest of the SQL-writing, open(2)-using world could use.
We wouldn't be ahead at all. We use different tools than they do because they are dealing with different volumes of traffic, data and demands. Let's take a moment and look at your specific complaints.
You say Google suffers from NIH syndrome. Having previously worked at Google, I think you are half right. The difference is that Google both benefits _and_ suffers from NIH syndrome. Sometimes the company spends too much time reinventing the wheel, but sometimes the tools out there aren't (and shouldn't be) useful to Google. Apache shouldn't be changed to support the kind of traffic that Google handles because then it wouldn't be nearly as good for all of the rest of the world. General software is great because it solves so many problems. However, general software isn't the right solution for all problems, especially extreme ones. Just about all of Google's needs are extreme ones due to the volume of traffic.
You dislike the idea of BigTable. Why not use the right tool for the right job? BigTable is a ridiculously fast database system that works beautifully with petabyte sized databases. SQL isn't the right answer to all problems. They DO use SQL... but when it is the appropriate solution. They have some really sexy internal tools for dealing with SQL and such and I'm hoping those are coming down the open source pipeline soon. :)
You claim the Protocol Buffers are clunky. I've used them and developed with them extensively. They aren't clunky at all, they are actually quite elegant and easy to use. They streamline development, are incredibly reliable, and are incredibly fast.
You obviously are confused by GFS as well. The system is transparent to the application by using standard i/o stream classes. It is inherently redundant to ensure data security. It is so fast in its response time that Google search is the fastest of any major player.
The list goes on and on. I don't really see how you can be upset at Google for making awesome software and then giving us access to it.
Re:Why another encoding scheme? by osu-neko · 2008-07-08 12:31 · Score: 2, Interesting

Definitions based on observed usage:
NIH syndrome (n): A condition suffered by individuals or organizations that roll their own solutions tailored specifically for their needs, rather than using the most recently hyped hammer on every nail.

--
"Convictions are more dangerous enemies of truth than lies."
Re:Why another encoding scheme? by SnowZero · 2008-07-08 22:31 · Score: 2, Insightful

~2000-2001 (I think the reference is in this video). Even if something newer is a bit better, we're not going to go back and port everything. Some future Google APIs will probably have an optional PB interface, because that's what it was being converted to internally anyway, so everyone might as well benefit from the compact over-the-wire encoding.
Re:Why another encoding scheme? by Bill,+Shooter+of+Bul · 2008-07-09 03:05 · Score: 1

Yeah. It's nice that they've open sourced it and it really does look like a good solution to the problem, but the fact that so much time passed between its creation and its release may hamper adoption. The weight of having Google behind it may overcome some of the problems, but for now it's just another XML replacement that will crop up from time to time and cause some confusion and frustration for people. As a web service developer, I expect it won't be long before someone complains that they can't access the service with it. *Sigh*.

--
Well.. maybe. Or Maybe not. But Definitely not sort of.
Re:Why another encoding scheme? by The+Slashdolt · 2008-07-09 04:54 · Score: 1

Having worked in organizations that suffer from this syndrome, I don't find this definition to be true at all. It's usually a case of arrogance and assumed intelligence over the rest of the world. The people who do this kind of thing have also, at some point in their career, attempted to write their own programming language that is "better". And usually have visions of one day creating their own kernel and/or OS.

That said, there are also times when NIH is required. Google being one example. They deal with more data than anyone. A custom-tailored solution that is only 1% more efficient in terms of time or space is a significant savings for them. So I find this justified in their case. I also commend them on giving their work back to the community.

--
mp3's are only for those with bad memories
Re:Why another encoding scheme? by Tetsujin · 2008-07-09 04:58 · Score: 1

This is just yet another way in which Google demonstrates that it is suffering from NIH syndrome. Instead of improving existing tools, they have to go off and re-invent all the bad mistakes of the past, including non-relational databases, clunky binary encodings, and a bizarre non-POSIX filesystem.
What exactly is clunky about defining an encoding for variable-length integers? I mean, if you're looking at the example and just writing one integer, it does seem a bit ridiculous to start writing from the least-significant bit, seven bits at a time, with the high bit set on each byte except the last... It seems simpler to just say "here's a four-byte field representing an unsigned integer" and encode it in network byte order...
But you have to consider how that scales for certain cases. Maybe you'll find yourself working with a lot of data sets containing lots and lots of really small numbers, but a few large ones, too. You can serialize that more efficiently if you make it cheaper to write out small numbers, and only moderately more expensive to write out larger ones.
Unicode does roughly the same thing, IIRC. It's just sensible.
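For anyone who wants to see the scheme spelled out, here is a minimal Python sketch of the encoding as described above: seven bits at a time, least-significant group first, high bit set on every byte except the last. The function names are just illustrative, and the sketch only covers non-negative integers.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # take the lowest seven bits
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(low7)         # last byte: high bit clear
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:      # high bit clear marks the final byte
            return value, pos
        shift += 7
```

Values below 128 take one byte instead of the four a fixed 32-bit field would use, while anything from 2^28 upward takes five, which is the "cheaper for small, moderately more expensive for large" trade-off.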

--
Bow-ties are cool.
Re:Why another encoding scheme? by shutdown+-p+now · 2008-07-09 05:17 · Score: 1

That it has not only not been discarded, but is now being used to store binary blobs by guys like Microsoft and OO.org is testimony to the sheer overwhelming stupidity of a lot of developers.
No, it's a testimony to the fact that good infrastructure to build your solutions on is more important than trying to pick (or invent) a new thing tightly optimized for your case when you don't actually need that optimization.
Google guys needed the extra speed, so they came up with this; good for them. Meanwhile, MS and Sun use existing mature and well-documented XML parser and serialization libraries where they are convenient to use, regardless of how "smart" you might think it is, simply because they don't want to waste their time on such trivial matters.
Re:Why another encoding scheme? by shutdown+-p+now · 2008-07-09 05:19 · Score: 1

You don't have to use the most recently hyped hammer; rather, the one that has been tested by time and proven to work just good enough. That's why Java is still going strong.
Re:Why another encoding scheme? by MrResistor · 2008-07-09 06:08 · Score: 1

The thing that you seem to be missing is that Not-Invented-Here is sometimes the same as Not-Right-For-Our-Application.
Yes, a variety of filesystems exist, but it could easily be the case that none of them are optimal for what Google needs to do. Google pushes around A LOT of data, and I'm guessing it's mostly of a similar type (format, file size, etc), so it makes sense for them to create a filesystem that is optimized for exactly the type of files they use. Most filesystems are general purpose, which means they are a compromise. They won't get optimal performance for any given file type, but the performance also shouldn't completely blow.
Similarly, relational databases are not the ultimate magic-bullet database solution. They are pretty good in most cases, and thus they make an excellent general purpose solution, but if you know, or even better can control, exactly what types of data you will be putting into your database, you can create a database engine that will easily outperform existing generalized solutions.
For the amount of data Google has to deal with, I'm willing to bet that the efficiency they're gaining from these customized solutions is well worth the effort of developing them. Let's say their non-relational database means they can get the job done with 10% fewer CPUs. How much money do you think that saves them in a year?

--
Under capitalism man exploits man. Under communism it's the other way around.
Re:Why another encoding scheme? by MrResistor · 2008-07-09 06:15 · Score: 1

Looking at the ProtoBuf documentation (lightly) it looks like stuff that any lazy programmer has implemented to make their life easier.
I think the question is, then, how do you know this isn't just a cleaned-up, better engineered version of the same stuff all their programmers already had in their own toolboxes? It might not have taken that much additional work to put it together. IIRC, Google encourages their employees to spend some time every week working on their own stuff, so this could just be one of the many cool things to come out of that policy.

--
Under capitalism man exploits man. Under communism it's the other way around.
Re:Why another encoding scheme? by vidarh · 2008-07-09 09:54 · Score: 1

The variable length integer encoding is also not something Google invented. I first saw that specific encoding in a paper on writing compact symbol files for Oberon in the mid 90's, and I'm sure it's much older than that.
Re:Why another encoding scheme? by Bill,+Shooter+of+Bul · 2008-07-09 12:44 · Score: 1

Also, it's not like this is the first time people have reinvented wheels. At my first job, as an intern, I recognized a need for a program to find text in a file. DOS 6.2 didn't have such a program with the flexibility I needed. So I took a day and wrote it in Pascal. Basically a poor man's grep -- with wildcards! I told my supervisor what I had done at the end of the week and he looked at me kind of funny and said, "You mean you rewrote grep? It's included in Turbo Pascal for DOS." I rewrote a lot of software that summer that had already been invented. Good times, good times.

--
Well.. maybe. Or Maybe not. But Definitely not sort of.

Likely story! by TheRealMindChild · 2008-07-08 08:13 · Score: 4, Funny

"Google's blogger claims, "And, yes, it is very fast -- at least an order of magnitude faster than XML."

That is just because they aren't using enough XML!

--

"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson

Re:Likely story! by caerwyn · 2008-07-08 08:26 · Score: 3, Informative

Are you serious? XML is great for certain applications, but the one thing it *isn't* is fast. It's very believable that something like this could be an order of magnitude faster.

--
The ringing of the division bell has begun... -PF
Re:Likely story! by jandrese · 2008-07-08 08:27 · Score: 4, Funny

Yeah, I mean XML didn't earn its reputation for being lightning fast and byte efficient for nothing...

--

I read the internet for the articles.
Re:Likely story! by cduffy · 2008-07-08 08:32 · Score: 5, Insightful

Being 10x faster than XML to work with is entirely believable: If you're serializing directly to binary structures, those structures can be directly manipulated without any parsing at all... and if you need to do some byte-swapping and alignment adjustments to get them into and out of native form for your current processor, those are still operations which can be performed in a matter of a few CPU instructions, rather than through a few hundred KB of libraries.
I drink the XML kool-aid plenty -- but there are things it's good for, and things it's not. Serializing and parsing truly massive amounts of data is part of the latter set.
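A small Python sketch of the difference being described, using only the standard library. The record layout (a uint32 id, a uint16 flags field, a float64 score, big-endian) is made up for the example; it is not any real wire format, and no timings are claimed here.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fixed layout: uint32 id, uint16 flags, float64 score.
# ">" means big-endian with no padding, so the record is 14 bytes.
RECORD = struct.Struct(">IHd")


def from_binary(blob: bytes) -> dict:
    # One unpack call: the layout is known up front, nothing to tokenize.
    rec_id, flags, score = RECORD.unpack(blob)
    return {"id": rec_id, "flags": flags, "score": score}


def from_xml(text: str) -> dict:
    # The parser has to scan tags and build a tree, then each value
    # still has to be converted from text.
    root = ET.fromstring(text)
    return {
        "id": int(root.findtext("id")),
        "flags": int(root.findtext("flags")),
        "score": float(root.findtext("score")),
    }


def swap_u32(value: int) -> int:
    # Byte-swapping a 32-bit integer between big- and little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]
```

Both functions return the same dictionary for the same record; the binary path just skips the text scanning and string-to-number conversion.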
Re:Likely story! by Reality+Master+101 · 2008-07-08 08:40 · Score: 1

To anyone who seriously believes Google's protocol is an order of magnitude faster than XML, I have two words for you: No.
You're right -- if it's less than two orders of magnitude faster, I would be very surprised.

--
Sometimes it's best to just let stupid people be stupid.
Re:Likely story! by dedazo · 2008-07-08 08:41 · Score: 2, Insightful

The 10x does not refer to the transmission speed (you're not getting that for a 100KB XML string vs. a 80KB binary blob), but the speed at which the [de]serialization occurs.
In fact this approach is even faster than runtime-specific stream serialization like cPickle in Python or the built-in binary formatter in the .NET CLR, because those use reflection.
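A toy Python illustration of that distinction, under loud assumptions: this is not what protoc emits and not how pickle or the .NET formatter work internally. It only contrasts a serializer whose field layout is fixed ahead of time (as generated code would have it) with one that discovers the fields by inspecting the object at runtime. The Point class and all function names are hypothetical.

```python
import struct


class Point:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y


# "Generated" style: the layout is baked in, so serializing is one call.
_POINT = struct.Struct("<ii")


def point_to_bytes(p: Point) -> bytes:
    return _POINT.pack(p.x, p.y)


def point_from_bytes(blob: bytes) -> Point:
    return Point(*_POINT.unpack(blob))


# "Reflective" style: walk the attributes at runtime and write each
# name alongside its value, because nothing is known in advance.
def reflective_to_bytes(obj) -> bytes:
    out = bytearray()
    for name in sorted(vars(obj)):
        encoded_name = name.encode("utf-8")
        out += struct.pack("<B", len(encoded_name)) + encoded_name
        out += struct.pack("<i", getattr(obj, name))
    return bytes(out)
```

The reflective version does more work per object and writes more bytes (12 vs. 8 for a Point here), but it works on any object with integer attributes without a code generation step.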

--
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
Re:Likely story! by lgw · 2008-07-08 09:18 · Score: 1

I've written an XML parser that was an order of magnitude faster than XML! Seriously, most XML parsers are horrifically bloated, making XML an order of magnitude slower than it needs to be. Their claims of 40-100x faster are believable, when compared to a typical XML parser.

--
Socialism: a lie told by totalitarians and believed by fools.
Re:Likely story! by Anonymous Coward · 2008-07-08 10:33 · Score: 1, Funny

Hi Google,
Is there any library to serialize that Protocol Buffers thing to XML?
Thanks.
Re:Likely story! by dgatwood · 2008-07-08 11:02 · Score: 1

Seems to me that if deserializing from XML is 10x slower than deserializing from any other format, unless you are either dealing with large binary blobs (in which case the performance benefits of being able to say "look for the next tag after 453 bytes" might make an order of magnitude difference) or are managing to transport lots of floating point numbers in a binary fashion, it probably isn't XML that sucks, but rather either your parser or your choice of tag names. You shouldn't be building up a full parse tree with a full XML parser and walking the DOM tree under those circumstances. If you use a lightweight XML dialect with short tag names and only entities for < and &, you can use a very fast, lightweight parser, e.g. something like the following (uncompiled, untested) code:
int curstate;
char *pos = ...
char *curtag = NULL;
...
if (*pos == '<') {
    pos++;
    curstate = parse_tag_contents(&pos, &curtag);
}
...
/* Parses a tag name starting just after '<' and advances past the '>'. */
int parse_tag_contents(char **current_position_in_file, char **tagname)
{
    char *startpos = *current_position_in_file, *endpos;
    int retval = ENTERING_TAG_CONTEXT;

    if (*startpos == '/') { retval = LEAVING_TAG_CONTEXT; startpos++; }

    /* The tag name ends at whitespace, '>', or end of input. */
    endpos = startpos;
    while (*endpos && *endpos != ' ' && *endpos != '\t' &&
           *endpos != '\r' && *endpos != '\n' && *endpos != '>') {
        endpos++;
    }

    /* Temporarily terminate the name so it can be copied out. */
    char terminator = *endpos;
    *endpos = '\0';
    asprintf(tagname, "%s", startpos);
    *endpos = terminator;

    /* Skip any attributes, then step past the closing '>' if present. */
    while (*endpos && *endpos != '>') { endpos++; }
    if (*endpos == '>') { endpos++; }

    *current_position_in_file = endpos;
    return retval;
}
And you've just done everything you need to do to parse the guts of an XML tag in such a minimal dialect....

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Likely story! by cnettel · 2008-07-08 12:13 · Score: 4, Informative

The problem is that, in my experience, it is easy to write a 99 % XML-compliant parser that is 10 times faster. That last percent, though...
Re:Likely story! by virgil_disgr4ce · 2008-07-08 14:30 · Score: 1

Plus, no one's mentioning the kind of scale we're talking about here: beyond big, beyond gigantic, into "WTF" levels.

--
Limina.Log
Re:Likely story! by NotZed · 2008-07-08 17:15 · Score: 1

Even the best and fastest XML parser will be much slower than a binary one (even a pretty badly written one). And that's even assuming it isn't trying to implement all of the stupid little rules you need to implement to make it fully compliant.

--
_ // `Thinking is an exercise to which all too few brains
\\/ are accustomed' - First Lensman
Re:Likely story! by indifferent+children · 2008-07-09 00:20 · Score: 1

Are those metric WTFs, Imperial WTFs, or Troy WTFs?

--
Censorship is telling a man he can't have a steak just because a baby can't chew it. --Mark Twain
Re:Likely story! by lgw · 2008-07-09 07:58 · Score: 1

Not one with the flexibility of XML to represent data "trees". If your data fits nicely into rows and columns, and the meta-data doesn't change much over time, XML is not a great way to represent that data. If you need to serialize more graph-like data, or backwards-and-forwards compatibility is key, XML just isn't that bad.
XML is also (barely) human readable, which is really nice when debugging (again, if the meta-data is stable, then you can just learn to understand the binary data with a hex editor, but if it changes constantly XML is quite handy).

--
Socialism: a lie told by totalitarians and believed by fools.
Re:Likely story! by cduffy · 2008-07-11 05:27 · Score: 1

If you're building an easier-to-parse and more-space-efficient language that isn't XML, why bother trying to make it look vaguely like XML? All you're doing is adding confusion and setting up for failure when someone tries to feed in a document that isn't compliant.
Part of the point of using XML is being able to use any generator with any parser and rely on third-party validation tools; if you're dealing with a restricted subset, that's off the table, and you might as well do something completely different (like what Google is doing here).

I bet ... by Anonymous Coward · 2008-07-08 08:15 · Score: 5, Funny

... it requires piping data through google's servers for data mining and ad injection purposes.

Re:I bet ... by eddy · 2008-07-08 08:55 · Score: 2, Funny

Hey, that's a pretty cool concept.
$ cat spanish.txt | http://google.com/language_tools/tr?ESEN | grep "terrorist"
I'm sure I'm years late to the party. <sigh>

--
Belief is the currency of delusion.

What? by Yvan256 · 2008-07-08 08:15 · Score: 1

Is that like PHP's serialize?

Re:What? by psergiu · 2008-07-08 08:33 · Score: 1

More like the Oracle SQLLoader ...
Or the VMS Fixed Record Length/Indexed or VFC files ...
I think Google might just receive a visit from the patent fairy ...

--
1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.
Re:What? by Foofoobar · 2008-07-08 08:59 · Score: 1

No. This is more along the lines of a hashmap or a multidimensional array. With serialize in PHP, you still have to unserialize which takes time to parse. With a multidimensional array, it's already in a usable state; no additional parsing is required. And you can add on or remove variables whenever you want without having to reparse.

--
This is my sig. There are many like it but this one is mine.
Re:What? by merreborn · 2008-07-08 09:00 · Score: 2, Informative

1) It has a binary format, far more compact (and faster to unserialize) than PHP's text-based serialized format.
2) It handles multiple versions of the same objects (e.g., your server can interact with both PhoneNumber 2.0 and PhoneNumber 3.0 objects relatively trivially)
3) It generates code for converting each format into objects in their 3 supported languages.
So, no, not really.
Re:What? by six · 2008-07-08 10:02 · Score: 1

It has a binary format, far more compact (and faster to unserialize) than PHP's text-based serialized format.
I wouldn't say *far* more compact unless you are just serializing a bunch of booleans and ints; for an average string-filled packet the gain from going binary won't exceed 10-20% in size. OTOH, you are right about performance.
It handles multiple versions of the same objects (e.g., your server can interact with both PhoneNumber 2.0 and PhoneNumber 3.0 objects relatively trivially)
This is only an issue when you need to share definition files (.proto for Google protocol buffers) between clients and server. Many serialization formats (including PHP's) don't need that, because they include the definition with the serialized data (slower, bigger, but simpler for developers).
It generates code for converting each format into objects in their 3 supported languages.
Well, not needed with PHP either, a call to unserialize() will create and return all the needed objects.
So, no, not really.
Yes it is, it's just serialization which was - unlike PHP serialize() or XML - designed for small size and performance.

Faster than XML? by sdsucks · 2008-07-08 08:17 · Score: 1

I must say - I'm amazed.

No PERL API ??!!?? by Proudrooster · 2008-07-08 08:18 · Score: 4, Insightful

C++
Python
Java

what about PERL ? :]

We love the sight of power by heroine · 2008-07-08 08:20 · Score: 1

Just think of the kind of power it took to make millions of employees standardize on the same format for their data interchange. Humans just gravitate to power wielding forces. Wonder what format they require for their surprise blog posts.

Re:No PERL API ??!!?? by Anonymous Coward · 2008-07-08 08:21 · Score: 4, Insightful

Go out and write one, sonny!

That's the beauty of open source.

How about C? by microbee · 2008-07-08 08:24 · Score: 1

SunRPC is old and awkward. Always want something better.

Re:How about C? by AuMatar · 2008-07-08 08:51 · Score: 2, Insightful

They gave you C++. If you can't translate C++ to C, please turn in your keyboard and leave.

--
I still have more fans than freaks. WTF is wrong with you people?
Re:How about C? by vigmeister · 2008-07-08 09:11 · Score: 3, Funny

Well, I can't translate C++ to C until after it is DECLASSIFIED...
*rimshot*
Cheers!

--
Atheist: Buddhist in a Prius
Re:How about C? by pavon · 2008-07-08 09:34 · Score: 1

They gave you C++.
Well, not really - they didn't provide a C++ library. They provided a compiler that will translate a data structure definition (.proto file) into C++ classes to parse the data for you.
If you have to manually translate C++ to C each time you create a new data structure, then you have lost nearly all of the advantage that this tool has over SunRPC or CORBA/IIOP - namely the simplicity.
What is needed is a new proto to C compiler, which is very doable since they provided all the source, but not just a matter of translating C++ to C.
Re:How about C? by gbjbaanb · 2008-07-08 10:24 · Score: 1

It reminds me a little of gSoap (only without having to piddle about with XML documents), in that it creates C++ classes to parse the data for you.
I think that's the least of the benefits though, the serialisation format would give it speed, the data format gives it smaller payloads. Both are benefits over XML, so this should be superior even if you read the data using a generic parser that worked through metadata at runtime only.

Now just release Goobuntu... by mdm-adph · 2008-07-08 08:26 · Score: 1

...and we'll be happy.

--
It is by my will alone my thoughts acquire motion; it is by the juice of the coffee bean that the thoughts acquire speed

Re:Now just release Goobuntu... by Anonymous Coward · 2008-07-08 08:42 · Score: 1, Insightful

Trust me, you won't.
Typed on a Goobuntu machine.
Re:Now just release Goobuntu... by fph+il+quozientatore · 2008-07-08 08:48 · Score: 2, Funny

Here, fixed the typo for you:

Now just release Boobuntu...
... and we'll be happy

--
My first program:
Hell Segmentation fault
Re:Now just release Goobuntu... by mdm-adph · 2008-07-09 05:10 · Score: 1

Oh, come on, anonymous Google employees -- is it really that bad...?

--
It is by my will alone my thoughts acquire motion; it is by the juice of the coffee bean that the thoughts acquire speed

Back to the 70's night? by Madball · 2008-07-08 08:27 · Score: 1, Insightful

But here, in an unusual departure from the norm, the default values for these members are set to digits (for strings or literals) or values (for numerals) that define their place in a sequence -- where they fall within a record.

Wow! They've invented fixed position data files. What will they invent next, a cool new programming language called RPG?

Re:Back to the 70's night? by Temporal · 2008-07-08 08:32 · Score: 3, Insightful

Wow! They've invented fixed position data files. What will they invent next, a cool new programming language called RPG?
The article is actually completely wrong there. The protocol buffer binary format uses tag/value pairs, not fixed positions. Parsers simply ignore any tag they don't recognize and move on to the next.
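
To make the tag/value point concrete, here is a rough Python sketch of how a reader can walk the wire format and step over fields it has never heard of. It is an illustration based on the published encoding (each key is a varint holding field_number << 3 | wire_type), not Google's actual parser. The names read_varint and parse_known_fields, the sample message and the field names "id" and "name" are made up for the example, and the deprecated group wire types (3 and 4) are not handled.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at buf[pos]; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not (byte & 0x80):              # high bit clear means last byte
            return result, pos
        shift += 7


def parse_known_fields(buf, known):
    """Walk key/value pairs, keeping only field numbers listed in `known`."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                 # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:               # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:               # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:               # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if field_number in known:          # unknown fields are simply dropped
            out[known[field_number]] = value
    return out


# field 1 = varint 150, field 2 = bytes "hi", field 9 = varint 1 (unknown to this reader)
message = bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69, 0x48, 0x01])
print(parse_known_fields(message, {1: "id", 2: "name"}))   # {'id': 150, 'name': b'hi'}
```

Field 9 is not in the reader's table, so it is decoded just far enough to know where it ends and then skipped, which is what lets an old reader cope with a newer message.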

Re:No PERL API ??!!?? by jandrese · 2008-07-08 08:28 · Score: 1

I'm sure it won't take long for the module to show up on CPAN.

--

I read the internet for the articles.

As a former user of CORBA by Anonymous Coward · 2008-07-08 08:28 · Score: 5, Interesting

It looks like Google has taken some of the good elements of CORBA and IIOP into its own interchange format.
While CORBA certainly is bloated in a lot of ways, the IIOP wire protocol it uses is vastly faster and more efficient than any XML out there, and yes, it is just as "open" (publicly documented and freely available for use in any open source application) as any XML schema out there. J2EE uses IIOP as well, and it is technically possible to interoperate (although the problem with CORBA is that different implementations never really interoperated as they were supposed to).
As a side note, I'd rather write IDL code than an XML schema any day of the week too, but that's another rant.

compare to thrift ( from facebook) by Anonymous Coward · 2008-07-08 08:29 · Score: 5, Informative

Both are really from the same design sheet, but Thrift has been open-sourced for over a year and has many more language bindings. It's been in use in several open-source projects (thrudb comes to mind), and has many more extant articles and documentation.

http://developers.facebook.com/thrift/

Fast by JamesP · 2008-07-08 08:30 · Score: 5, Interesting

"And, yes, it is very fast -- at least an order of magnitude faster than XML."

Just wait for the XML zealots to come crashing and not believing that XML is not the fastest, best, solution to all the world's problems (including cancer) and of course people at Google are amateurs and id10ts and WHY DO YOU HATE XML kind of stuff.

Or, as Joel Spolsky once said: http://www.joelonsoftware.com/articles/fog0000000296.html

No, there is nothing wrong with XML per se, except for the fans...

--
how long until /. fixes commenting on Chrome?

Re:Fast by shutdown+-p+now · 2008-07-09 04:58 · Score: 1

The only thing that goes for XML is the fact that it has plenty of optimized libraries and tools for virtually every aspect of XML processing; but, it's a big thing. That's why it's touted as "the format for data exchange" - it may be slow, but you know that it can be fed to virtually anything either directly, or with a bit of XSLT juggling.

Smart move by ruin20 · 2008-07-08 08:32 · Score: 5, Insightful

Since they're Google, people will clamor over this (as we're doing here), and the result will be that at least a handful of folks will learn and use it. Google's key to success has always been finding fresh talent and removing barriers to their contributing and advancement, so from what I've seen, what they've done is A) help train potential employees on how their tech and thought process work, and B) provide themselves a filter by which to gauge the ability of a potential employee to understand their system.

And as a bonus, they help undermine opponents who use competing technologies by helping train the workforce away from their practices. Overall I think it's a very intelligent and well-executed strategic move.

--
Oh honey look... How cute... an angry slashdotter!

XML was not created for speed by UseCase · 2008-07-08 08:35 · Score: 1

Binary encodings, non-hierarchical string lists, and simple file serialization are all faster than XML. XML was created for flexibility, commonality and human readability, not speed. XSL, XQuery, and XPath along with the DOM or SAX supply out-of-the-box query, transformation, and manipulation capability.

Re:XML was not created for speed by ORBAT · 2008-07-08 09:40 · Score: 1

XML is "human readable" in the same sense as TECO programs are.
Re:XML was not created for speed by Ant+P. · 2008-07-08 12:56 · Score: 2, Informative

XML was created to look like SGML but with more strict parsing rules. The rest of those TLAs you list were created out of sadism.
Re:XML was not created for speed by UseCase · 2008-07-09 04:05 · Score: 1

From the IDL documentation:
"However, protocol buffers are not always a better solution than XML -- for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also -- to some extent -- self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file)."
As for the IDL, they came up with a solution that would work for their specific needs. Given the circumstances, most dev teams would do the same.
Everyone (who wants to) knows the SGML/XML historical association by now. Mark-up has its place. My post was about the irrelevance of the speed comparison, not any specific love or hate for a mark-up technology.

The killer feature is simplicity by jandrese · 2008-07-08 08:37 · Score: 5, Insightful

The point of this isn't so much that it's faster than XML (so is everything else), it's that Google took everything that a real person needs in an IDL and cut out everything else. Most IDLs have a serious case of second-system effect, where features are added that nobody uses but that seriously complicate the API. Even XML suffers from that (have you ever seen the kind of data structure you need to store a DOM, or what that does to library APIs for manipulating XML?).

I'd use it because 95% of the time all I need is something simple like this, and the other 5% of the time I should go back and rethink my design anyway.

That said, there is still a case for XML, especially the self documenting and human readable nature of the document, but there are a lot of cases where it is used today where it only adds unnecessary complexity and actually makes your code more difficult to maintain instead of simpler.

--

I read the internet for the articles.

When can we talk this way? by Sybert42 · 2008-07-08 08:37 · Score: 1

So...when can we abandon these silly letters and decimal numbers to express ourselves in binary? It's like the elephant in the room. We all want a semantic web, but we all want it in English. At least Lojban has a start on a parsable language, but it still wants to be speakable.

Re:When can we talk this way? by maxume · 2008-07-08 09:46 · Score: 1

Nobody cares about the semantic web.
Of course, that isn't entirely true, but it is a lot more true than the statement that we all want the semantic web.

--
Nerd rage is the funniest rage.

Re:No PERL API ??!!?? by Rgb465 · 2008-07-08 08:39 · Score: 1

That's OK, we have Storable.

Re:No PERL API ??!!?? by yknott · 2008-07-08 08:40 · Score: 5, Informative

According to Brad Fitzpatrick's (of LiveJournal fame) blog, he's working on Perl support.

XML is a crappy format by Alex+Belits · 2008-07-08 08:42 · Score: 4, Insightful

I always told people that -- it's optimized for:

1. Easy parsing by parsers written by people who slept through their compiler classes.

2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" and "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")

3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.

4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.

None of the above even remotely applies to anything practical except UI/display formats -- this is why XHTML and ODF (and because of that, to some extent, XSL) are usable, SOAP is a load of crap, and for the rest of purposes XML is used as a glorified CSL with angle brackets. XML is widespread because a monumentally stupid standard is still better than no standard.

So here is your example of how superior ANY format that is not based on this stupid idea can be.

--
Contrary to the popular belief, there indeed is no God.

Re:XML is a crappy format by mattcasters · 2008-07-08 09:19 · Score: 1

...
5. Handles codepages very well
6. Supports Unicode
7. Handles codepages very well
8. Supports Unicode
9. Handles codepages very well
and did I mention this one?
10. Supports Unicode

--
News about the Kettle Open Source project: on my blog
Re:XML is a crappy format by Alex+Belits · 2008-07-08 09:32 · Score: 2, Informative

Actually it handles languages EXTREMELY POORLY, because one of the design goals was to make Unicode mandatory. If XML was truly designed for handling multilingual data, every tag would be able to have attributes for language, charset and encoding, and those tags would default to "undefined, treat as opaque", to ensure safe round trip of untagged data from/to other formats.
Now it's impossible to use non-Unicode charsets when using multiple languages in the same program because THE WHOLE FREAKING DOCUMENT (why is it even mandatory to have a "document"? A log file, for example, may contain records that should be readable before the file has reached its final length -- if it is written in XML, it's formally invalid until the last moment, when the logger stops writing to it and closes its last tag, even though that tag has no semantic meaning for a log that is a collection of records and records only) has to have one and only one charset/encoding even though it can contain text in multiple languages. Since most charsets only support one or a few related languages, all implementations of XML-using software ended up hardcoding Unicode as the only supported charset.
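
The log-file complaint is easy to reproduce with a conforming parser. A minimal sketch using Python's standard xml.etree.ElementTree module; the <log>/<rec> element names and the try_parse helper are invented for the illustration:

```python
import xml.etree.ElementTree as ET


def try_parse(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# A log that is still being written: the root element has not been closed yet.
growing_log = "<log><rec>started</rec><rec>listening</rec>"
# The same log after the writer has shut down and emitted the closing tag.
closed_log = growing_log + "</log>"

print(try_parse(growing_log) is None)   # True: rejected as a whole
print(len(try_parse(closed_log)))       # 2: both records become visible
```

Both records are complete in the first string, but a whole-document parse rejects the input until the root element is closed.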

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by afidel · 2008-07-08 09:52 · Score: 1

And then there's those of us using XML and SOAP and WSDL to build real world applications that work just fine. If you're an Oracle shop running one of their financial suites you know just how awesome it is that they are opening everything up as web services and integrating components from all of their product lines. We put together a workflow process for our accounts payable in just a few months that will save us millions over the next couple years. Compare that to the work it would have taken with a traditional fat client and closed protocol stack architecture and it's simply amazing. I'm not so much a fan of XML as I am a fan of openness and XML just happens to be the tool that brought openness to the real world of enterprise applications.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:XML is a crappy format by Alex+Belits · 2008-07-08 11:14 · Score: 2, Informative

You still have all conversion routines built into language support, so all non-Unicode charsets still carry their support code into software. And it would be very easy to switch between charsets -- this happens anyway when you deal with character ranges that are not present in the fonts you use for your output. It all happens behind the scenes anyway.
The problem is, XML developers' Unicode fanaticism threw all this flexibility out of the window on the level of document data and metadata processing, keeping all complexity and sabotaging functionality, just to leave implementors and users no choice but to convert everything to Unicode.
This came at a pretty high price -- "simplified" processing allowed text to be handled as if languages don't matter, so whenever I write a document in both Russian and English I actually get either a document in English with sequences of Cyrillic characters that look like Russian words, or a document in Russian with sequences of Roman characters that look like English words (as opposed to, say, Roman characters in Russian-only text that contains formulas).
Everything that is capable of editing or processing XML will make no attempt to let me choose the languages, so support for language-dependent processing is just as missing as support for charset-dependent processing, it's all Unicode, right? Except, of course, Unicode will do nothing to tell the application to choose the right capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting, etc. -- "simplification" only made sense if it turned language handling into a superficial imitation of itself. If support for multiple charsets was included, it would automatically have to properly process all metadata including languages, and it would be able to use non-Unicode-oriented language-specific routines that existed for decades before Unicode.
So this is what Unicode produced -- it made it possible to write software that looks like it draws all those pretty characters, but in fact does not handle languages in a way that a native speaker would recognize them as such. Good job, indeed.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by Alex+Belits · 2008-07-08 11:24 · Score: 1

For centuries people burned forests to get patches of land that remained suitable for agriculture for a few years before moving to another patch. It worked, too. This does not mean that such barbaric practice was not a result of fundamental ignorance, or that it would be acceptable in a modern society.
Same applies to "technologies" such as XML -- except, of course, primitive agriculture had far more excuses to be used at its time than XML has now.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by mmurphy000 · 2008-07-08 11:31 · Score: 3, Insightful

Y'know, I usually give low-UID Slashdotters a modicum of respect, but this diatribe is off-the-charts nonsense.

1. Easy parsing by parsers written by people who slept through their compiler classes.
And your evidence of this assertion is...what exactly? Not to mention the minor detail that XML and compilers are orthogonal: you can use XML (or many other data interchange formats) with non-compiled languages, and most compilers know nothing about XML (or many other data interchange formats).

2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" and "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")
And your evidence of this assertion is...what exactly? XML-consuming programs that are aware of the data structure can have as detailed a "reaction to a failure" as a JSON-consuming program, or a YAML-consuming program, or a Protocol Buffer-consuming program. XML-consuming programs that are not aware of the data structure can, if the XML supplies it, validate against a DTD or schema, things which are not possible in some other data interchange formats (e.g., JSON, YAML).

3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.
All data comes in neatly packaged buckets of varying types. We call them "bytes" and "packets" and "structures" and "records" and "frames" and "rows" and the like. The only way I can interpret your claim in a way that makes sense is to translate it as "XML sucks for streaming audio and video", which is undoubtedly true, and I don't think anyone uses it in that arena.

4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.
On the contrary, this is one of XML's primary strengths — handling cases where programs lack the "same knowledge of message semantics".
With most data interchange formats, from CSV to JSON to Protocol Buffers, either you know everything about the data structure you're receiving, or you're screwed. In other words, there is no discoverability and no standardized means of being able to only deal with a portion of the data. This is particularly true for binary formats, like Protocol Buffers — either you know exactly what structure you received so you can parse it, or you're SOL, since it's just a bunch of bytes.
With XML namespaces, it is entirely possible for Program X to publish data that Program Y has no intrinsic knowledge of in its entirety, but might know in part. If Program Y knows how to handle documents containing Dublin Core elements, for example, it can work with just those elements and ignore the rest of the document.
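
As a small illustration of that partial-knowledge case, here is a Python sketch using the standard xml.etree.ElementTree module. The sample document, the urn:example:program-x namespace and the dublin_core_fields helper are hypothetical; only the Dublin Core namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"

# A made-up document from "Program X": two Dublin Core elements mixed with
# an element from a namespace that "Program Y" knows nothing about.
doc = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:x="urn:example:program-x">
  <dc:title>Protocol Buffers overview</dc:title>
  <x:internal-blob>opaque</x:internal-blob>
  <dc:creator>Program X</dc:creator>
</record>"""


def dublin_core_fields(xml_text):
    """Collect only the Dublin Core elements; everything else is ignored."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC + "}"            # ElementTree spells tags as {uri}localname
    fields = {}
    for elem in root.iter():
        if elem.tag.startswith(prefix):
            fields[elem.tag[len(prefix):]] = elem.text
    return fields


print(dublin_core_fields(doc))   # {'title': 'Protocol Buffers overview', 'creator': 'Program X'}
```

The reader never needs a schema for the x: elements; it keys off the namespace URI alone and passes over the rest.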
You're welcome to have any opinion of XML you like. Heck, I even agree that XML tends to be used in places where it's overkill or too verbose. But if you want to convince others that your opinion is the correct one, you'll need to do a better job than this.

--
The Busy Coder's Guide to Android Development
Re:XML is a crappy format by Alex+Belits · 2008-07-08 12:08 · Score: 1

And your evidence of this assertion is...what exactly?
The fact that I didn't sleep through compiler classes, and therefore know that a parser is an extremely simple piece of software. XML is only useful if you want to use an XML parser, and the actual design of the format is at about the same level of quality as Fortran syntax. Likely for the same reason -- Fortran predates widespread knowledge of parsers, so it was made so cumbersome that a person who does not understand what a parser is can still write a parser for it.

Not to mention the minor detail that XML and compilers are orthogonal: you can use XML (or many other data interchange formats) with non-compiled languages, and most compilers know nothing about XML (or many other data interchange formats).
Any procedure that converts structured serialized data into any other format is by definition a parser. Parsers are usually studied in compiler-related classes; however, the same concepts and designs apply to all parsers.

And your evidence of this assertion is...what exactly? XML-consuming programs that are aware of the data structure can have as detailed a "reaction to a failure" as a JSON-consuming program, or a YAML-consuming program, or a Protocol Buffer-consuming program.
And all of them "check" the format, wasting CPU time, memory and cache, then can do nothing but crash (oh, sorry, throw exception for which there is no valid logic to handle) in the impossible case of format being invalid, and doing nothing if the actual data is semantically invalid (because semantic processing is done by a program written by a programmer who knows that it can't verify the data). Validation solves the problem that does not exist, it makes as much sense as accompanying data structures in memory with a CRC -- if it ever does not match, what are you going to do, send a message "Stand by for imminent crash" into the log? It's a completely wrong place for verification unless your application development model is "perma-debugging".

XML-consuming programs that are not aware of the data structure can, if the XML supplies it, validate against a DTD or schema, things which are not possible in some other data interchange formats (e.g., JSON, YAML).
DTD and schema contain no semantic information. Semantics (actual meaning of data for application) is what the application has to process and possibly validate, it's too late to react if things are so screwed up that syntax is invalid, your application would be dead at that point already.

All data comes in neatly packaged buckets of varying types. We call them "bytes" and "packets" and "structures" and "records" and "frames" and "rows" and the like. The only way I can interpret your claim in a way that makes sense is to translate it as "XML sucks for streaming audio and video", which is undoubtedly true, and I don't think anyone uses it in that arena.
XML sucks for streaming data. Most of the data in anything that is actually used for some practical purpose is of a "streaming" kind; the request-response cycle is more often an exception than a rule. It only became popular because it's easy to implement with crappy tools.

On the contrary, this is one of XML's primary strengths -- handling cases where programs lack the "same knowledge of message semantics".
By definition, if you don't know semantics, data is meaningless (get it -- semantics, meaning). All you can do is to resend, filter or display the data, which illustrates my point -- it's good for displaying pretty pictures to the user, bad for doing actual processing work.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by SurturZ · 2008-07-08 13:10 · Score: 1

I think you are missing the point. XML is good where you want to receive data from other systems over which you have no control. So it doesn't matter how good you are as a programmer, and how well you write YOUR program, the issue is that you've got cabbages (or programmers who resemble cabbages) upstream sending you data.
You therefore need to specify the protocol in a format even a cabbage can read and write. XML is great for that. Yes, you need to check each message, which wastes processor cycles, but you can't trust the data you are receiving, so you need to check it anyway. XML is fat, bloated and redundant, but it is the lowest common denominator* and that is why it is so useful.
* Actually it replaced the lowest common denominator - CSV. At least XML has a written standard, unlike CSV.
Re:XML is a crappy format by mmurphy000 · 2008-07-08 13:12 · Score: 3, Interesting
And all of them "check" the format, wasting CPU time, memory and cache, then can do nothing but crash (oh, sorry, throw exception for which there is no valid logic to handle) in the impossible case of format being invalid, and doing nothing if the actual data is semantically invalid (because semantic processing is done by a program written by a programmer who knows that it can't verify the data). Validation solves the problem that does not exist, it makes as much sense as accompanying data structures in memory with a CRC -- if it ever does not match, what are you going to do, send a message "Stand by for imminent crash" into the log? It's a completely wrong place for verification unless your application development model is "perma-debugging".
In the world I live in, data is frequently valid, but not always:
- Data corruption in a communications link (e.g., this series of tubes we're using)
- Data corruption in a storage medium (e.g., hardware hiccup, bit flip due to cosmic ray)
- Version differences between sender and receiver conception of the data format
- Malware that pretends to be a legitimate sender but, instead, sends invalid data
Many of those can be caught by the general-purpose validators that you decry, and that limits the number of validation routines programmers have to deal with. And your complaints re: CPU, memory, and cache place a value on them that may or may not be proper in every context. Or, as my former business partner put it, "in six months' time, computers will be faster and cheaper, but programmers will be neither".

Most of the data in anything that is actually used for some practical purpose is of a "streaming" kind; the request-response cycle is more often an exception than a rule. It only became popular because it's easy to implement with crappy tools.
You obviously have a very different definition of "streaming" than I do, as I'd argue virtually nothing uses streaming, from the days of FORTRAN and COBOL to the present day.

By definition, if you don't know semantics, data is meaningless (get it -- semantics, meaning).
Precisely. Decomposable formats, like XML, allow programs to have semantics for part, but not all, of a data structure. Non-decomposable formats, like C structs, require semantics for all of a data structure. In situations where you know 100% of all use cases for a data structure, non-decomposable formats are fine. If, however, you want to allow for what Jonathan Zittrain refers to as "generativity" (i.e., unanticipated uses for existing technology as a means of advancing said technology), decomposable formats can be a benefit.
Take, for example, ODT vs. classic binary Word documents, which are pretty much just a serialization of a big-ass binary structure as I understand it. I've written programs that parse and generate ODT, or, more precisely, the portions of ODT that I need. Frankly, I don't care what the rest of it is, so long as my generated documents work properly. And I didn't need to refer to the ODT documentation on OASIS or anything to write them, as the XML was sufficiently human-readable that, accompanied with experimentation, I was able to determine how to generate valid ODT. With Word, even if there were OOXML-sized documentation for it, I'd have to hand-roll my own parser for the whole damn format, just to pick out the pieces I need to work with. Now, if I worked for Microsoft on the Word team, I wouldn't have that problem, because I'd already have the parser. However, I, like most people, don't work for Microsoft, and even if Microsoft's parsers were available, they might not fit my environment (e.g., won't run on Linux).
Don't get me wrong, XML definitely gets overused. That's a problem with the uses of XML, not XML itself.
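A minimal sketch of the "semantics for part, but not all" point, using Python's standard ElementTree. The document below is a made-up stand-in, not real ODT (which uses namespaced elements), and the element names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <meta> and <styles> stand in for everything
# a consumer does not understand or care about.
DOC = """<document>
  <meta><generator>SomeEditor 1.0</generator></meta>
  <styles><style name="P1" weight="bold"/></styles>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</document>"""

def paragraphs(xml_text):
    """Return the text of every <p> element, ignoring all other markup."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("p")]

print(paragraphs(DOC))
```

The reader never has to know what `meta` or `styles` mean; a generic parser handles the framing and the program only attaches meaning to the parts it needs.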
--
The Busy Coder's Guide to Android Development
Re:XML is a crappy format by Alex+Belits · 2008-07-08 13:26 · Score: 2, Interesting

I think you are missing the point. XML is good where you want to receive data from other systems over which you have no control. So it doesn't matter how good you are as a programmer, and how well you write YOUR program, the issue is that you've got cabbages (or programmers who resemble cabbages) upstream sending you data.
So XML is good for talking to systems that use XML, and not for actually developing efficient or usable software!
That's my whole point -- its only value is that it's some standard that replaced the situation when no common standard existed. The actual quality of its design is still crap: it's written by the wrong people, derived from the wrong theoretical base, and implemented using the wrong tools and techniques. I am not claiming that it's completely unusable, or that people shouldn't use it for user-oriented applications and interoperability. I claim that the quality of the standard is total shit, and the people who developed it are self-serving, ideologically blinded, dishonest, incompetent hacks, so no wonder that those who actually needed a good data interchange format had to develop something different.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by Alex+Belits · 2008-07-08 13:51 · Score: 1

Data corruption in a communications link (e.g., this series of tubes we're using)
This should never reach the application -- in practice the probability of it is less than the probability of an undetectable hardware failure that would crash the system. Even if it does happen, the program has to perform its own buffer limits management to prevent a crash because of overused resources -- and that can happen much earlier than any format error will be found.

Data corruption in a storage medium (e.g., hardware hiccup, bit flip due to cosmic ray)
Applications should NEVER try to recover from it. There is higher probability that application code or internal data structures are damaged, so more code == more opportunities to crash. ECC hardware handles this already, and if ECC did not detect it, there are higher chances that your application will cause less damage by crashing than by trying to do any error handling while its internal code or data are corrupt.

Version differences between sender and receiver conception of the data format
This should be implemented as a part of the format/protocol, just like each and every protocol that was ever used over the Internet. The receiver should never try to second-guess the sender and vice-versa -- if there is backward compatibility built into the protocol, it should not rely on validation. If there isn't, the session should fail immediately. SMTP does that cleanly. HTTP does that cleanly. Freaking Telnet does that cleanly. There is absolutely no excuse to rely on invalid data to detect version-dependent changes.

Malware that pretends to be a legitimate sender but, instead, sends invalid data
Secure protocol design is a very well-researched area. XML is not designed according to its guidelines in the first place, and validation of invalid data is often a great DoS target in itself.
The mere fact that you mention those issues means that you are unfamiliar with the general principles of network protocol design that were developed over decades of theoretical development and practical use. The same applies to the people who developed XML.

Precisely. Decomposable formats, like XML, allow programs to have semantics for part, but not all, of a data structure. Non-decomposable formats, like C structs, require semantics for all of a data structure. In situations where you know 100% of all use cases for a data structure, non-decomposable formats are fine. If, however, you want to allow for what Jonathan Zittrain refers to as "generativity" (i.e., unanticipated uses for existing technology as a means of advancing said technology), decomposable formats can be a benefit.
No. You don't just abandon the task of handling semantics. YOU CREATE A LANGUAGE to describe semantics. Which is not difficult as long as you keep within an appropriate language model. The problem is, judging by XML designers' ignorance about parsers, language design was far beyond their intellectual capabilities. Which goes back to my point -- XML was developed by incompetent people, for the wrong reasons, and based on assumptions that were known to be incorrect long before this monstrosity was developed.

Take, for example, ODT vs. classic binary Word documents, which are pretty much just a serialization of a big-ass binary structure as I understand it. I've written programs that parse and generate ODT, or, more precisely, the portions of ODT that I need. Frankly, I don't care what the rest of it is, so long as my generated documents work properly. And I didn't need to refer to the ODT documentation on OASIS or anything to write them, as the XML was sufficiently human-readable that, accompanied with experimentation, I was able to determine how to generate valid ODT. With Word, even if there were OOXML-sized documentation for it, I'd have to hand-roll my own parser for the whole damn format, just to pick out the pieces I need to work with. Now, if I worked for Microsoft on the Wo

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by Breakfast+Pants · 2008-07-08 17:34 · Score: 1

Don't be ignorant: http://en.wikipedia.org/wiki/Terra_preta

--
WHO ATE MY BREAKFAST PANTS?
Re:XML is a crappy format by pikine · 2008-07-08 18:23 · Score: 2, Informative

Not to mention the minor detail that XML and compilers are orthogonal: you can use XML (or many other data interchange formats) with non-compiled languages, and most compilers know nothing about XML (or many other data interchange formats).
If you have taken a compiler class, you'd learn about "compiler compilers" which are parser generators. He's just talking about the concept of parsing in general, and that XML is for people who don't understand how to write parsers.
I don't agree with everything he says, but I think you need to know some context to understand that it's not off-the-chart nonsense.

--
I once had a signature.
Re:XML is a crappy format by Alex+Belits · 2008-07-08 23:21 · Score: 1

Really?
1. What is the actual rate of undetected corrupt (not lost -- those are reliably detected) packets that reach TCP layer, and how it compares with undetected error rate of those computers' RAM? We are not just dealing with extremely small, even though nonzero probabilities -- it's a multiplication of very small probabilities.
2. If you really have a computer so reliable that you have to worry about magically checksum-matching TCP packet corruption, a syntax check is the wrong tool for the job -- those errors are very unlikely to show up in XML as invalid syntax, so XML validation is still completely pointless for them. You need either a tunnel/VPN that will detect an error when a packet or TCP stream is forwarded to the local host or network, or application-level data validation, which is beyond the scope of XML.
This is, of course, a solution for the paranoid. Its practical usefulness is about at the same level as planning a defense against an army of ninjas attacking the data center -- after all, probability of that is nonzero, too.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by Alex+Belits · 2008-07-08 23:50 · Score: 1

If you want to pretend that Wikipedia is a valid source for learning about ancient agricultural practices, at least try to follow the links to some most relevant concepts mentioned in the article.
Such as http://en.wikipedia.org/wiki/Slash_and_burn

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by shutdown+-p+now · 2008-07-09 05:09 · Score: 2, Informative

The problem is, XML developers' Unicode fanaticism threw all this flexibility out of the window on the level of document data and metadata processing, keeping all complexity and sabotaging functionality, just to leave implementors and users no choice but to convert everything to Unicode.
You might have not noticed, but it's not just XML. Almost everyone has moved to Unicode now, and those who haven't yet (Ruby, PHP) are being mocked for just that, and have the move on the top of their TODO. Learn to live with it already.

This came at a pretty high price -- "simplified" processing allowed to handle text as if languages don't matter, so whenever I write a document in both Russian and English I actually get either document in English with sequences of Cyrillic characters that look like Russian words, or document in Russian with sequences of Roman characters that look like English words (as opposed to, say, Roman characters in Russian-only text that contains formulas).
Everything that is capable of editing or processing XML will make no attempt to let me choose the languages, so support for language-dependent processing is just as missing as support for charset-dependent processing, it's all Unicode, right? Except, of course, Unicode will do nothing to tell the application to choose right capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting, etc. -- "simplification" only made sense if it turned language handling into a superficial imitation of itself. If support for multiple charsets was included, it would automatically have to properly process all metadata including languages, and it would be able to use non-Unicode-oriented language-specific routines that existed for decades before Unicode.
You are extremely confused here. Neither of these: "capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting" - has anything to do with charset or encoding; none whatsoever. This is because encoding has nothing to do with a language. UTF-8 is an encoding which can handle hundreds of languages. Latin-1 is another one that can handle perhaps several dozen. Windows-1251, IIRC, can handle both Russian and Belarusian. It really does not matter. What matters is the language of the text and the associated culture (aka "locale") - trying to infer it from encoding or charset is silly and, in the end, futile.
And, surprise surprise, XML has a standard mechanism to associate content of an element with a specific language - it's xml:lang attribute. So, whenever you write, for example, an XHTML document that contains both English and Russian, all you need is to surround parts of texts in another language with <span> or <div>, and mark them with xml:lang. It's even specifically mentioned in the XHTML spec.
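As a rough illustration of the xml:lang mechanism, here is a sketch using Python's standard ElementTree. The sample fragment and the helper name `text_by_language` are made up for this example. It only shows that language metadata can be recovered per text run; what an application does with it is a separate question.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved "xml:" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<p xml:lang="en">The word <span xml:lang="ru">привет</span> means hello.</p>"""

def text_by_language(elem, inherited=None):
    """Yield (language, text) pairs; xml:lang is inherited from ancestors."""
    lang = elem.get(XML_LANG, inherited)
    if elem.text and elem.text.strip():
        yield lang, elem.text.strip()
    for child in elem:
        yield from text_by_language(child, lang)
        # Tail text belongs to the parent, so it uses the parent's language.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()

runs = list(text_by_language(ET.fromstring(DOC)))
for lang, text in runs:
    print(lang, text)
```

Each run could then be handed to a language-specific routine (spellchecker, hyphenation, sorting) chosen by the language tag rather than by the encoding.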
Re:XML is a crappy format by SurturZ · 2008-07-09 13:55 · Score: 1

That's my whole point -- its only value is that it's some standard that replaced the situation when no common standard existed. The actual quality of its design is still crap: it's written by the wrong people, derived from the wrong theoretical base, and implemented using the wrong tools and techniques. I am not claiming that it's completely unusable, or that people shouldn't use it for user-oriented applications and interoperability.
In terms of 'quality of design', the optimised features are readability and understandability. In that, it is very good. Everything else can obviously be better done in another way, using a binary format or whatever. The fact that it can be validated by non-programmers is an important advantage of XML. If you are a customer with two vendors, each who claim the other's system is at fault, being able to look at the XML passing between them is a huge advantage. If it is a binary format, and you are not a computer programmer, then you are unable to referee the dispute.

I claim that the quality of the standard is total shit, and people who developed it are self-serving, ideologically blinded, dishonest, incompetent hacks.
Well I agree with you here. XML is useful DESPITE the standard, not because of it :-)
Re:XML is a crappy format by Alex+Belits · 2008-07-09 18:05 · Score: 1

I don't think anyone is arguing that XML is well designed or fast. But until some other format offers you the ability to express validation rules in as nuanced a way as XSD and manages to get the format standardized to the point where everyone can be expected to understand it, it's all we have.
Validation is a worthless waste of time. It will not help you if you have received a message with invalid data and valid format, and it will not help you to develop a procedure that can only produce valid format -- you can only test finite number of generated messages/documents, and therefore will never be sure that your system will not fail.
You don't need validation, you need a mechanism that produces two matching procedures -- one that is guaranteed to generate a valid format, another that is guaranteed to parse it. As long as those procedures support the same standard (which may have to include support for forward and backward compatibility -- "format" may be pretty complex), are resistant to deliberately invalid data (so they will produce meaningless output or an error but won't crash or overuse resources), and can be used on all platforms and languages that may need to handle this format, you can rely on the system built with them.
Please note that I consider meaningless output in response to invalid data to be a perfectly acceptable outcome because application that can be exposed to potentially malicious or corrupt data must support some trust/security model and consistency checks that will prevent damage when invalid data arrives regardless of that data being properly or improperly formatted. All protocol and format have to do is not to contain their own vulnerabilities. It does not matter how compatibility is achieved -- procedures may be generated from meta-language, they may be written manually and proven to be correct, or they may call some library that parses data according to meta-language. The point is, there should be no theoretical possibility of producing mismatched data producer and consumer. You don't write something, then run it and check if you were lucky, and it miraculously happened to produce a format that passes validation. You make something YOU CAN PROVE to be incapable of producing invalid data or mis-parsing valid one.
XML standard developers did not understand those most fundamental things that have to be considered when someone writes software that produces an output in some language.
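One way to read the "two matching procedures" idea is sketched below: both the producer and the consumer are derived from a single field description, so they cannot drift apart. This is only an illustration with a made-up schema and a fixed-size binary layout via Python's `struct` module. It is not a proof of correctness and not how any particular format does it.

```python
import struct

# Hypothetical message description; the single source of truth for both sides.
SCHEMA = [("id", "I"), ("temperature", "h"), ("flags", "B")]

def make_codec(schema):
    """Build a matching (encode, decode) pair from one field description."""
    names = [name for name, _ in schema]
    # "!" selects network byte order with no padding.
    packer = struct.Struct("!" + "".join(fmt for _, fmt in schema))

    def encode(record):
        # struct.error is raised if a value does not fit its declared type.
        return packer.pack(*(record[name] for name in names))

    def decode(data):
        if len(data) != packer.size:
            raise ValueError("wrong message length")
        return dict(zip(names, packer.unpack(data)))

    return encode, decode

encode, decode = make_codec(SCHEMA)
msg = {"id": 7, "temperature": -40, "flags": 3}
print(decode(encode(msg)) == msg)
```

Garbage input of the right length decodes to meaningless numbers rather than crashing, which matches the "meaningless output is acceptable" position above. Checking whether the values make sense is left to the application.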

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by cduffy · 2008-07-11 06:00 · Score: 1

I think this is a silly subtopic in a discussion of XML, but just to jump in here...

What is the actual rate of undetected corrupt (not lost -- those are reliably detected) packets that reach TCP layer, and how it compares with undetected error rate of those computers' RAM? We are not just dealing with extremely small, even though nonzero probabilities -- it's a multiplication of very small probabilities.
Depends on where your failures are, but sometimes it's quite high; I recall a time within the last ten years I had to redownload an ISO from the exact same server several times to get the file without the md5sum on the transferred file coming out wrong -- and on comparison between the attempts it turned out to be a single byte here or there that was flipped. Adding an extra layer of data integrity protection (IPsec) resolved the issue (and I had no further issues after that point), but data corruption making it past the TCP-layer checksum really can and does happen in practice.
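For what it's worth, the TCP checksum is a 16-bit ones' complement sum, which is easy to fool. The sketch below implements that sum (in the style of RFC 1071) and shows a constructed corruption it cannot see but MD5 can. It is an illustration of the checksum's weakness, not a reconstruction of the failure described above.

```python
import hashlib

def internet_checksum(data):
    """16-bit ones' complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold carries back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

original = b"ABCDEFGH"
corrupted = b"CDABEFGH"  # first two 16-bit words swapped

same_checksum = internet_checksum(original) == internet_checksum(corrupted)
same_md5 = hashlib.md5(original).digest() == hashlib.md5(corrupted).digest()
print(same_checksum, same_md5)
```

Because addition is commutative, any reordering of aligned 16-bit words passes the checksum, which is one reason an end-to-end hash such as md5sum catches errors the transport layer misses.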
Re:XML is a crappy format by Alex+Belits · 2008-07-13 22:44 · Score: 1

You might have not noticed, but it's not just XML. Almost everyone has moved to Unicode now, and those who haven't yet (Ruby, PHP) are being mocked for just that, and have the move on the top of their TODO. Learn to live with it already.
Actually no. I know because I am Russian.

You are extremely confused here. Neither of these: "capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting" - has anything to do with charset or encoding; none whatsoever. This is because encoding has nothing to do with a language. UTF-8 is an encoding which can handle hundreds of languages. Latin-1 is another one that can handle perhaps several dozen. Windows-1251, IIRC, can handle both Russian and Belarusian. It really does not matter. What matters is the language of the text and the associated culture (aka "locale") - trying to infer it from encoding or charset is silly and, in the end, futile.
And, surprise surprise, XML has a standard mechanism to associate content of an element with a specific language - it's xml:lang attribute. So, whenever you write, for example, an XHTML document that contains both English and Russian, all you need is to surround parts of texts in another language with or

, and mark them with xml:lang. It's even specifically mentioned in the XHTML spec.
Apparently you haven't read what I wrote.
XML "allows" to handle languages. The problem is, it does not force implementors to handle them because it allows an easy shortcut. And implementors who knew nothing about language metadata that may be relevant (pretty much all of people who pushed for Unicode in the first place) didn't feel an obligation to support any kind of language-dependent processing. If the standard forced them to handle multiple charsets, they would not be able to create this fake language support -- at the moment the application encountered anything beyond their beloved ISO-8859-1 (what would be 99% of "Unicode" documents those people seen), it would have to perform some metadata-dependent text handling -- for both charset and language. Application would have two choices -- never assume anything about a language and pass data transparently, or call charset/language-dependent handling for everything they do that may be dependent on charset and language. Since this wasn't done, at this point not a single piece of software actually supports multiple languages in a way that XML and Unicode-promoters promised. Not OpenOffice.org, not Microsoft Office, not text input in any UI toolkit, not mailreaders, not databases. Every time someone implements a particular language support or common support for multiple languages, it has to be done in a way that breaks the intended model of metadata handling in XML, and imposes "special" rules on Unicode ranges that have nothing to do with Unicode standard.
You may claim that those are all crappy implementations; however, my point is that the standard was specifically designed to allow developers to hide behind pretty pictures that look like text in "foreign" languages -- except really there is no language support whatsoever.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by Alex+Belits · 2008-07-13 22:51 · Score: 1

Oh, great, your "tags" broke the quoting.

You might have not noticed, but it's not just XML. Almost everyone has moved to Unicode now, and those who haven't yet (Ruby, PHP) are being mocked for just that, and have the move on the top of their TODO. Learn to live with it already.
Actually no. I know because I am Russian.

You are extremely confused here. Neither of these: "capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting" - has anything to do with charset or encoding; none whatsoever. This is because encoding has nothing to do with a language. UTF-8 is an encoding which can handle hundreds of languages. Latin-1 is another one that can handle perhaps several dozen. Windows-1251, IIRC, can handle both Russian and Belarusian. It really does not matter. What matters is the language of the text and the associated culture (aka "locale") - trying to infer it from encoding or charset is silly and, in the end, futile.
"Locale" is NEVER about handling of data, only about presentation of it to the user and locally defined elements of user interface. Language in a document has to be always handled in exactly the same way as long as it's the same language, regardless of what user's settings are. There may not even be a "user" or "locale" for non-interactive data processing. People suffered more than enough with "locale" settings in databases mangling their data already.

And, surprise surprise, XML has a standard mechanism to associate content of an element with a specific language - it's xml:lang attribute. So, whenever you write, for example, an XHTML document that contains both English and Russian, all you need is to surround parts of texts in another language with span or div, and mark them with xml:lang. It's even specifically mentioned in the XHTML spec.
Apparently you haven't read what I wrote.
XML "allows" to handle languages. The problem is, it does not force implementors to handle them because it allows an easy shortcut. And implementors who knew nothing about language metadata that may be relevant (pretty much all of people who pushed for Unicode in the first place) didn't feel an obligation to support any kind of language-dependent processing. If the standard forced them to handle multiple charsets, they would not be able to create this fake language support -- at the moment the application encountered anything beyond their beloved ISO-8859-1 (what would be 99% of "Unicode" documents those people seen), it would have to perform some metadata-dependent text handling -- for both charset and language. Application would have two choices -- never assume anything about a language and pass data transparently, or call charset/language-dependent handling for everything they do that may be dependent on charset and language. Since this wasn't done, at this point not a single piece of software actually supports multiple languages in a way that XML and Unicode-promoters promised. Not OpenOffice.org, not Microsoft Office, not text input in any UI toolkit, not mailreaders, not databases. Every time someone implements a particular language support or common support for multiple languages, it has to be done in a way that breaks the intended model of metadata handling in XML, and imposes "special" rules on Unicode ranges that have nothing to do with Unicode standard.
You may claim that those are all crappy implementations; however, my point is that the standard was specifically designed to allow developers to hide behind pretty pictures that look like text in "foreign" languages -- except really there is no language support whatsoever.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is a crappy format by shutdown+-p+now · 2008-07-14 00:55 · Score: 1

Actually no. I know because I am Russian.
So am I. I develop mostly Russian-language software (though we also provide English versions for some products) for a Russian company. We use Unicode throughout, and don't have any trouble with it. Your point was?

Re:Good by drinkypoo · 2008-07-08 08:43 · Score: 1

Microsoft has open-sourced some things upon abandonment. That's better than some companies, even. Companies can be good in some areas, and evil in others, however.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Re:No PERL API ??!!?? by dedazo · 2008-07-08 08:44 · Score: 1

Yeah, and I'd like this for the .NET CLR and Mono as well. I looked at the code and the generators are not that complicated, maybe I'll give it a shot over the weekend. Does Google accept outside contribs for projects like these?

--
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo

Re:No PERL API ??!!?? by maxume · 2008-07-08 08:46 · Score: 1

http://brad.livejournal.com/2387105.html

--
Nerd rage is the funniest rage.

Re:WTF am I missing by jandrese · 2008-07-08 08:47 · Score: 5, Informative

They open sourced the compiler (for C++, Java, and Python) that lets you actually use the data interchange format. If you follow the link you can download the code and start using it today. The code is open source.

--
I read the internet for the articles.

JSON by hey · 2008-07-08 08:49 · Score: 4, Interesting

Looks kinda like JSON to me.

Re:JSON by SuperKendall · 2008-07-08 08:52 · Score: 1

I was kind of wondering the same thing, JSON was created to fill the same need. JSON is more like XML in that it's meant to be human parsable though, which counts for a lot in web use I think.

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Re:JSON by Temporal · 2008-07-08 09:20 · Score: 4, Informative

Structurally Protocol Buffers are similar to JSON, yes. In fact, you could use the classes generated by the Protocol Buffer compiler together with some code that encodes and decodes them in JSON. This is something some Google projects do internally since it's useful for communicating with AJAX apps. Writing a custom encoding that operates on arbitrary protocol buffer classes is actually pretty easy since all protocol message objects have a reflection interface (even in C++).
The advantage of using the protocol buffer format instead of JSON is that it's smaller and faster, but you sacrifice human-readability.
Re:JSON by merreborn · 2008-07-08 09:27 · Score: 1

Looks kinda like JSON to me.
JSON is a text-based serialization format. "Protocol buffer" has a binary format, in addition to a text format. The binary format is where the size & speed benefits come from. Text formats introduce overhead.
It also handles all the schema compliance and schema versioning for you. JSON doesn't do any of that.
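A rough sketch of why versioning is cheap in a tagged binary format: each field carries its number and wire type, so a reader built against an older schema can step over fields it has never heard of. This is a hand-rolled toy covering only two wire types (varint and length-delimited) as described in the published encoding notes. It is not the real library, and the helper names are made up.

```python
def read_varint(buf, pos):
    """Return (value, next_pos) for the base-128 varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_known(buf, known):
    """Decode fields listed in `known` ({number: name}); skip all others."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:                      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                    # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:
            fields[known[number]] = value
    return fields

# id=150, name="Al", plus a field 9 that the old reader knows nothing about.
message = bytes.fromhex("0896011202416c4801")
old_reader = decode_known(message, {1: "id", 2: "name"})
print(old_reader)
```

The old reader still gets `id` and `name` and silently ignores field 9, so a newer sender does not break it.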
Re:JSON by pavon · 2008-07-08 09:57 · Score: 3, Informative

The major difference between this and something like JSON or YAML or even XML is that those formats all include the format information (variable names, nesting, etc) along with the data. This does not.

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
What you are looking at above is the Protocol Format (.proto file) for a single message, which is analogous to an XML schema. No data is stored in that file - the numbers you see are unique ids for the different fields, and they are used in the low-level representation of the data (not all fields have to be included in every instance of a message).
The actual data is serialized using a compact binary format, not ASCII like JSON/YAML/XML which makes it much more efficient both to transfer over a network as well as to parse.
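To make the "compact binary format" concrete, here is a hand-rolled sketch of how the Person message above would be laid out on the wire, going by the published encoding description: each field is a key of `(field_number << 3) | wire_type`, followed by a varint or a length-prefixed byte string. The function names are made up, and only non-negative ids are handled. In practice the generated classes do this for you.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_key(field_number, wire_type):
    """Field key: the number from the .proto file plus a 3-bit wire type."""
    return encode_varint((field_number << 3) | wire_type)

def encode_string_field(field_number, text):
    raw = text.encode("utf-8")
    return encode_key(field_number, 2) + encode_varint(len(raw)) + raw

def encode_person(person_id, name, email=None):
    """Sketch of the Person message: id = 1, name = 2, email = 3."""
    out = encode_key(1, 0) + encode_varint(person_id)
    out += encode_string_field(2, name)
    if email is not None:  # optional field: simply absent when unset
        out += encode_string_field(3, email)
    return out

print(encode_person(150, "Al").hex())  # → 0896011202416c
```

That is 7 bytes for the whole record, with no field names on the wire. The equivalent JSON text `{"id": 150, "name": "Al"}` is 25 characters.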
Re:JSON by 0xABADC0DA · 2008-07-08 10:35 · Score: 5, Interesting

Modify JSON so unquoted attributes are 'type labels' and define the type of an attribute by giving a label or a default value. For instance:

phoneType: { MOBILE: 0, HOME: 1, WORK: 2 }

phoneNumber: { "number": "", "type": phoneType }

person: { "name": "", "id": 0, "email": "", "phone": [ phoneNumber ], }

... now you have pretty much exactly the same message definition as protocol buffers, but in pure JSON. It could also use some convention like "@WORK" for labels/classes so that a normal JSON parser can parse the message definitions. You can write a code generator to make access classes for messages just by walking the json and looking at the types. I don't see that 'required' and 'optional' keywords help much... imo defaults are generally better (even if they are nil). But this could easily be expressed in a json message definition.
It's easy to make a binary JSON format that is fast and also small, so there is little advantage to protocol buffers there. It's also easy and ridiculously fast to compress JSON text using say character-based lzo (Oberhumer).
Maybe somebody can explain, but it doesn't seem like protocol buffers really have much advantages over JSON. It sounds like it is effectively just a binary format for JSON-like data (name-value pairs they say) along with a code generator to access it. The code generator is nice, but this is like a day's work max. Maybe I'm not understanding google's problems, but I'll stick with JSON since it actually is a cross-platform, language neutral data format... and you can always optimize it if actually needed.
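The "walk the json and generate access classes" step might look roughly like this. The sketch assumes the suggested "@label" convention so that a stock JSON parser accepts the definitions, and it leaves out the phoneType enum for brevity. `make_class` and the layout of `DEFS` are made up for illustration.

```python
import json

# Message definitions in plain JSON; "@phoneNumber" is a type label
# (an assumed convention), everything else is a default value.
DEFS = json.loads("""{
  "phoneNumber": {"number": "", "type": 0},
  "person": {"name": "", "id": 0, "email": "", "phone": ["@phoneNumber"]}
}""")

def make_class(name, fields):
    """Generate an access class whose attributes default to the definition."""
    def __init__(self, **values):
        unknown = set(values) - set(fields)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        for key, default in fields.items():
            # A list such as ["@phoneNumber"] names the element type;
            # the attribute itself starts out as an empty list.
            start = [] if isinstance(default, list) else default
            setattr(self, key, values.get(key, start))
    return type(name, (object,), {"__init__": __init__})

classes = {name: make_class(name, fields) for name, fields in DEFS.items()}
Person = classes["person"]
p = Person(name="Ann", id=7)
print(p.name, p.id, repr(p.email), p.phone)
```

This covers the accessor side only. It says nothing about wire size or parse speed, which is where the binary format is claimed to win.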
Re:JSON by Enrico+Pulatzo · 2008-07-09 04:20 · Score: 1

Of course, one must realize that you can't just change JSON. Changes to JSON will suffer the same fate as technologies like E4X. For server-side things this change could happen, but one would lose the benefit of JSON: a single, simple object representation.
Re:JSON by RAMMS+EIN · 2008-07-09 04:34 · Score: 1

``Maybe somebody can explain, but it doesn't seem like protocol buffers really have much advantages over JSON. It sounds like it is effectively just a binary format for JSON-like data''
That's your advantage right there. Without having done benchmarks to back up my statements, I dare claim that JSON is more verbose and slower to parse than Protocol Buffers.

--
Please correct me if I got my facts wrong.

Re:WTF am I missing by Chyeld · 2008-07-08 08:51 · Score: 5, Insightful

Seems like you are missing the code they released that allows you to implement this in a number of languages from the 'get-go'.

You've also missed that they've just told the world how the majority of their systems talk, something most people would find interesting given how much Google does and the fact that one of Google's strong points is mangling huge amounts of data in a relatively quick manner.

PS. Your format stinks and is horribly slow and unscalable when it comes to adding to the library. Genres are so unbelievably grey-defined that you might as well just sort them by the dominant color of the cover. Google would have done better.

Have they ever heard of BER/DER? by ugen · 2008-07-08 08:51 · Score: 2, Insightful

How is this either implementationally or conceptually different from BER/DER encoding (commonly used and available all over the place)?

Looks to me like it is exactly the same thing, reimplemented. I am sure bearing a mark of Google is nice and all, but they are definitely reinventing the wheel here.

Re:Have they ever heard of BER/DER? by Dan+Berlin · 2008-07-08 08:54 · Score: 1

Have you ever met anyone who worked with ASN.1 and didn't run screaming for the hills?
Re:Have they ever heard of BER/DER? by forsetti · 2008-07-08 09:00 · Score: 1

Yeah - those guys at MIT (Kerberos), UMich (LDAP), and the SSL guys ... not that anyone uses any of those protocols/implementations ...
ASN.1 is the solution ... the problem just hasn't been properly specified yet.

--
10b||~10b -- aah, what a question!
Re:Have they ever heard of BER/DER? by Dan+Berlin · 2008-07-08 09:09 · Score: 3, Funny

Uh, having one of the OpenSSL guys working down the hall, he certainly said he would shoot himself if he had to work with ASN.1 again.
Re:Have they ever heard of BER/DER? by forsetti · 2008-07-08 09:17 · Score: 1

Heh -- I'm not talking about the *implementers* ... just the *protocol designers*. That's why I left out *OpenLDAP* and *OpenSSL*. ;-)
Seriously though - ASN.1 is pretty good for specification, and some of the serializations aren't bad. Fairly compact, flexible ... But don't code it by hand - use an ASN.1 compiler.

--
10b||~10b -- aah, what a question!
Re:Have they ever heard of BER/DER? by hattig · 2008-07-08 10:41 · Score: 1

My first job after college was coding SNMP MIBs for DiffServ within FreeBSD.
ASN.1 wasn't all that bad really. And of course we used code generators that did all the ASN.1 stuff. But it was quite readable, and seemed very powerful.

Wow, they've reinvented FAST!!! by Giant+Electronic+Bra · 2008-07-08 08:52 · Score: 1

lol. Not that FAST is IDENTICAL, but it is essentially just a much more sophisticated implementation of the same basic idea...

--
"Malo periculosam, libertatem quam quietam servitutem." -- Jefferson

Re:Wow, they've reinvented FAST!!! by Dan+Berlin · 2008-07-08 08:56 · Score: 1

I don't think you "get" it. Google open sourced this because they thought it would be cool, not because they think it is an amazingly new idea that nobody has ever done before. It's not like Google hasn't been using this internally for 5 years (Which of course, makes all the JSON comments humorous).
Re:Wow, they've reinvented FAST!!! by Giant+Electronic+Bra · 2008-07-08 09:36 · Score: 1

Oh, I'm not knocking it down especially. Perhaps I'm jaded, I've seen fashions in data representation come and go, etc. Technologically it just isn't a big deal, that's all.
I think there are a few things the XML haters still haven't figured out though. Serialized XML is obviously no prize format efficiency-wise. I don't think anyone ever believed it would be. But the more different formats we have, the more balkanized data becomes. All these various formats don't do a thing about that problem.
XML did 2 things for us, it advanced the cause of separation of logical and physical formats, and it gave us a format which can at least represent the vast majority of data, which at least lets us have SOME common way of interchanging all this lovely 'my wonderful new glossy data format' data with all the systems that use someone else's wonderful glossy data format.
I wouldn't have expected Google to be using FAST since it didn't exist 5 years ago. I just think all the big 'hoorays' are a yawn is all.

--
"Malo periculosam, libertatem quam quietam servitutem." -- Jefferson

Re:WTF am I missing by Anonymous Coward · 2008-07-08 08:53 · Score: 1, Insightful

You are missing that you're an idiot. Cheers.

Re:No PERL API ??!!?? by A+beautiful+mind · 2008-07-08 08:54 · Score: 1

It's called "Perl".

--
It takes a man to suffer ignorance and smile
Be yourself no matter what they say

Re:WTF am I missing by shis-ka-bob · 2008-07-08 08:56 · Score: 1

You open access to the source code of the C++, Java and Python libraries that you use in your internal work.

--
Think global, act loco

XDR? by Rene+S.+Hollan · 2008-07-08 08:56 · Score: 1

I guess that XDR wasn't good enough, then, or ASN.1 (which supports multiple abstract encodings to boot).

XML, as an interchange format?

I suppose one could load source code into memory, and compile it every time, too. Even Java compiles to bytecode.

Bloated formats are fine for human interpretation (I rather like one kind of structure for my config files), or occasional parsing (which is why most of the stuff in /etc is human-readable), for small data sets (I do remember when "the internet" was one big /etc/hosts file), but for interchange? Just 'cause you're big-endian and I'm little-endian?

The trick to making non-human readable formats acceptable is the prevalence of widespread encoding and decoding tools.

Yes, XML is self-describing, at least syntactically (and formally with an XSD), and specific encoding semantics can be tagged, but the same can be achieved with means for type encoding. The big thing with XDR and related formats is that types are implicit -- both ends need to know what is being serialized. For RPC, with well-defined interfaces, this is not a problem, but it does make type-checking a remote service a bit of a challenge.

However, types can be encoded as data, and serialized as well: this happens for variant types naturally. Thus, there is no reason to not have a type-encoding and type-exchange protocol to permit dynamic type-checking. The advantage over self-describing data serializations is that it can be done on an as-required basis, instead of with every damn serialization.
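
To make the implicit-versus-exchanged typing point concrete, here is a minimal sketch using Python's `struct` module as a stand-in for XDR (it is not real XDR, and the layout string and the helper names `describe` and `read_description` are made up for illustration). The first half only works because both ends already agree on the layout; the second half ships the layout once as data so the peer can check or learn it on demand.

```python
import struct

# Implicit typing, XDR-style: both ends must already agree on this layout.
LAYOUT = ">i8s"  # big-endian: 32-bit int, then 8 raw bytes (assumed layout)
wire = struct.pack(LAYOUT, 42, b"John Doe")
print(struct.unpack(LAYOUT, wire))  # (42, b'John Doe')


def describe(layout: str) -> bytes:
    """Serialize the type description itself: one length byte + ASCII text."""
    encoded = layout.encode("ascii")
    return struct.pack(">B", len(encoded)) + encoded


def read_description(blob: bytes) -> str:
    """Recover the layout string from a description produced by describe()."""
    (length,) = struct.unpack_from(">B", blob, 0)
    return blob[1:1 + length].decode("ascii")


# Explicit typing on request: the description travels once, not per message.
handshake = describe(LAYOUT)
peer_layout = read_description(handshake)
print(struct.unpack(peer_layout, wire))  # (42, b'John Doe')
```

The data messages stay as small as before; only the one-off handshake carries type information.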

--
In Liberty, Rene

Re:No PERL API ??!!?? by fbjon · 2008-07-08 08:58 · Score: 1

You... You have GOT to be new here.

--
True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.

Ok, I'll bite... by Dutch+Gun · 2008-07-08 09:03 · Score: 5, Interesting

Obviously, those at Google felt XML didn't work well for them. They have the resources to invent a protocol and libraries to support it. And, they are big enough to be their own ecosystem, which means as long as everyone at Google is using their formats, interop is no biggie. Good for them, I don't begrudge that decision.

I'm actually a game developer, not a web developer, so I'll speak to XML's use as a file format in general. Here's a few points regarding our use of XML:

* We only use it as a source format for our tools. XML is far too inefficient and verbose to use in the final game - all our XML data is packed into our own proprietary binary data format.
* We also only use it as a meta-data format, not a primary container type. For instance, we store gameplay scripts, audio script, and cinematic meta-data in XML format. We're not foolish enough to store images, sounds, or maps in a highly-verbose, text-based format. XML's value to us is in how well it can glue large pieces of our game together.
* All our latest tools are written in C# and using the .NET platform (Windows is our development platform, of course). It's astoundingly easy to serialize data structures to XML using .NET libraries - just a few lines of code.
* Because it's a text-based format and human readable, if a file breaks in any way, we can just do a diff in source control to see what changed, and why it's breaking.

I'll make a concession that I've heard of some pretty awful uses of XML. But those who dismiss XML as a valuable tool in the toolchest are just as foolish as those who believe it's the end-all and be-all of programming (I'm not saying that's true of you, just pointing out foolishness on both sides). Like any tool, it's most valuable when used in its optimal role, not when shoehorned into projects as a solution to everything.

--
Irony: Agile development has too much inertia to be abandoned now.

Re:Ok, I'll bite... by afidel · 2008-07-08 09:41 · Score: 1

Is your game hackable enough to allow for things like variable changes in the data files? Because if it is, some of your players might be really happy if you could use XML in the data files and simply load those in at run time. I know I'm kind of tired of using game-specific tools to manipulate resources when doing mods. I'm obviously not talking about things like total conversions, but rather things like re-adjusting damage tables or resource levels, etc.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Ok, I'll bite... by JamesP · 2008-07-08 12:21 · Score: 1

But those who dismiss XML as a valuable tool in the toolchest are equally as foolish as those who believe it's the end-all and be-all of programming
I couldn't agree more, and I think you've got the main advantages of XML right there: human-readable (and changeable), ease of use, flexibility and plenty of library support, etc, etc
And between reinventing the wheel for a general purpose thing and using XML, by all means, go with XML!
The main problem, of course, is speed. If this is a config file or something that is written and read not very often or is something that is not very big, it's no biggie, but it will get in the way in case of very big data-sets, parsing and searching, etc

--
how long until /. fixes commenting on Chrome?
Re:Ok, I'll bite... by CaptnMArk · 2008-07-08 20:47 · Score: 1

>We also only use it as a meta-data format, not a primary container type. For instance, we store gameplay scripts, audio script, and cinematic meta-data in XML format. We're not foolish enough to store images, sounds, or maps in a highly-verbose, text-based format.
Nobody is insane enough to do that. Except some SOAP people.
Re:Ok, I'll bite... by Dutch+Gun · 2008-07-09 09:24 · Score: 1

We're developing an MMO, so there are obviously some things that are not exposed to the client at all, such as damage tables or any fundamental gameplay altering data. Technically speaking, our game has the ability to load XML files directly. This way, the iteration process is much quicker than having to export, restart, reload, repeat. This is what we do for development versions of the games, but this would likely not be compiled into the final release binary.
I'm not sure if we're ever going to officially open up our file formats. Generally speaking, the XML is more of a means to an end from our perspective, and there aren't really plans to use them except internally. Still, who knows what the future holds?

--
Irony: Agile development has too much inertia to be abandoned now.
Re:Ok, I'll bite... by mdfst13 · 2008-07-09 15:17 · Score: 1

The main problem, of course, is speed.
And bandwidth. Sure, it doesn't matter much if you're talking about doubling or tripling the size of a one time file, but in the Google case, presumably they are moving gigabytes, if not terabytes or petabytes, of data in an hour. At that level, the bandwidth cost is noticeable. Big .com sites see significant bandwidth improvements by turning on mod_deflate. This is just a better optimization.

Re:No PERL API ??!!?? by KillNateD · 2008-07-08 09:05 · Score: 1

http://brad.livejournal.com/2387105.html

How is this different.. by Ztream · 2008-07-08 09:05 · Score: 1

.. from things like YAML and JSON?

Re:How is this different.. by Temporal · 2008-07-08 09:14 · Score: 1

YAML and JSON are text-based formats intended for human readability. Protocol Buffers are binary, and therefore smaller and faster, but not human-readable.
Also, the protocol buffer compiler provides friendly data access objects. You could actually use these with JSON or YAML, by just writing a new encoder and decoder (which is easy to do).
Re:How is this different.. by zolf13 · 2008-07-08 09:18 · Score: 1

Two new great features making it useable!
1) binary
2) not "human friendly"
Re:How is this different.. by GXTi · 2008-07-08 16:35 · Score: 1

YAML and JSON are text-based formats intended for human readability.
Actually, JSON's human-readability is (or originally was) secondary to the fact that it is JavaScript so you can just paste it into the output of your dynamic web pages to make data available to scripts running on the page. Naturally, JavaScript is a programming language, and programming languages are supposed to be human readable (or at least more human readable than machine code; I can't speak more highly of some languages). This doesn't detract from the fact that they are legible, but it's a different class than XML and YAML which are originally designed to be manageable by mere mortals, particularly YAML.

Re:do you think it makes you look smart? by neokushan · 2008-07-08 09:06 · Score: 1

I think it makes me look an order of magnitude smarter, yes.

--
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill

Re:No PERL API ??!!?? by Goaway · 2008-07-08 09:09 · Score: 1

You're not really going to see the benefits of Perl in one month. It's not a very straightforward language like that.

Re:This isn't new... by natoochtoniket · 2008-07-08 09:10 · Score: 1

Of course it's not new. It not only looks like ASN.1, it actually is very much like ASN.1. But to me it looks more like an extension of rpcgen, because ASN.1 came with a lot of other baggage. Of course, both rpcgen and ASN.1 are just the best known implementations of ideas that were developed far earlier. Shannon's book on information theory explains just this sort of prefix code. These kinds of prefix codes have been in use since the 1960s, and code-generators have been around since the 1970s.

I think the reason that some people at google think it's new is because they are all young. Young people are constantly coming up with "new" ideas that are really two decades or more old. The idea seems new to the young person because he/she has not seen it before. That isn't a jab at google, or at young people. It is just a fact that everything seems new until after you have seen it before.

I have an XML alternative format too. by IGnatius+T+Foobar · 2008-07-08 09:15 · Score: 3, Funny

I have my own data format that is an alternative to XML as well. It works by normalizing the data into records which all contain the same number of fields, and placing an agreed-upon delimiter between each field. The end of the record is indicated by a newline.

I think this "delimited" format has a lot of potential.

--
Tired of FB/Google censorship? Visit UNCENSORED!

C# by 1000101 · 2008-07-08 09:16 · Score: 1

"then compile them to produce classes to represent those structures in the language of your choice"

That's not entirely true, but I digress. Anyway, can someone shed some light on how this is different than binary serialization I've been using to pass C# objects around for quite some time now? It is just a matter of giving the class a Serializable attribute and then using the BinaryFormatter class to serialize the object to a stream. XML serialization is available if needed to pass to non-M$ entities, but binary serialization has been around a while, no?

Re:C# by Temporal · 2008-07-08 09:24 · Score: 1

What happens to your binary serialization if you add a new field to your class? Can you still read serialized objects created by older versions of your software? (Honest question; I don't know how C# serialization works.) Also, can you read your data in other programming languages?
Re:C# by jrumney · 2008-07-08 09:50 · Score: 2, Informative

Can you still read serialized objects created by older versions of your software?
As long as all you have done is added new fields, then you can tag the new fields as OptionalField or NonSerialized to maintain backwards compatibility. The advantage of using Google's library is that it works across languages and runtimes. Java, .NET, PHP and Python all have serialization built in, but they are all incompatible, so you can't use it to pass an object from your Java backend to a C# client then on to Python for some final processing before displaying in a PHP generated webpage.
Re:C# by Westley · 2008-07-08 09:54 · Score: 1

Try deserializing your object in Java or Python later... or easily adding more fields to it and still being able to deserialize old files.
(And if you want C# support, just wait a while. I'll see what I can do - and frankly if I don't do it, I'm sure someone else will.)
Jon

Why rely on a single data encoding system ? by markpapadakis · 2008-07-08 09:20 · Score: 1

Our systems ( application servers, frameworks, data serialization and unserialization facilities etc ) understand/support XML-RPC, binary XML-RPC, JSON and PHP object notation (ref: PHP serialize() / unserialize() functions).

There is a set of primitives ( string, integers, floats, arrays, structures, timestamps, raw data ) which are the only datatypes developers utilize; the serialization process is transparent. That way, depending on the nature and capabilities of the systems involved in a data interchange procedure, the more efficient transport protocol is utilized ( for example JSON when interfacing with an application server from a Javascript application, XML-RPC when talking to remote XML-RPC servers etc ).

It's probably more important to decide upon a list of primitive constructs and always operate on that kind of datatypes as opposed to figuring out the ideal way to store and retrieve them. You can always come up with an even better way to encode that data later on, anyway.
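
A rough sketch of that idea in Python, assuming only the standard library's `json` and `xmlrpc.client` modules (the `ENCODERS`/`DECODERS` tables and the sample record are invented for illustration, not taken from the systems described above): the application only ever touches the primitive values, and the wire encoding is picked per peer.

```python
import json
import xmlrpc.client

# The application works only with primitives: strings, numbers, lists, dicts.
record = {"name": "John Doe", "age": 42, "scores": [1.5, 2.5]}

# Hypothetical codec tables; a real system would pick one per peer.
ENCODERS = {
    "json": lambda value: json.dumps(value),
    "xml-rpc": lambda value: xmlrpc.client.dumps((value,)),
}
DECODERS = {
    "json": json.loads,
    # loads() returns (params, method_name); we sent a single parameter.
    "xml-rpc": lambda text: xmlrpc.client.loads(text)[0][0],
}

for name in ENCODERS:
    wire = ENCODERS[name](record)
    assert DECODERS[name](wire) == record  # same data, whichever encoding
    print(name, len(wire), "characters on the wire")
```

Swapping in a more compact encoding later only means adding one more entry to each table; the code that builds `record` does not change.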

--
Technology ramblings : Simple is Beautiful

Re:Why rely on a single data encoding system ? by Temporal · 2008-07-08 09:28 · Score: 2, Informative

It's worth noting that writing alternative encoders and decoders for protocol buffers is really easy (since protocol message objects have a reflection interface, even in C++), so you can use the friendly generated code without being tied to the format.
Re:Why rely on a single data encoding system ? by Shados · 2008-07-08 12:10 · Score: 1

Being able to swap message format on the fly is really the way to go, since they all have advantages and issues, from speed, backward compatibility, debugging, security constraints, etc.
That's one of the things I liked in messaging frameworks such as WCF (though it's far from being the first to do it). Code to the API, the message protocol can be decided later, and changed/plugged in on the fly when something better comes.

Why XML is good by El+Cabri · 2008-07-08 09:20 · Score: 1

XML is good because it promotes keeping orthogonal, two very different problems : structuring data and encoding it. Tying these two problems creates complications down the road, as entangling independent areas of information processing and algorithm sciences usually does.

Re:Why XML is good by Abcd1234 · 2008-07-08 09:26 · Score: 1

Umm... what? XML and this protocol differ in only one way: one is plain text, the other is binary. Period. In both cases, you have to emit the data in a structured format, and you have to encode it (in one case, it's a binary encoding, in the other, it's a plain-text encoding, but it's still encoded).
Re:Why XML is good by Temporal · 2008-07-08 09:39 · Score: 2, Informative

XML and this protocol differ in only one way: one is plain text, the other is binary.
They also differ in that XML has a *lot* more features. For example, protocol buffers have no concept of entities, or even interleaved text. Those can be useful when your data is a text document with markup -- e.g. HTML -- but they tend to get in the way when you just want to pass around something like a struct.

Binary message formats are good by kriston · 2008-07-08 09:20 · Score: 2, Insightful

Thankfully an alternative to XML.
If you didn't think XML was among the least efficient transport formats then you weren't really paying attention. Battery-conscious mobile devices do not really enjoy parsing an XML DTD and then the XML file itself.
It reminds me a little bit of AOL's SNAC message types.

We get something good for the industry from Google, after a rash of bad press, and it is actually NOT a beta.

--

Kriston

This is a good thing by Rival · 2008-07-08 09:20 · Score: 1

I agree, the "order of magnitude" note sounds more like a media-bite than anything, but here are a few points to consider.

1.) The example they give is for a small set of data, and percentages vary more dramatically as sample sizes decrease.

2.) If your usage generally involves many such small sets of data, the benefits of slight reductions in latency will multiply significantly.

3.) Even if the speed performance is identical to XML, the reduction in data size should not be ignored, especially in large-volume production environments.

4.) This is not a small internal tool Google is releasing -- this is a major component which they have heavily used. It has been real-world tested (albeit at just one company) and proven at ridiculous scales.

5.) They are giving this away. Source, documentation, examples -- the works. I know this isn't driven entirely by altruism, but neither is it an embrace-extend-extinguish maneuver. They just made a tool that meets a specific need better than what was currently available, and then made it available.

Re:This is a good thing by Temporal · 2008-07-08 10:13 · Score: 5, Insightful

The example they give is for a small set of data, and percentages vary more dramatically as sample sizes decrease.
We wanted to give an idea of the speed without trying to boast too much or look like we were directly challenging anyone. Of course every news outlet has chosen to highlight the speed comment -- including the numbers which were intended to be ballpark figures -- more than was intended, but I guess that isn't surprising.
I agree that the tiny "person" example is not a good benchmark case. It was intended as a usage example, not a speed example, but I stuck the speed numbers in there just meaning to give people a vague idea of the difference. The "20-100 times faster" comment is based on testing a variety of formats -- both unrealistic ones and real-life formats used in our search pipeline -- against programmatically generated XML equivalents (which may or may not themselves be realistic, though they contain the same data with the same structure). libxml2 was used for parsing XML. I don't really know how libxml2's speed compares to other XML parsers, but I didn't have a lot of time to investigate. The 20x faster number comes from the largest data set (~100k-ish) while the 100x number comes from a very small message. The most realistic case was about 50x. Sorry that I cannot provide exact details of the benchmark setup since many of the test cases were proprietary internal formats.
In any case, I'm hoping that some independent source conducts some tests because I think anything we produced would probably have unintentional biases in it. Of course, I'll update the numbers in the docs if they turn out to be wildly off-base.
Re:This is a good thing by Rival · 2008-07-08 11:45 · Score: 1

Thank you for the specifics. I'm not surprised that you're seeing such significant performance-enhancements. I'm just glad to see a simple, scalable alternative to the angle tax, and now that I'm back home I'm looking forward to exploring Protocol Buffers. Thanks again!
Re:This is a good thing by martin-boundary · 2008-07-08 12:25 · Score: 1

libxml2 was used for parsing XML.
IIRC (it's been a while), the fastest parser library is still expat. There are various benchmarks on the net, but I wouldn't be surprised if expat is highly competitive speedwise with Protocol Buffers, simply because expat is a SAX parser that goes through the data exactly once with minimal memory allocation or data conversions. This sort of thing is always an order of magnitude faster than doing DOM conversion, which is what many XML libraries, including libxml2, do.
Re:This is a good thing by Temporal · 2008-07-08 12:47 · Score: 2, Informative

Note that protocol buffers give you the equivalent of a DOM -- an object representing the parsed message. This is usually much more convenient to use than SAX parsing (depending on your use case, of course). So, I'm not sure if comparing against SAX is necessarily fair. Though I think protocol buffers would still win just because there is less to parse and parsing length-delimited chunks is faster than character-delimited.
Re:This is a good thing by martin-boundary · 2008-07-08 13:33 · Score: 1

I don't see why it would be an unfair comparison. Certainly, SAX is only an efficient approach to reading an XML datastream, which is a small part of what many XML libraries, and protocol buffers, can do. In particular, expat is limited to reading XML only, although admittedly writing XML from scratch is often an easier task.
Yet building a DOM is an up-front cost which can only be amortized if the application truly needs two or more passes through the data. That's an algorithmic question orthogonal to the actual method used for data input, so it's fair to optimize the reading method a posteriori.
Are your XML test files available on the web, or were they just throwaway experiments?
Re:This is a good thing by cryptoluddite · 2008-07-08 13:53 · Score: 1

Note that protocol buffers give you the equivalent of a DOM -- an object representing the parsed message.
Actually no, they don't. If the parser has a different .proto file then it knows nothing about extra fields in the message, since the .proto info is not sent with the message (if it were, there would be no space savings over binary JSON for instance). This is not the equivalent of a DOM or a JSON object.

Though I think protocol buffers would still win just because there is less to parse and parsing length-delimited chunks is faster than character-delimited.
In most cases, trivially faster. The difference between 5aword and aword\0 mostly depends on how memory is allocated and how strings are referenced. For instance, in C in some cases the character-delimited string could be used in-place (with no copy or memory alloc) whereas the length-delimited one cannot (without first reimplementing all the string functions).
.. or did you not mean NULL when you said character-delimited ??
Re:This is a good thing by L7_ · 2008-07-08 16:13 · Score: 1

SAX parsing is more akin to length-delimiting than character-delimiting -- the parser can skip over lots of unused or non-important text based on the registered callbacks. IMHO, you can't necessarily use DOM parsing when you can populate a class object with a set of SAX parsing events, which seems what the protocol buffers serialization is doing.
Re:This is a good thing by Temporal · 2008-07-08 18:47 · Score: 1

Sounds like you tested _DOM_ (tree-building) xml processing, to in-place binary data extraction
I compared DOM parsing with protocol buffer parsing. They are equivalent. Protobuf parsing constructs a single message object representing the entire parsed message, and it currently does not do "in-place binary data extraction", although I've been thinking of adding that in a future version.
SAX would be comparable to using a protobuf CodedInputStream and reading fields from it manually. It would be faster but a lot less convenient to use in most cases.
Note that for a fair comparison with SAX, you would have to actually construct an object representing the document based on the SAX parsing events, not just use noops for all the callbacks.
Re:This is a good thing by vidarh · 2008-07-09 10:16 · Score: 1

No, it isn't, since it needs to check each character to see if it's the start/end of whatever lexical element it is currently processing.
It means that, once the data is in memory, the minimal parsing cost of a SAX parser is a compare and conditional branch per character, while a length-delimited protocol has a minimal parsing cost that can approach a single memory read (in the extreme case of a single length identifier for a field that can be skipped).
Whether that overhead matters to you greatly depends on the application - for most normal usage the IO latency and context switches will tend to be more expensive than the difference between character delimiting and length delimiting, all else being equal. But Google's applications aren't typical due to sheer scale, and all else is rarely equal.
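
The cost difference can be sketched in a few lines of Python. Both functions and the one-byte length prefix are assumptions made for illustration, not any real parser's API; each returns the position after the field together with how many bytes it had to look at to skip it.

```python
def skip_delimited(buf: bytes, start: int, delimiter: int = 0) -> tuple[int, int]:
    """Skip a character-delimited field; return (next_pos, bytes_examined)."""
    pos = start
    while buf[pos] != delimiter:  # one compare-and-branch per byte
        pos += 1
    return pos + 1, pos - start + 1


def skip_length_prefixed(buf: bytes, start: int) -> tuple[int, int]:
    """Skip a field stored as a one-byte length followed by the payload."""
    length = buf[start]  # a single read tells us where the field ends
    return start + 1 + length, 1


payload = b"x" * 100
print(skip_delimited(payload + b"\0", 0))               # (101, 101)
print(skip_length_prefixed(bytes([100]) + payload, 0))  # (101, 1)
```

Both calls land on the same position, but the delimited scan touches every byte of the field while the length-prefixed skip touches only the prefix.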

Re:Good by kriston · 2008-07-08 09:24 · Score: 1

You might call Microsoft the Rails-To-Trails Conservancy of the software industry. Use it until it outlasts its usefulness and then release it to the rest of the world for no charge.

Sorta.

--

Kriston

The BetaNews article is horrible by cakoose · 2008-07-08 09:26 · Score: 1

Man, the BetaNews article is horrible. Practically everything — except for the direct quotes from the Google blog post — is incorrect. I somehow expect more from someone who goes by "Scott M. Fulton, III".

Google's public documentation shows Protocol Buffers (which has yet to be formally abbreviated) is indeed conceptually different from XML, in that it's rooted more in procedural logic than structural declaration. In XML, there's a schema which defines the structures of tables and recordsets, which is separate from the document that relates the contents of records in that structure.

Nope, they're conceptually the same. The ".proto" files are like DTD or XSD. The actual document data is stored in a binary format (though there's also a text representation). The data manipulation API is similar to what you get from Castor or JAX-B.

But here, in an unusual departure from the norm, the default values for these members are set to digits (for strings or literals) or values (for numerals) that define their place in a sequence -- where they fall within a record. Imagine if data were streamed onto recording tape, the way it used to be in the late 1960s and '70s. It's that streaming of the data sequence, without all the fenceposts, that differentiates XML from Protocol Buffers, by taking out all those markups that say when an entry or a record starts and stops.

The "= number" at the end of a field definition is not a "default value". It is a numeric tag that identifies that field. That said, "= number" is quite unintuitive syntax; maybe something like "@number" would have been less confusing.

Looking at some of the documentation, I don't think the aforementioned numbers directly index the field's location in the record. They lay down the present fields one after another, probably putting each field's tag number before the field data. This also allows them to avoid sending fields that use the default value. So they still need to specify how long each record is — either with "fenceposts" between records or a "length" specifier before each record.
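
As a sketch of the layout guessed at above (a deliberately simplified one, not Google's actual wire format): assume each present field is written as one byte of tag number, one byte of length, then the payload, and that the reader already knows where the record ends. The `DEFAULTS` table and field numbers are hypothetical.

```python
def decode_fields(data: bytes) -> dict[int, bytes]:
    """Walk a buffer of (tag byte, length byte, payload) entries."""
    fields = {}
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated field header")
        number, length = data[pos], data[pos + 1]
        pos += 2
        if pos + length > len(data):
            raise ValueError("truncated field payload")
        fields[number] = data[pos:pos + length]
        pos += length
    return fields


# Hypothetical schema defaults, keyed by tag number.
DEFAULTS = {1: b"", 2: b"", 3: b"unknown"}

# Only fields 1 and 2 are on the wire; field 3 is simply not sent.
record = bytes([1, 8]) + b"John Doe" + bytes([2, 16]) + b"jdoe@example.com"
decoded = {**DEFAULTS, **decode_fields(record)}
print(decoded[1])  # b'John Doe'
print(decoded[3])  # b'unknown'
```

Because every field carries its own tag, order and position in the record do not matter, and a field left at its default costs zero bytes.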

Re:The BetaNews article is horrible by hattig · 2008-07-08 10:58 · Score: 1

Yeah, I came to the same conclusion as you. Poor article, but a brief glance at the tutorials on the Google site shows what each thing is. Anyone here that's written a binary network protocol (or binary file format) will have come across the same issues at some point. Variable length fields (strings, arrays of primitive types) need lengths first, and you need to tag what data is what if you have optional fields or repeated fields. Parsing is easy - O(n) once through the binary data. Of course you can do the same with an XML string if you want a speedy parser. Google's just replaced a string tag with a numeric tag. So it's nothing special. But it's nice they've done it. It won't replace SOAP or REST or XML of course.

Re:faster than XML ?? by Bryansix · 2008-07-08 09:29 · Score: 1

Granted that a format can't have a speed and neither can an Internet connection. An Internet connection can have an amount of bandwidth but things still go the same speed. The end result though is more data gets there FASTER. In the same vein if you mix in some data with a bunch of garbage then it will take longer to see the data. This is the point. This is why XML is slow. Stop with the semantics and see the forest for the trees.

Re:No PERL API ??!!?? by mpeg4codec · 2008-07-08 09:32 · Score: 5, Insightful

Perl is to programming languages what English is to natural languages: easy to fool around with, hard to learn well, but when you do, the expressive power is incredible. And when you mess it up, nobody understands what you're trying to say.

Old-School Property lists? by menace3society · 2008-07-08 09:35 · Score: 3, Insightful

The similarity between these things and NeXT's Property Lists (now called "Old-School Property Lists" now that Apple/NeXT has standardized on XML) is incredible. Some things are changed, like having a specification instead of just assuming that the recipient will parse it and figure it out, but the likeness is there. I wonder if any of the proto people at Google had experience with plists, or if it's just a case of convergent design.

Everything old-school is new-school again, I guess.

Re:Old-School Property lists? by GXTi · 2008-07-08 16:45 · Score: 1

This problem has been solved probably thousands of different ways, all subtly different. The odds of any particular solution resembling another solution by sheer chance are rather high.

In the company I used to work... by cyfer2000 · 2008-07-08 09:46 · Score: 1

In the document, Google showed one case of XML:

<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
</person>

However, at the company where I used to work, we required such a file to be written something like:

<person name="John Doe" email="jdoe@example.com"/>

In the Google protocol buffer format it will be:

person {
  name = "John Doe"
  email = "jdoe@example.com"
}

I failed to see why the protocol buffer format is much smaller and faster.

--
There is a spark in every single flame bait point.

Re:In the company I used to work... by Temporal · 2008-07-08 10:31 · Score: 2, Informative

This is 49 bytes: <person name="John Doe" email="jdoe@example.com">
The equivalent Protocol Buffer is 28 bytes. In addition to the 24 bytes of text, each field has a 1-byte tag and a 1-byte length. The example you quoted is protocol buffer *text* format, which is used mostly for debugging, not for actual interchange.
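
A quick way to check the 28-byte figure is to build the bytes by hand. This is a rough Python sketch, not the official library: the helper names are made up, and the field numbers 1 and 2 are an assumption since the .proto file was not quoted in this thread. It relies only on the documented wire layout for strings (a tag holding the field number and wire type 2, a varint length, then the UTF-8 bytes).

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number, text):
    """Tag, then length, then UTF-8 payload (wire type 2, length-delimited)."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Field numbers 1 and 2 are assumed for illustration.
person = (encode_string_field(1, "John Doe")
          + encode_string_field(2, "jdoe@example.com"))
xml = '<person name="John Doe" email="jdoe@example.com">'
print(len(person), len(xml))  # prints: 28 49
```

The tag stays at one byte only for field numbers up to 15; larger field numbers need a second tag byte.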
Re:In the company I used to work... by hattig · 2008-07-08 11:03 · Score: 1

That's the prototype description. The binary would be like:

18John Doe216jdoe@example.com

1 = tag 1 (byte)
8 = length of string (int)
2 = tag 2 (byte)
16 = length of string (int)
total size = 34 bytes

Of course, a TCP packet can hold a lot more than that anyway, so it's a poor example of saving bandwidth. The speed supposedly comes from creation and parsing of these things compared to even optimised XML.
Re:In the company I used to work... by nuttycom · 2008-07-10 08:35 · Score: 1

XML tag attributes are generally horribly misused, as demonstrated by your example above. Tag attributes are supposed to contain metadata about the contents of the tag, not the data itself. Granted, a name and email address may be considered to be metadata in some cases, but I think you'd have to stretch pretty far to find them.
It seems like people generally use tag attributes for data instead of metadata because they're less verbose to type. Which, of course, is one of the reasons why XML sucks.

ASN.1 encoded with BER/DER just needs tools by Animats · 2008-07-08 09:51 · Score: 3, Informative

ASN.1, from 1985, really is very similar. Here's a message defined in ASN.1 form:

Order ::= SEQUENCE {
    header Order-header,
    items SEQUENCE OF Order-line }

Order-header ::= SEQUENCE {
    number Order-number,
    date Date,
    client Client,
    payment Payment-method }

Order-number ::= NumericString (SIZE (12))

Date ::= NumericString (SIZE (8)) -- MMDDYYYY

Client ::= SEQUENCE {
    name PrintableString (SIZE (1..20)),
    street PrintableString (SIZE (1..50)) OPTIONAL,
    postcode NumericString (SIZE (5)),
    town PrintableString (SIZE (1..30)),
    country PrintableString (SIZE (1..20)) DEFAULT default-country }

default-country PrintableString ::= "France"

Payment-method ::= CHOICE {
    check NumericString (SIZE (15)),
    credit-card Credit-card,
    cash NULL }

Credit-card ::= SEQUENCE {
    type Card-type,
    number NumericString (SIZE (20)),
    expiry-date NumericString (SIZE (6)) -- MMYYYY -- }

Card-type ::= ENUMERATED {
    cb(0), visa(1), eurocard(2), diners(3), american-express(4) }

Note that this has almost exactly the same feature set as Google's representation. There are named, typed fields which can be optional or repeated. It just looks more like Pascal, while Google's syntax looks more like C.

Re:ASN.1 encoded with BER/DER just needs tools by SnowZero · 2008-07-08 23:17 · Score: 1

Without tag numbers how do you guarantee forward and backward compatibility between readers and writers?

So XML Was Dominant for What, say 6 years? by littlewink · 2008-07-08 09:51 · Score: 1

All that work! How sad I am that we must reschedule the Web Services Choreography Working Group to consider to study XML's replacement, Protocol Buffers.

Re:No PERL API ??!!?? by pr0nbot · 2008-07-08 09:54 · Score: 1

I looked at some of their binary interchange format. It looks like a valid Perl program to me! *rimshot*

Still waiting for Perl to make use of the Euro key as an operator...

Re:Between a rock and hard place by goombah99 · 2008-07-08 09:55 · Score: 1

Exactly my feeling, I'm so tired of seeing XML used in places YAML is perfect for.

The linked article does not really hit the nail on the head on what's so great about Protocol Buffers or why it should be faster. In an article linked from the linked article, there's a better explanation:

Instead, we developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple "get" and "set" methods, and once you're ready, serializing the whole thing to -- or parsing it from -- a byte array or an I/O stream just takes a single method call.

So as I read that, the methods of accessing a table are not generic to the database but are individually optimized to the data itself. That is, the accessors know the structure rather than having to discover it from the markup. Presumably the code that rides around with the objects is free to contain its own metadata, caches and pre-parsing of the records for optimization. Yet from the outside it's just a bunch of get-set methods to provide uniform encapsulation.

My guess is the meta-data all totalled is less than all the wasted space in the XML fenceposts, plus by encapsulating they are free to compress the actual data when it makes sense.

Anyhow, to all you XML folks: stop picking up the XML crescent wrench and trying to use it as a hammer. Reach for the YAML.

--
Some drink at the fountain of knowledge. Others just gargle.

All the world is not a VAX^W^WWindows. by argent · 2008-07-08 10:02 · Score: 2, Informative

Anyway, can someone shed some light on how this is different than binary serialization I've been using to pass C# objects around for quite some time now?

It's portable and language-independent?

Re:Between a rock and hard place by goombah99 · 2008-07-08 10:02 · Score: 1

YAML

--
Some drink at the fountain of knowledge. Others just gargle.

Re:No PERL API ??!!?? by thanasakis · 2008-07-08 10:09 · Score: 1

Whoa, Mr. AC there, that was extremely helpful.

BTW, it's Perl, not PERL.

Pray tell us, why should I heed someone's opinion on the language when he can't even spell its name correctly?

Re:Between a rock and hard place by metamatic · 2008-07-08 10:16 · Score: 3, Insightful

Funny, I'm tired of seeing YAML in places where XML would work fine.

Like serializing my Ruby objects, for example. When I don't care about performance, XML is best, because almost everything else will read and write it, including my text editor, and I know the syntax. When I *do* care about performance, I'm not going to use YAML either.

I don't see the niche YAML fits, frankly.

--
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak

XML is not a 'format'! by r3g3x · 2008-07-08 10:52 · Score: 3, Insightful

XML is crappy format

That statement underlines most people's myopic vision of the XML family of technologies. XML is not a format; it is a family of technologies based around a common grammar.

XML is not a bucket.
It is not a passive container for data.
It is a transformable semantic graph.

The heart and soul of XML is XSLT; it serves as a common 'glue' that allows transformation between the various standardized 'languages': XML, XHTML, XSLT, XSL-FO, SVG, RDF, RSS, etc...

Example: the same XML document (let's say it represents rows in a database) can be transformed into a web page, pdf file, visual graph, rss feed, directed graph, or [insert non-XML text based output of choice]. More importantly, the transformation can take place on the client side of a transaction, effectively decoupling content and representation.

That being said, I completely agree that XML is over-kill for simple fixed message passing. But, then again simple fixed format message passing isn't what XML was really designed for :-) XML was designed for situations where the representation needs of the client are unknown and/or dynamic.

--
If you don't know XSLT you don't know XML

Re:XML is not a 'format'! by Alex+Belits · 2008-07-08 11:19 · Score: 1

XML is absolutely definitely a format -- eXtensible Markup Language.
The facts that this is not an actual language but an umbrella format for multiple incompatible and non-discoverable XML-based formats, and that it's so crappy that a whole industry has grown out of building support for a single type of data structure (a tree, not even the arbitrary graph that real-world data is) represented in it, only emphasize what a massive engineering failure it is.

--
Contrary to the popular belief, there indeed is no God.
Re:XML is not a 'format'! by r3g3x · 2008-07-08 15:49 · Score: 3, Insightful

XML is absolutely definitely a format -- eXtensible Markup Language.
XML is a system of grammar that is used to create defined formats.

You can't use XML to mark up data. You have to use a defined grammar to create a format. You might say that this is an issue of semantics but that is the point. If your only use/understanding of XML is as a static data format then you're doing it [XML/XSLT/..] wrong.

XML is a crappy tool for static storage. If the data is being read/written by the same program there are faster/simpler ways to encode that data. But that isn't what XML is meant for. To repeat my previous post: XML documents are abstracted semantic models that are designed to be transformed and dynamically interpreted.

Here is a link to an example of how XML/XSLT can be used to extend and enhance an existing XML based web service [Generating RSS with XSLT and Amazon ECS]. This is a perfect example of the agnostic client scenario that XML was designed for (i.e. the service couldn't care less how the data is represented or transformed).
Re:XML is not a 'format'! by Alex+Belits · 2008-07-08 23:43 · Score: 2, Interesting

XML is a system of grammar that is used to create defined formats.
...made for people who slept through compiler courses.

You can't use XML to mark up data. You have to use a defined grammar to create a format. You might say that this is an issue of semantics but that is the point. If your only use/understanding of XML is as a static data format then you're doing it [XML/XSLT/..] wrong.
No, you can't "create" a format with XML. To "create" anything but the most trivial formats you have to provide a definition of both syntax and semantics. XML provides ridiculously complex, stupidly designed means to define a syntax, and absolutely nothing to define semantics, so you still have to either document it or, more likely, provide an implementation.
Guess what? The syntax is such a microscopic part of your task that the amount of work you have just put into your reference implementation of semantics is multiple orders of magnitude higher than whatever you "saved" by not implementing a syntax parser from scratch, let alone implementing it using any of the tools that existed long before XML was introduced. The problem is, people who "learn XML" never learn how dead simple parsing in general is, so they use those "frameworks" and "tools" to save what otherwise would be literally seconds of their mental work.
I am not against further simplifying tasks that are already simple if it serves any valid purpose. The problem with XML is that it does not really simplify anything; it provides a ridiculously amateurish solution for a common easy problem without even the slightest attempt to help with the truly complex part of the work.

XML is a crappy tool for static storage. If the data is being read/written by the same program there are faster/simpler ways to encode that data. But that isn't what XML is meant for. To repeat my previous post: XML documents are abstracted semantic models that are designed to be transformed and dynamically interpreted.
Words "XML", "abstract" and "semantic" do not belong in the same phrase -- XML is developed at the level of a second-year CS student who managed to completely miss what "abstract" and "semantics" mean. It's not "abstract", it's artificial and irrelevant. The only value of XML is the fact that it's some standard, however this does not change the fact that it's nearly the worst possible solution for any imaginable problem.

Here is a link to an example of how XML/XSLT can be used to extend and enhance an existing XML based web service [Generating RSS with XSLT and Amazon ECS]. This is a perfect example of the agnostic client scenario that XML was designed for (i.e. the service couldn't care less how the data is represented or transformed).
Have you read anything I wrote? XML is useful for interoperability with things that already use XML, and for making representation of pretty pictures/UI. This has nothing to do with the fact that it's crap, and that we all would be better off with a standard created by someone competent. For the values of "competent" as in "anyone who actually studied CS".

--
Contrary to the popular belief, there indeed is no God.
Re:XML is not a 'format'! by r3g3x · 2008-07-09 07:40 · Score: 1

XML is a system of grammar that is used to create defined formats.
...made for people who slept through compiler courses.
For the record, there is nothing I like more than writing custom parsers; it's hella fun! That being said, I have things to do other than watch Matlock and write parsers for no good reason (it's like knitting for the PDP-11 crowd).

... XML provides ridiculously complex, stupidly designed means to define a syntax, and absolutely nothing to define semantics, so you still have to either document it or, more likely, provide an implementation.
Your argument about implementation is a red herring. If by semantics you are referring to application-level semantics then yes, you have to provide an implementation. Show me a file-oriented data storage method that automatically handles application level semantics. You can't, because application level semantics are always handled by the implementation. If you write a custom format/parser you can deal with these issues during the parsing phase, but it doesn't remove the fact that this is an implementation level feature.

Guess what? The syntax is such a microscopic part of your task that the amount of work you have just put into your reference implementation of semantics is multiple orders of magnitude higher than whatever you "saved" by not implementing a syntax parser from scratch, let alone implementing it using any of the tools that existed long before XML was introduced. The problem is, people who "learn XML" never learn how dead simple parsing in general is, so they use those "frameworks" and "tools" to save what otherwise would be literally seconds of their mental work.
So your suggestion is that I should develop custom file formats and parsers for every single application I write?!?!? That sounds great! I think I'll do everything from now on in custom packed binary! Who needs architecture independence?! Who needs the ability to share or re-use data easily!?! Who needs the ability to read their data outside of the framework it was developed in!?! And what if the specs change? I get to throw out all my existing data and start over again... or get the honor of writing a custom conversion script!!! I'm glad you've made me see the light!

I am not against further simplifying tasks that are already simple if it serves any valid purpose. The problem with XML is that it does not really simplify anything; it provides a ridiculously amateurish solution for a common easy problem without even the slightest attempt to help with the truly complex part of the work.
Yes you are right; 1) preparing an ad-hoc representation of my document, 2) automatically generating a schema from it, 3) being able to read/validate instances of the document in 3 lines of code is horribly complex!! God forbid the only thing I have to do to get data into/out of my program is check the correctness of the data. God forbid that I don't have to write/debug/maintain a custom parser/format because I'm using one based on standards!

Words "XML", "abstract" and "semantic" do not belong in the same phrase -- XML is developed at the level of a second-year CS student who managed to completely miss what "abstract" and "semantics" mean. It's not "abstract", it's artificial and irrelevant. The only value of XML is the fact that it's some standard, however this does not change the fact that it's nearly the worst possible solution for any imaginable problem.
You fail to understand the big picture and the larger set of interactions the XML family is meant to address (see the W3C Semantic Web Activity).

Have you read anything I wrote? [...]
Unfortunately... I think it's time you had your glass of warm milk and went to bed, you need to be up to greet the mailman at 6am and complain about the adverts that keep showing up in your mailbox.
Re:XML is not a 'format'! by Alex+Belits · 2008-07-09 23:38 · Score: 1

For the record, there is nothing I like more than writing custom parsers; it's hella fun! That being said, I have things to do other than watch Matlock and write parsers for no good reason (it's like knitting for the PDP-11 crowd).
Parsers are generated by automated tools. To be precise, XML parsers are among the few exceptions that are not generated.

Your argument about implementation is a red herring. If by semantics you are referring to application-level semantics then yes, you have to provide an implementation. Show me a file-oriented data storage method that automatically handles application level semantics.
Any modern relational database has minimal semantics support -- it lets you define constraints, references and indexes that reflect application semantics, as well as table structure that reflects syntax. Some databases (e.g. Oracle, PostgreSQL) have embedded procedural languages that can handle almost arbitrary semantics of data.
Of course, being relational databases, they only support limited syntax (records and tables in their various representations), and definitions of structure and operation are implemented as a primitive, hard to use language (SQL) however principles of their design can be applied to more complex data structures as well. It's really a shame when ancient monstrosity such as SQL can be given as an example of superior design compared to supposedly modern and advanced infrastructure built around XML.

You can't because application level semantics are always handled by the implementation. If you write a custom format/parser you can deal with these issues during the parsing phase, but it doesn't remove the fact that this is an implementation level feature.
That depends entirely on the languages involved. All parsers pass their data to semantics handling routines in some form. Nothing prevents a parser from having an interface to semantics handling defined in some language (the same as or different from the language of the application) that implements all semantics-related parts of the standard. Then this semantics definition can be included as a part of the standard, as opposed to being a part of an implementation that only has to work with one piece of software, and programmers will have to keep it as defined in the standard (though an implementation may transform it into a form that is compatible with the application's language and interface) to have their implementations compliant.
For example, if the message contains a tree data structure that is meant as an update to the data defined as a graph (what is a common application of XML now), standard has to include a definition of all operations that may have to be done when it arrives, including lookup and update, consistency check, etc. The actual implementation that will perform those things in application may be generated from it, and may be extended to perform more data handling, however if standard says that it updates a graph, it has to define how exactly. Then there won't be "BUT WE DIDN'T REALLY MEAN THIS!" and "WHO WOULD THINK THAT SUCH A THING MAY HAPPEN?" from format developers every time it is discovered that standard is ambiguous, impossible, or standard document contradicts with standard originator's own implementation.
That would be worthy of being called "infrastructure".

--
Contrary to the popular belief, there indeed is no God.
Re:XML is not a 'format'! by r3g3x · 2008-07-10 11:06 · Score: 1

[...] Show me a file-oriented data storage method that automatically handles application level semantics.

Any modern relational database has minimal semantics support [...]
I should have been clearer and said 'non-database'... But yes, you are correct: a formal relational database is a great way to store static information, and it can automatically enforce its schema and basic semantic integrity. Anyone who uses XML as a replacement for a database is using too much of it. Like I've stressed before, XML is a poor format for static data storage. If the data isn't going to be transformed, aggregated, filtered, or translated at some time in the future, then XML may not be the best choice for storage. To state it differently: XML is meant to be used for transportation vs. storage of data.

For example, if the message contains a tree data structure that is meant as an update to the data defined as a graph (what is a common application of XML now), standard has to include a definition of all operations that may have to be done when it arrives, including lookup and update, consistency check, etc. The actual implementation that will perform those things in application may be generated from it, and may be extended to perform more data handling [...]
Essentially what you are describing is IDL with a touch of COM. This approach is fine when you have control over your deployment environment and software ecosystem, as is the situation at Google. They [Google] are not advocating Protocol Buffers as a replacement for XML. Anyone who got that impression didn't RTFA. They had a well defined problem and found an effective solution for a process that traditionally might have been done with XML. They are making this technology available in the hopes that others may find it useful in solving their own well defined data transport problems. They are not advocating it as a de facto competitor to XML. Protocol Buffers is a framework for creating new binary protocols. To quote the article (emphasis mine):

And now, we're making Protocol Buffers available to the Open Source community. We have seen how effective a solution they can be to certain tasks, and wanted more people to be able to take advantage of and build on this work.
Unfortunately the internet is not well defined and homogeneous. The degradation abilities of XML allow disparate clients and processes to interact without having an explicit contract. It means a well designed document can be extended without breaking older clients, and processes that were never explicitly designed to talk to each other still can.
This is a big shift in program design (specifically what constitutes a 'program'). Instead of monolithic code bases you have distributed servlets and transformation processes. A 'program' can be an abstracted service or it could be the description of the processing and filtering chain combining the resources of third party data sources and services.
If you are writing components that only talk to each other and couldn't care less about interoperability or openness(*) then use whatever fits your needs. Nobody is forcing you to use XML (I don't care for evangelism of any kind). But don't poo-poo a technology because it doesn't fit your needs or your style of programming. If your gripe is with people doing stupid things with XML, more power to you (but please be clear).
* Yes, a published IDL and API libraries are open... One convenient thing about XML is that all you need are a parsing lib and network libs and you can transact with any XML service; you don't have to install/compile/maintain a new library for every service (vs. every service having its own unique protocol). Yes, this does mean that XML trends to a lowest common denominator solution, but to allow the creation of laissez-faire processes it has to be. Accommodation for the lowest common denominator is the whole point of degrad
Re:XML is not a 'format'! by Alex+Belits · 2008-07-10 13:59 · Score: 1

I should have been clearer and said 'non-database'... But yes, you are correct: a formal relational database is a great way to store static information, and it can automatically enforce its schema and basic semantic integrity. Anyone who uses XML as a replacement for a database is using too much of it. Like I've stressed before, XML is a poor format for static data storage. If the data isn't going to be transformed, aggregated, filtered, or translated at some time in the future, then XML may not be the best choice for storage. To state it differently: XML is meant to be used for transportation vs. storage of data.
Databases are tied to a single data model, and their access protocols are not designed for efficient bulk data transfer. The major justification for XML (and all its possible replacements) was support for data that is not limited to a relational data model.

Essentially what you are describing is IDL with a touch of COM.
IDL is primitive, and COM ties general semantics to particular application's implementation and does not translate beyond it without a manual reimplementation. Neither counts as a formal semantics definition.

This is a big shift in program design (specifically what constitutes a 'program'). Instead of monolithic code bases you have distributed servlets and transformation processes. A 'program' can be an abstracted service or it could be the description of the processing and filtering chain combining the resources of third party data sources and services.
Commingling of data format and consistency support with implementation-specific logic is a design deficiency, not a requirement for data consistency support.

It's called documentation...
Documentation is informal -- it requires a human to interpret and implement, and it almost never gives unambiguous definition for all possible behaviors that can be expected from a system. If you look at any documentation that accompanies XML-based formats, it's a huge mass of words that can mean pretty much anything, and this is the root of all incompatibility.
To make a system reliable, formats have to be defined formally, and all behavior has to be reflected in an unambiguous way. Then even if behavior is not what a programmer wants or expects, he can be sure what it actually is.

--
Contrary to the popular belief, there indeed is no God.

need something different by speedtux · 2008-07-08 11:11 · Score: 4, Insightful

If Google had tried to build their system on relational databases, XDR, and NFS, they would have spent huge amounts of money and spent lots of time trying to shoehorn their software into those constraints. And it's not just Google that did this: Amazon did the same thing, with their SimpleDB, S3, and SQS.

The actual mistakes were relational databases, XML, and distributed POSIX file systems; all of those were systems designed by people with too much time on their hands and no real-world, large scale problems to solve. Finally, those mistakes are getting corrected, at least when it comes to high-end computing. At the low end, I suppose people will continue to tinker around with those toys.

Re:WTF am I missing by Fastolfe · 2008-07-08 11:22 · Score: 1

Hi, take a look at http://code.google.com/p/protobuf/ and http://code.google.com/apis/protocolbuffers/docs/reference/overview.html for details about what it is that's being offered. It's not the format per se that's being released, it's the software that allows you to use it in your own applications.

Re:faster than XML ?? by CoughDropAddict · 2008-07-08 12:00 · Score: 1

Really? In that case, I am defining a format for specifying series of integers. Here's how it works: for every integer, you find the corresponding prime (eg. for the number 5 you find the 5th prime). Then for every pair of numbers, you multiply them together, and emit the product into the output file. To parse, all you have to do is find each product and factor it into its two primes.

According to you, my format cannot have a speed. It is a format; it has no speed. So please write me a parser that parses my format into the original integers and is comparably fast to other formats (on a byte-by-byte basis).
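
To make the point concrete, here is a rough Python sketch of that format. The function names are made up and the prime search is deliberately naive. One assumption worth flagging: multiplication is commutative, so the decoder cannot tell which number of a pair came first and always returns the smaller one first.

```python
import math


def nth_prime(n):
    """Return the n-th prime (1-based) using trial division."""
    if n < 1:
        raise ValueError("n must be >= 1")
    count, candidate = 0, 1
    while count < n:
        candidate += 1
        if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            count += 1
    return candidate


def prime_index(p):
    """Inverse of nth_prime: return n such that nth_prime(n) == p."""
    n = 1
    while nth_prime(n) != p:
        n += 1
    return n


def encode(numbers):
    """Pair up consecutive integers and emit one product of primes per pair."""
    if len(numbers) % 2:
        raise ValueError("the format only stores whole pairs")
    return [nth_prime(a) * nth_prime(b)
            for a, b in zip(numbers[::2], numbers[1::2])]


def decode(products):
    """Factor each product back into its two primes, then look up their indexes."""
    numbers = []
    for product in products:
        p = 2
        while product % p:  # smallest divisor > 1 is the smaller prime
            p += 1
        numbers += [prime_index(p), prime_index(product // p)]
    return numbers
```

Encoding is cheap, but decoding has to factor every value, so parse time grows with the size of the numbers rather than with the number of bytes read.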

More XML? EXI, Efficient Xml Interchange! by refactored · 2008-07-08 12:12 · Score: 3, Informative

http://www.w3.org/XML/EXI/

The development of the Efficient XML Interchange (EXI) format was guided by five design principles, namely, the format had to be general, minimal, efficient, flexible, and interoperable. The format satisfies these prerequisites, achieving generality, flexibility, and performance while at the same time keeping complexity in check.
Many of the concepts employed by the EXI format are applicable to the encoding of arbitrary languages that can be described by a grammar. Even though EXI utilizes schema information to improve compactness and processing efficiency, it does not depend on accurate, complete or current schemas to work.

Re:No PERL API ??!!?? by Onan · 2008-07-08 12:15 · Score: 2, Informative

Uh, no. Google officially deems perl unmaintainable, and its internal use is completely verboten.

You're quite welcome to write your own if you want it, but it's not something we'd ever use ourselves.

This looks good. by ChrisA90278 · 2008-07-08 12:21 · Score: 1

I think this is actually the fastest, most compact way possible to encode information. It all depends on how good the compilers are. What they've done is replace a generalized system like XML with custom written code that works only for the specific messages that are being passed. Nothing could be faster. The objection has always been that writing the custom code is hard. They have solved that issue.

Lets actually compare by cryptoluddite · 2008-07-08 13:19 · Score: 2, Informative

The big difference is that a protocol buffer cannot be understood without the message format (.proto file). Now let's actually take a look at a real list, like say the developers for apache (as a list of {name:,email:} objects):

protobuf: ~1654 bytes

json: 1915 bytes

protobuf.lzop: ~744 bytes

json.lzop: 809 bytes

What you see is precious little difference in the size of the data even though the json is self-describing. The lzop version is essentially identically sized, and compressing and decompressing with lzo is wicked fast. So size is not a reason to use proto buffers.

Maybe speed is? Instead of using lzo compression just create a JSON binary format. This is trivial, and provides essentially the same size and speed benefits as protocol buffers while still being JSON in nature.

The only advantage to protocol buffers then is that they generate access and verify classes for you in your favorite language (if that language is C++, Java, or Python). Big deal, again this is absolutely trivial.

To me what this demonstrates is premature optimization. Instead, first use a simple text format like JSON then if that is too large compress it. Then if that is too slow send it in binary.

Note: I approximated the size of the proto buffers based on the descriptions of the binary format since I haven't downloaded the code (it actually compresses less well since I did not vary the 'length' bytes in my test file).

Re:Lets actually compare by Temporal · 2008-07-08 13:35 · Score: 1

Protocol Buffers essentially is a binary format for JSON. This wasn't an actual design goal, but I don't know that anything would be done significantly differently if it were. Note that you can trivially write a JSON encoder/decoder that works with arbitrary protocol message classes, using protobuf reflection.

To me what this demonstrates is premature optimization. Instead, first use a simple text format like JSON then if that is too large compress it. Then if that is too slow send it in binary.
The optimization was not "premature". We actually do need the speed and space.
It may very well be that most users don't need the speed, but switching formats down the road is pretty hard. It's not exactly like optimizing implementation details -- this is the format you use to communicate with other entities that you may or may not control.
Re:Lets actually compare by cryptoluddite · 2008-07-08 15:20 · Score: 1

The optimization was not "premature". We actually do need the speed and space.
And I'm sure decoding varints is really fast right (a branch on every byte, some shifts and xors)? And 25% of the available wire type codes being 'deprecated' is really space efficient? Or using ten bytes for any negative number unless the coder specifically declares an optimization?

It may very well be that most users don't need the speed, but switching formats down the road is pretty hard. It's not exactly like optimizing implementation details -- this is the format you use to communicate with other entities that you may or may not control.
What I like about protocol buffers is that the fields are identified by number. This makes it possible for different components to use different names to represent the same data format (for example for fixing spelling errors). But I think in most cases, in a general sense (I'm not talking google specifically), the downside of not being able to make any sense out of the data without a .proto description outweighs this.
And a data format like protocol buffers is essentially a custom compression algorithm, like for instance x86 instructions are. Unfortunately, it is really hard to make a custom compression that holds up over time. For instance, the two deprecated wire types could be used to optimize the first 32 fields rather than only the first 16 but instead they are just wasted.
I'm not trying to knock protocol buffers. It's a decidedly good effort. But the hackery involved in making it that slight bit faster and smaller probably isn't going to be worthwhile in the long run.
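The varint points above are easier to judge with the encoding in front of you. This is a minimal Python sketch of base-128 varints, the ZigZag mapping used by the signed "sint" types, and the field key layout, written from the public encoding description and not taken from Google's code. The helper names are invented for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def encode_varint(n: int) -> bytes:
    """Base-128 varint; negatives are sent as 64-bit two's complement."""
    n &= MASK64
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, next_position). Note the test on every byte."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def zigzag64(n: int) -> int:
    """Map signed to unsigned so small negatives stay small: -1 -> 1, 1 -> 2."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def make_key(field_number: int, wire_type: int) -> int:
    """A field key is the field number shifted past a 3-bit wire type."""
    return (field_number << 3) | wire_type


plain_minus_one = encode_varint(-1)             # plain int64 encoding of -1
zigzag_minus_one = encode_varint(zigzag64(-1))  # sint64-style encoding of -1
key_15 = encode_varint(make_key(15, 0))
key_16 = encode_varint(make_key(16, 0))
print(len(plain_minus_one), len(zigzag_minus_one), len(key_15), len(key_16))
```

Because the wire type takes three bits of the key, only field numbers 1 through 15 fit in a single key byte, and a negative value in a plain int field costs the full ten bytes unless the ZigZag variant is declared.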

Re:Between a rock and hard place by goombah99 · 2008-07-08 13:51 · Score: 1

XML still wins for typing by hand, I reckon. Tags are easier to type than holding down the spacebar or trying to get your editor to expand tabs to spaces for YAML files but not for every other file in your project.

This issue was solved by Barney Rubble about the time Jesus was still riding his dinosaur.

Come join us in the 21st century.

--
Some drink at the fountain of knowledge. Others just gargle.

Re:This isn't new... by bipbop · 2008-07-08 14:17 · Score: 1

Yeah, this sort of thing goes on at lots of places, as you imply, not just Google. I think there's also a "cool" factor--when you work at a cool company where half the employees are drinking the "we're all geniuses" kool-aid like Google (and there are lots of others, I don't mean to single out Google either), you may know about existing alternatives and decide you can do better, and the time investment to build and maintain that technology is justified.

Tagged "notinventedhere", as NIH syndrome is the name I first heard for this anti-pattern.

Re:Good by stickystyle · 2008-07-08 14:22 · Score: 1

Here are 160 FOSS projects Google released: http://code.google.com/hosting/projects.html. Heck, even Google's new big thing AppEngine is OSS.

That's a pretty good amount of code. What else do you want them to release? Their search engine? Gmail?

--
Pluralitas non est ponenda sine neccesitate

Re:Good by stickystyle · 2008-07-08 14:24 · Score: 1

Oh, and let us not forget http://code.google.com/android/

--
Pluralitas non est ponenda sine neccesitate

YAIDL by Coward+Anonymous · 2008-07-08 14:53 · Score: 1

Yet Another Interface Definition Language...
What's wrong with XDR?
3/4 of the companies I've worked for had some engineer who had to roll his own RPC format with matching IDL for some "technical" reason that had no basis in reality. They were all pretty crummy implementations. All you hot shot engineers, please, just stop re-inventing the wheel.

Hey! XML wasn't intended to be fast. by WoTG · 2008-07-08 15:44 · Score: 1

It's easy to say XML is slow... no one ever planned it to be fast! The reason for XML's existence is to be human readable (especially by people who are used to reading HTML). That's it. People expecting it to be fast are using the wrong tool for the job.

HP Basic by PPH · 2008-07-08 16:50 · Score: 1

BDAT format lives!

--
Have gnu, will travel.

Re:This isn't new... by pikine · 2008-07-08 18:10 · Score: 1

The idea is not new, but the fact that Protocol Buffers takes a more C-like syntax as opposed to ASN.1 (more Pascal- or Fortran-like) appeals to software developers of this generation, who started learning programming with C or Java. Besides, Protocol Buffers has great integration with C++, Java, and Python. When it comes to data serialization formats, it's really the implementation that counts rather than the idea, and they have a nice implementation.

--
I once had a signature.

Pity they did not use Corba IDL + IIOP by Anonymous Coward · 2008-07-08 20:12 · Score: 1, Interesting

OK, Corba IDL and IIOP have some quirks, but they work very well. There are excellent Open Source implementations like JacORB, TAO or IIOP.NET that interoperate very well with each other or J2EE. Google could have been compatible with all this instead of going their own way.

ASN.1 not mentioned by metaconcept · 2008-07-08 20:16 · Score: 1

It's obvious that the developers were familiar with the flaws of older protocols, and found ways to fix most of them.

Maybe, but no mention is made of ASN.1, which to me suggests a lack of historical awareness. I would have appreciated a comparison. A comment from Kenton Varda, attached to the announcement blog post, reads in part:

Sorry, I personally am not very familiar with ASN.1 DER. A brief look at some documentation suggests to me that it is more complicated than Protocol Buffers, which can be good or bad depending on whether you need that complication.

Re:ASN.1 not mentioned by SnowZero · 2008-07-08 22:13 · Score: 1

Kenton headed up the open source rewrite, not the original version. The original authors were aware of ASN.1. PBs have backward *and* forward compatibility, a compact wire-format, and try very hard to stay simple. Most other encodings achieve only two out of the three.
Re:ASN.1 not mentioned by vrmlguy · 2008-07-08 23:44 · Score: 1

ASN.1 doesn't define the wire format of your data; in fact there are at least half-a-dozen accepted encodings (including XML!). Protocol Buffers, otoh, does specify the wire format. This means that you can store your encoded data and not worry about future tools being able to process it; specifically, Protocol Buffers allows the creation of tools that don't understand the data being processed.
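To make the "tools that don't understand the data" point concrete, here is a small Python sketch that walks the fields of an encoded message with no .proto at all. It relies only on the wire types described in the public encoding docs (0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit) and deliberately rejects the deprecated group types. The function names and the sample bytes are made up for the example.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def walk_fields(buf: bytes):
    """Yield (field_number, wire_type, raw_value) without knowing the schema."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit, kept as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, sub-message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit, kept as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# Field 1 = varint 150, field 2 = the two bytes "hi".
message = bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69])
fields = list(walk_fields(message))
print(fields)
```

A tool like this can copy, filter, or skip fields it has never heard of, which is also what lets an old reader pass over fields added by a newer writer. What it cannot do is tell you what field 2 means, or whether those bytes are a string or a nested message.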

--
Nothing for 6-digit uids?

Re:Between a rock and hard place by Filip22012005 · 2008-07-08 20:32 · Score: 1

Plus, writing a decent XML parser is easy!

--
When the policeman of the tie, rule you violate, hello punishment of the kitty?

pre-processing and talking about space!!!!! speed! by nikanth · 2008-07-08 20:49 · Score: 1

If it is pre-processed to produce C++ code, Java code, etc... it should be possible to do in XML also without affecting size and speed. It is time to come up with an "xml to protocol buffers" (de)converter.

Re:I felt a great disturbance in Teh FOSS... by hostyle · 2008-07-08 21:27 · Score: 1

Hush! If you stay really really quiet you may just be lucky enough to spot another spelltard in its native environment.

--
Caesar si viveret, ad remum dareris.

Re:Between a rock and hard place by SnowZero · 2008-07-08 22:17 · Score: 1

Well now XML is wedged between YAML on the low end (e.g. config files, human readable data, ad hoc files) and ProBuff on the high end (massive structured data bases).

Individual PBs are meant to be fairly small and decode into memory in a single class/struct. The scale issue is mainly about having billions of these messages, and not wanting to overpay for storage in a less efficient format, or network bandwidth for moving them around.

Re:This isn't new... by statusbar · 2008-07-08 22:24 · Score: 1

Do you know of any decent open source ASN.1 code generators to compare with these google protobufs?

--jeffk++

--
ipv6 is my vpn

Re:No PERL API ??!!?? by SnowZero · 2008-07-08 22:50 · Score: 1

Yeah, and I'd like this for the .NET CLR and Mono as well. I looked at the code and the generators are not that complicated, maybe I'll give it a shot over the weekend.

Some people have already mentioned that on the Google Group, so it's probably a good idea to go there and compare ideas / combine efforts.

Does Google accept outside contribs for projects like these?

Yes, but I think the guy running it is encouraging separate projects that are then pushed upstream. Internally, this is very mature software, so the release cycle won't be as fast as some people will want, at least while it ramps up.

Re:I felt a great disturbance in Teh FOSS... by indifferent+children · 2008-07-09 00:16 · Score: 1

Parse error: I'm sorry, but your comment should have begun with "Crikey", not "Hush".

--
Censorship is telling a man he can't have a steak just because a baby can't chew it. --Mark Twain

RFC1925 by huge · 2008-07-09 00:20 · Score: 1

But, but .. is it compliant with sections 2.1 and 2.12 of RFC1925?

--
-- Reality checks don't bounce.

Elevator Statement by somethingwicked · 2008-07-09 00:24 · Score: 2, Funny

Google elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more."

Christ, I hope I'm never in an elevator with someone who would consider THAT an elevator statement.

--

---"What did I say that sounded like 'Tell me about your day?'"---

Common sense and Unix-Style by gweihir · 2008-07-09 01:35 · Score: 1

Not surprising either. I have done similar things several times. Still, it is nice to find this augmented with a reasonable binary encoding and cross-platform (to some degree) library support.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:No PERL API ??!!?? by AbbyNormal · 2008-07-09 01:58 · Score: 1

That being said, it is still a very nice scripting language to do data parsing and integration on the fly.

--
Sig it.

Oh the ironing... by ilitirit · 2008-07-09 04:38 · Score: 1

I always find it amusing when people use the Web to sulk about the faults of XML.

Anyway, this is what Google has to say:
However, protocol buffers are not always a better solution than XML - for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also - to some extent - self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

Re:No PERL API ??!!?? by Tetsujin · 2008-07-09 04:39 · Score: 1

Uh, no. Google officially deems perl unmaintainable, and its internal use is completely verboten.

That is so cool...

--
Bow-ties are cool.

Has Anyone Done A Comparison? by RAMMS+EIN · 2008-07-09 04:48 · Score: 1

The way Protocol Buffers encode data is similar to how I have designed protocols in the past. But, of course, there are various ways to encode data. Some examples:

- Encode your data in human-readable form, using separators. E.g. s-expressions, CSV, XML, JSON.
- Encode your data using type, length, value tuples. E.g. some encodings in ASN.1.
- Encode your data using type, value (and length, if necessary) tuples. E.g. some encodings of ASN.1.

Then there are choices as to adding indices to quickly jump to interesting parts of the data or skip over uninteresting parts, mechanisms for forward compatibility, etc.

I wonder if anyone has done a comparison of various techniques that can be used in data interchange formats and their impacts on message size, parsing performance, and other interesting quantities.
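As a concrete reference point for the type, length, value option in the list above, here is a minimal Python sketch. The layout (1-byte type, 2-byte big-endian length) and the type codes are arbitrary choices for illustration and do not match any particular ASN.1 encoding.

```python
import struct

HEADER = ">BH"  # 1-byte type code, 2-byte big-endian length
HEADER_SIZE = struct.calcsize(HEADER)


def tlv_encode(items) -> bytes:
    """Serialize (type_code, value_bytes) pairs as type, length, value records."""
    out = bytearray()
    for type_code, value in items:
        out += struct.pack(HEADER, type_code, len(value)) + value
    return bytes(out)


def tlv_decode(buf: bytes):
    """Parse records back out; unknown types can be skipped using the length."""
    items = []
    pos = 0
    while pos < len(buf):
        type_code, length = struct.unpack_from(HEADER, buf, pos)
        pos += HEADER_SIZE
        items.append((type_code, buf[pos:pos + length]))
        pos += length
    return items


# Hypothetical type codes: 1 = name, 2 = email.
records = [(1, b"Alice"), (2, b"alice@example.org")]
encoded = tlv_encode(records)
decoded = tlv_decode(encoded)
print(len(encoded), decoded)
```

The explicit length is what buys forward compatibility and cheap skipping, at a fixed cost of three bytes per record here. Schemes that drop the length for fixed-size types, or shrink it with a variable-length integer, trade some of that simplicity for smaller messages.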

--
Please correct me if I got my facts wrong.

"Human Readability" is a fallacy by Tetsujin · 2008-07-09 04:50 · Score: 1

Binary encoding, non-hierarchy-based string lists,
and simple file serialization are all faster than XML.
XML was created for flexibility, commonality and human readability, not speed. XSL, XQuery, and XPath along with the DOM or SAX supply out-of-the-box query, transformation, and manipulation capability.

The thing about "human readability" is that, just like any other binary file format (ASCII text is a binary encoding too, remember) it is not intrinsically human-readable, rather it relies upon a proper set of tools to make it human-readable.

The counter-argument here is that, while that's true enough, just about every tool in the world can read ASCII files, right? From Blender to Emacs to a simple paginator like "more"...

But except in simple cases that's not sufficient to actually work with the data. In XML for instance one would ideally like a structural representation of the data, the ability to hide a block at a time to streamline the display, etc. Or if editing you'd at least want simple validation features, maybe the ability to match opening tags with closing ones, etc... In theory you can work the data over in any text editor but in practice you would use something more specialized.

If the same specialization is made available for a compact binary format, then it'll be every bit as "human-readable" as an ASCII-encoded one.

--
Bow-ties are cool.

Re:No PERL API ??!!?? by mpeg4codec · 2008-07-09 04:57 · Score: 1

If anything, I was trying to praise Perl. Guess the mods around here just have a twisted sense of humor.

Re:Between a rock and hard place by shutdown+-p+now · 2008-07-09 05:12 · Score: 1

Anyhow to all you XML folks. Stop picking up the XML crescent wrench and trying to use it as a hammer. Reach for the YAML.

With XML and .NET today, I can write new XmlSerializer(typeof(MyClass)).Serialize(stream, myObject), and get it all for free, both ways, with validation and versioning. It keeps my code short and clear. So long as it's not a performance bottleneck, why should I bother with anything else?

Not Network Byte Order by certsoft · 2008-07-09 06:48 · Score: 1

I noticed a posting in the discussion group questioning the use of little-endian "on the wire" rather than big-endian (network byte order). The response was that most of Google's computers were little-endian, which seems pretty short-sighted.
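For reference, the difference being argued about looks like this in Python. The sketch only uses the standard struct module; the sample value is arbitrary. As far as the public encoding docs go, the byte-order choice applies to the fixed-width 32-bit and 64-bit types, while varints are defined as 7-bit groups, least significant first, regardless of the host.

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)  # little-endian, as used for fixed32 on the wire
big = struct.pack(">I", value)     # big-endian, i.e. network byte order

print(sys.byteorder, little.hex(), big.hex())

# Either order decodes correctly on any host as long as both sides agree.
assert struct.unpack("<I", little)[0] == value
assert struct.unpack(">I", big)[0] == value
```

On a little-endian host the little-endian form can be copied straight into memory, while the big-endian form needs a byte swap, and the reverse holds on a big-endian host. So the choice mostly decides which machines pay the swap, not whether the format is portable.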

Re:Hey! XML wasn't intended to be fast. by DragonWriter · 2008-07-09 07:02 · Score: 1

The reason for XML's existence is to be human readable (especially by people who are used to reading HTML). That's it.

XML's not really good at being human-readable, either.

About the only thing XML is really good at is "being widely supported by available tools", which is often quite important.

making old things new by 12357bd · 2008-07-09 07:45 · Score: 1

Wow, define a datatype, and generate code snippets to manage it. Fantastic! And that's news!?

It seems people are just too busy reading new acronyms to learn computer programming fundamentals. Go figure!

But... hey!, that's from Google, it must be great somehow... Bullshit!

--
What's in a sig?

Re:No PERL API ??!!?? by dedazo · 2008-07-09 08:22 · Score: 1

Some people have already mentioned that on the Google Group [google.com], so it's probably a good idea to go there and compare ideas / combine efforts.

Cool, thanks.

--
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo

Re:Between a rock and hard place by goombah99 · 2008-07-09 10:01 · Score: 1

Whoopee I can import a library too. The point is if you're never going to have to look at the serialized object then it makes no difference how you serialize it. But if you are ever going to look at it or hand edit it (e.g. a config file, document header, debug report, automatic mail parsing, .... ) then YAML is the right choice. Use the right tool.

--
Some drink at the fountain of knowledge. Others just gargle.

Re:Between a rock and hard place by shutdown+-p+now · 2008-07-09 16:48 · Score: 1

You missed my point. XML has very good libraries available for virtually every platform and language out there, and, in most cases, they come as part of the base library. YAML, on the other hand, only comes with Ruby.

Re:Between a rock and hard place by goombah99 · 2008-07-12 07:52 · Score: 1

In addition to C libraries, bindings for YAML exist for the following languages:

* Perl
    o YAML:: is a common interface to several YAML parsers.
    o YAML::Tiny implements a useful subset of YAML; small, pure Perl, and faster than the full implementation.
    o YAML::Syck Binding to SYCK C-library. Offers fast, highly featured YAML
    o YAML::XS Binding to LibYAML. Better YAML 1.1 compatibility.
* PHP
    o Spyc is a pure PHP implementation
    o PHP-Syck (binding to SYCK library)
* Python
    o PyYaml Highly featured. Pure Python or optionally uses LibYAML.
    o PySyck Binding to SYCK C-Library
* Ruby (YAML included in standard library since 1.8, based on SYCK)
* Java
    o jvyaml based on Syck, and patterned off ruby-yaml
    o JYaml pure Java implementation
* R (programming language)
    o CRAN YAML based on SYCK
* JavaScript
    o native JavaScript emits but does not read YAML
    o YAML-Javascript emitter and parser
* .NET Framework
    o project page
* OCaml
    o OCaml-Syck
* C++
    o C++ wrapper for LibYAML
* Objective-C
    o Cocoa-Syck
* Lua
    o Lua-Syck
* Haskell
    o Haskell Reference wrappers

--
Some drink at the fountain of knowledge. Others just gargle.
