Google Open Sources Its Data Interchange Format
A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast — at least an order of magnitude faster than XML."
C++
Python
Java
what about PERL ? :]
Go out and write one, sonny!
That's the beauty of open source.
And as a bonus, they help undermine opponents who use competing technologies by helping train the workforce away from their practices. Overall I think it's very intelligent and well done strategic move.
Oh honey look... How cute... an angry slashdotter!
Being 10x faster than XML to work with is entirely believable: If you're serializing directly to binary structures, those structures can be directly manipulated without any parsing at all... and if you need to do some byte-swapping and alignment adjustments to get them into and out of native form for your current processor, those are still operations which can be performed in a matter of a few CPU instructions, rather than through a few hundred KB of libraries.
I drink the XML kool-aid plenty -- but there are things it's good for, and things it's not. Serializing and parsing truly massive amounts of data is part of the latter set.
The point of this isn't so much that it's faster than XML (so is everything else), it's that google took everything that a real person needs in a IDL and cut out everything else. Most IDLs have a serious case of second system effect, where features are added that nobody uses but seriously complicate the API. Even XML suffers from that (have you ever seen the kind of data structure you need to store a DOM, or what that does to library APIs for manipulating XML)?
I'd use it because 95% of the time all I need is something simple like this, and the other 5% of the time I should go back and rethink my design anyway.
That said, there is still a case for XML, especially the self documenting and human readable nature of the document, but there are a lot of cases where it is used today where it only adds unnecessary complexity and actually makes your code more difficult to maintain instead of simpler.
I read the internet for the articles.
I always told people that -- it's optimized for:
1. Easy parsing by parsers written by people who slept through their compiler classes.
2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" and "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")
3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.
4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.
None of the above even remotely applies to anything practical except UI/display formats -- this is why XHTML and ODF (and because of that at some extent XSL) are usable, SOAP is a load of crap, and for the rest of purposes XML is used as a glorified CSL with angle brackets. XML is widespread because monumentally stupid standard is still better than no standard.
So here is your example of how superior can be ANY format that is not based on this stupid idea.
Contrary to the popular belief, there indeed is no God.
This is just yet another way in which Google demonstrates that it is suffering from NIH syndrome. Instead of improving existing tools, they have to go off and re-invent all the bad mistakes of past, including non-relational databases, clunky binary encodings, and a bizarre non-POSIX filesystem.
Just imagine how far we ahead we would be today if Google had put the same effort into creating tools the rest of the SQL-writing, open(2)-using world could use.
Seems like you are missing the code they released that allows you to implement this in a number of languages from the 'get-go'.
You've also missed that they've just told the world how the majority of their systems talk, something most people would find interesting given how much Google does and the fact that one of Google's strong points is mangling huge amounts of data in a relatively quickly manner.
PS. Your format stinks and is horribly slow and unscalable when it comes to adding to the library. Genre's are so unbelievably grey defined that you might as well just sort them by the dominate color of the cover. Google would have done better.
Perl is to programming languages what English is to natural languages: easy to fool around with, hard to learn well, but when you do, the expressive power is incredible. And when you mess it up, nobody understands what you're trying to say.
I dont think its NIH syndrome. They no doubt tested other solutions before doing their own thing.
Dont forget this code is in widespread use and works very well. Googles server farm aint exactly small and the load they see is probably second to none.
A couple of percents of better efficiency for Google probably means millions in saved costs. Tossing a couple of months on development on something like this is money well spent.
I guess if all you have is SQL everything is a SQL SELECT no matter what you want to achieve.
HTTP/1.1 400
We wanted to give an idea of the speed without trying to boast too much or look like we were directly challenging anyone. Of course every news outlet has chosen to highlight the speed comment -- including the numbers which were intended to be ballpark figures -- more than was intended, but I guess that isn't surprising.
I agree that the tiny "person" example is not a good benchmark case. It was intended as a usage example, not a speed example, but I stuck the speed numbers in there just meaning to give people a vague idea of the difference. The "20-100 times faster" comment is based on testing a variety of formats -- both unrealistic ones and real-life formats used in our search pipeline -- against programmatically generated XML equivalents (which may or may not themselves be realistic, though they contain the same data with the same structure). libxml2 was used for parsing XML. I don't really know how libxml2's speed compares to other XML parsers, but I didn't have a lot of time to investigate. The 20x faster number comes from the largest data set (~100k-ish) while the 100x number comes from a very small message. The most realistic case was about 50x. Sorry that I cannot provide exact details of the benchmark setup since many of the test cases were proprietary internal formats.
In any case, I'm hoping that some independent source conducts some tests because I think anything we produced would probably have unintentional biases in it. Of course, I'll update the numbers in the docs if they turn out to be wildly off-base.
If Google had tried to build their system on relational databases, XDR, and NFS, they would have spent huge amounts of money and spent lots of time trying to shoehorn their software into those constraints. And it's not just Google that did this: Amazon did the same thing, with their SimpleDB, S3, and SQS.
The actual mistakes were relational databases, XML, and distributed POSIX file systems; all of those were systems designed by people with too much time on their hand and no real-world, large scale problems to solve. Finally, those mistakes are getting corrected, at least when it comes to high-end computing. At the low end, I suppose people will continue to tinker around with those toys.
Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.
Actually, XDR (used for Sun's RPC) is very lightweight, arguably lighter than PB. (Yes, I forsee a Java implementation called PB&J.) XDR is potentially more compact, since it doesn't encode field identifiers, but it's also big-endian, which made it less attactive as little-endian computer archtectures took over the world. Also, while XDR demands a fixed ordering of fields, field order in PB *isn't* specified; the field identifiers allow you to order the fields anyway that you like.
Overall, I like it. It's obvious that the developers were familar with the flaws of older protocols, and found ways to fix most of them. The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.
Nothing for 6-digit uids?
You think it's a "mistake of the past" that Google wrote things like GFS and BigTable that run on commodity hardware, scale basically horizontally (eg. you can just throw machines at the problem) and survive machine failures without human intervention?
You don't "improve" on an existing tool like a relational database by adding a "feature" like fault tolerance. You have to redesign from the base up with those assumptions.