Slashdot Mirror


Google Open Sources Its Data Interchange Format

A number of readers have noted Google's open sourcing of its internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast — at least an order of magnitude faster than XML."

28 of 332 comments

  1. No PERL API ??!!?? by Proudrooster · · Score: 4, Insightful

    C++
    Python
    Java

    what about PERL ? :]

  2. Re:No PERL API ??!!?? by Anonymous Coward · · Score: 4, Insightful

    Go out and write one, sonny!

    That's the beauty of open source.

  3. Back to the 70's night? by Madball · · Score: 1, Insightful

    But here, in an unusual departure from the norm, the default values for these members are set to digits (for strings or literals) or values (for numerals) that define their place in a sequence -- where they fall within a record.

    Wow! They've invented fixed position data files. What will they invent next, a cool new programming language called RPG?

    1. Re:Back to the 70's night? by Temporal · · Score: 3, Insightful

      Wow! They've invented fixed position data files. What will they invent next, a cool new programming language called RPG?

      The article is actually completely wrong there. The protocol buffer binary format uses tag/value pairs, not fixed positions. Parsers simply ignore any tag they don't recognize and move on to the next.
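A minimal sketch of the tag/value idea described above, not Google's parser. It assumes the published wire layout, where each key is a varint holding (field_number << 3) | wire_type. The helper names `read_varint` and `decode_known_fields` and the sample message are hypothetical, chosen only for illustration.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def decode_known_fields(buf, known):
    """Return {field_number: value} for fields in `known`; skip all others."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        # The wire type alone says how many bytes to skip, so an
        # unrecognized tag never stops the parse.
        if field_number in known:
            fields[field_number] = value
    return fields


# Hypothetical message: field 1 = 150, field 99 = 7, field 2 = "hi".
MESSAGE = bytes([0x08, 0x96, 0x01, 0x98, 0x06, 0x07, 0x12, 0x02, 0x68, 0x69])
print(decode_known_fields(MESSAGE, {1, 2}))  # field 99 is silently ignored
```

A reader that only knows fields 1 and 2 still parses the whole message, which is why nothing depends on fixed positions.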

  4. Smart move by ruin20 · · Score: 5, Insightful
    Since they're Google, people will clamor over this (as we're doing here), and the result will be that at least a handful of folks learn and use it. Google's key to success has always been finding fresh talent and removing barriers to their contribution and advancement, so what I see they've done is A) help train potential employees on how their tech and thought process work, and B) provide themselves a filter by which to gauge a potential employee's ability to understand their system.

    And as a bonus, they help undermine opponents who use competing technologies by helping train the workforce away from their practices. Overall I think it's a very intelligent and well-done strategic move.

    --
    Oh honey look... How cute... an angry slashdotter!
  5. Re:Likely story! by cduffy · · Score: 5, Insightful

    Being 10x faster than XML to work with is entirely believable: If you're serializing directly to binary structures, those structures can be directly manipulated without any parsing at all... and if you need to do some byte-swapping and alignment adjustments to get them into and out of native form for your current processor, those are still operations which can be performed in a matter of a few CPU instructions, rather than through a few hundred KB of libraries.

    I drink the XML kool-aid plenty -- but there are things it's good for, and things it's not. Serializing and parsing truly massive amounts of data is part of the latter set.
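To make the point about binary structures concrete, here is a rough sketch using Python's standard `struct` module. The record layout (a 32-bit id, 16-bit flags, 64-bit float score) is invented for illustration and is not taken from Protocol Buffers or any real format.

```python
import struct

# Hypothetical fixed layout: uint32 id, uint16 flags, float64 score.
# "<" means little-endian with no padding; ">" means big-endian.
RECORD_LE = struct.Struct("<IHd")
RECORD_BE = struct.Struct(">IHd")

little = RECORD_LE.pack(7, 3, 1.5)
big = RECORD_BE.pack(7, 3, 1.5)

# Reading the record back is a single unpack call; there is no text to tokenize.
print(RECORD_LE.unpack(little))

# Converting between byte orders is just a byte swap within each field.
print(little[:4].hex(), big[:4].hex())
```

The whole "parser" is the format string, which is the kind of gap against a full XML library the comment is pointing at.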

  6. The killer feature is simplicity by jandrese · · Score: 5, Insightful

    The point of this isn't so much that it's faster than XML (so is everything else), it's that Google took everything that a real person needs in an IDL and cut out everything else. Most IDLs have a serious case of second-system effect, where features are added that nobody uses but that seriously complicate the API. Even XML suffers from that (have you ever seen the kind of data structure you need to store a DOM, or what that does to library APIs for manipulating XML?).

    I'd use it because 95% of the time all I need is something simple like this, and the other 5% of the time I should go back and rethink my design anyway.

    That said, there is still a case for XML, especially the self documenting and human readable nature of the document, but there are a lot of cases where it is used today where it only adds unnecessary complexity and actually makes your code more difficult to maintain instead of simpler.

    --

    I read the internet for the articles.
  7. Re:Likely story! by dedazo · · Score: 2, Insightful

    The 10x does not refer to the transmission speed (you're not getting that for a 100KB XML string vs. a 80KB binary blob), but the speed at which the [de]serialization occurs.

    In fact this approach is even faster than runtime-specific stream serialization like cPickle in Python or the built-in binary formatter in the .NET CLR, because those use reflection.

    --
    Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
  8. XML is a crappy format by Alex+Belits · · Score: 4, Insightful

    I always told people that -- it's optimized for:

    1. Easy parsing by parsers written by people who slept through their compiler classes.

    2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" or "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")

    3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.

    4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.

    None of the above even remotely applies to anything practical except UI/display formats -- this is why XHTML and ODF (and because of that, to some extent, XSL) are usable, SOAP is a load of crap, and for the rest of purposes XML is used as a glorified CSL with angle brackets. XML is widespread because a monumentally stupid standard is still better than no standard.

    So here is your example of how superior ANY format that is not based on this stupid idea can be.

    --
    Contrary to the popular belief, there indeed is no God.
    1. Re:XML is a crappy format by mmurphy000 · · Score: 3, Insightful

      Y'know, I usually give low-UID Slashdotters a modicum of respect, but this diatribe is off-the-charts nonsense.

      1. Easy parsing by parsers written by people who slept through their compiler classes.

      And your evidence of this assertion is...what exactly? Not to mention the minor detail that XML and compilers are orthogonal: you can use XML (or many other data interchange formats) with non-compiled languages, and most compilers know nothing about XML (or many other data interchange formats).

      2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" or "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")

      And your evidence of this assertion is...what exactly? XML-consuming programs that are aware of the data structure can have as detailed a "reaction to a failure" as a JSON-consuming program, or a YAML-consuming program, or a Protocol Buffer-consuming program. XML-consuming programs that are not aware of the data structure can, if the XML supplies it, validate against a DTD or schema, things which are not possible in some other data interchange formats (e.g., JSON, YAML).

      3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.

      All data comes in neatly packaged buckets of varying types. We call them "bytes" and "packets" and "structures" and "records" and "frames" and "rows" and the like. The only way I can interpret your claim in a way that makes sense is to translate it as "XML sucks for streaming audio and video", which is undoubtedly true, and I don't think anyone uses it in that arena.

      4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.

      On the contrary, this is one of XML's primary strengths — handling cases where programs lack the "same knowledge of message semantics".

      With most data interchange formats, from CSV to JSON to Protocol Buffers, either you know everything about the data structure you're receiving, or you're screwed. In other words, there is no discoverability and no standardized means of being able to only deal with a portion of the data. This is particularly true for binary formats, like Protocol Buffers — either you know exactly what structure you received so you can parse it, or you're SOL, since it's just a bunch of bytes.

      With XML namespaces, it is entirely possible for Program X to publish data that Program Y has no intrinsic knowledge of in its entirety, but might know in part. If Program Y knows how to handle documents containing Dublin Core elements, for example, it can work with just those elements and ignore the rest of the document.
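As a rough illustration of that partial-knowledge scenario, the sketch below uses Python's standard `xml.etree.ElementTree`. The sample document, the `urn:example:program-x` namespace, and the function name `dublin_core_fields` are made up for the example; only the Dublin Core namespace URI is real.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Hypothetical document from "Program X": Dublin Core elements mixed with
# elements from a private namespace that "Program Y" knows nothing about.
DOC = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:x="urn:example:program-x">
  <dc:title>Protocol Buffers</dc:title>
  <x:internal-id>42</x:internal-id>
  <dc:creator>Google</dc:creator>
</record>"""


def dublin_core_fields(xml_text):
    """Collect only the Dublin Core elements; ignore everything else."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC + "}"  # ElementTree spells qualified tags as {uri}local
    return {
        el.tag[len(prefix):]: el.text
        for el in root.iter()
        if el.tag.startswith(prefix)
    }


print(dublin_core_fields(DOC))
```

Program Y never needs a schema for the `x:` elements; it selects what it recognizes by namespace and leaves the rest alone.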

      You're welcome to have any opinion of XML you like. Heck, I even agree that XML tends to be used in places where it's overkill or too verbose. But if you want to convince others that your opinion is the correct one, you'll need to do a better job than this.

  9. Re:Now just release Goobuntu... by Anonymous Coward · · Score: 1, Insightful

    Trust me, you won't.

    Typed on a Goobuntu machine.

  10. Re:Why another encoding scheme? by QuoteMstr · · Score: 4, Insightful

    This is just yet another way in which Google demonstrates that it is suffering from NIH syndrome. Instead of improving existing tools, they have to go off and re-invent all the bad mistakes of the past, including non-relational databases, clunky binary encodings, and a bizarre non-POSIX filesystem.

    Just imagine how far ahead we would be today if Google had put the same effort into creating tools the rest of the SQL-writing, open(2)-using world could use.

  11. Re:WTF am I missing by Chyeld · · Score: 5, Insightful

    Seems like you are missing the code they released that allows you to implement this in a number of languages from the 'get-go'.

    You've also missed that they've just told the world how the majority of their systems talk, something most people would find interesting given how much Google does and the fact that one of Google's strong points is mangling huge amounts of data in a relatively quick manner.

    PS. Your format stinks and is horribly slow and unscalable when it comes to adding to the library. Genres are so unbelievably vaguely defined that you might as well just sort them by the dominant color of the cover. Google would have done better.

  12. Have they ever heard of BER/DER? by ugen · · Score: 2, Insightful

    How is this either implementationally or conceptually different from BER/DER encoding (commonly used and available all over the place)?

    Looks to me like it is exactly the same thing, reimplemented. I am sure bearing a mark of Google is nice and all, but they are definitely reinventing the wheel here.

  13. Re:How about C? by AuMatar · · Score: 2, Insightful

    They gave you C++. If you can't translate C++ to C, please turn in your keyboard and leave.

    --
    I still have more fans than freaks. WTF is wrong with you people?
  14. Re:WTF am I missing by Anonymous Coward · · Score: 1, Insightful

    You are missing that you're an idiot. Cheers.

  15. Binary message formats are good by kriston · · Score: 2, Insightful

    Thankfully an alternative to XML.
    If you didn't think XML was among the least efficient transport formats then you weren't really paying attention. Battery-conscious mobile devices do not really enjoy parsing an XML DTD and then the XML file itself.
    It reminds me a little bit of AOL's SNAC message types.

    We get something good for the industry from Google, after a rash of bad press, and it is actually NOT a beta.

    --

    Kriston

  16. Re:No PERL API ??!!?? by mpeg4codec · · Score: 5, Insightful

    Perl is to programming languages what English is to natural languages: easy to fool around with, hard to learn well, but when you do, the expressive power is incredible. And when you mess it up, nobody understands what you're trying to say.

  17. Old-School Property lists? by menace3society · · Score: 3, Insightful

    The similarity between these things and NeXT's Property Lists (called "Old-School Property Lists" now that Apple/NeXT has standardized on XML) is incredible. Some things are changed, like having a specification instead of just assuming that the recipient will parse it and figure it out, but the likeness is there. I wonder if any of the proto people at Google had experience with plists, or if it's just a case of convergent design.

    Everything old-school is new-school again, I guess.

  18. Re:Why another encoding scheme? by miffo.swe · · Score: 4, Insightful

    I don't think it's NIH syndrome. They no doubt tested other solutions before doing their own thing.

    Don't forget this code is in widespread use and works very well. Google's server farm ain't exactly small and the load they see is probably second to none.

    A couple of percent better efficiency for Google probably means millions in saved costs. Spending a couple of months of development on something like this is money well spent.

    I guess if all you have is SQL everything is a SQL SELECT no matter what you want to achieve.

    --
    HTTP/1.1 400
  19. Re:This is a good thing by Temporal · · Score: 5, Insightful

    The example they give is for a small set of data, and percentages vary more dramatically as sample sizes decrease.

    We wanted to give an idea of the speed without trying to boast too much or look like we were directly challenging anyone. Of course every news outlet has chosen to highlight the speed comment -- including the numbers which were intended to be ballpark figures -- more than was intended, but I guess that isn't surprising.

    I agree that the tiny "person" example is not a good benchmark case. It was intended as a usage example, not a speed example, but I stuck the speed numbers in there just meaning to give people a vague idea of the difference. The "20-100 times faster" comment is based on testing a variety of formats -- both unrealistic ones and real-life formats used in our search pipeline -- against programmatically generated XML equivalents (which may or may not themselves be realistic, though they contain the same data with the same structure). libxml2 was used for parsing XML. I don't really know how libxml2's speed compares to other XML parsers, but I didn't have a lot of time to investigate. The 20x faster number comes from the largest data set (~100k-ish) while the 100x number comes from a very small message. The most realistic case was about 50x. Sorry that I cannot provide exact details of the benchmark setup since many of the test cases were proprietary internal formats.

    In any case, I'm hoping that some independent source conducts some tests because I think anything we produced would probably have unintentional biases in it. Of course, I'll update the numbers in the docs if they turn out to be wildly off-base.

  20. Re:Between a rock and hard place by metamatic · · Score: 3, Insightful

    Funny, I'm tired of seeing YAML in places where XML would work fine.

    Like serializing my Ruby objects, for example. When I don't care about performance, XML is best, because almost everything else will read and write it, including my text editor, and I know the syntax. When I *do* care about performance, I'm not going to use YAML either.

    I don't see the niche YAML fits, frankly.

    --
    GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
  21. XML is not a 'format'! by r3g3x · · Score: 3, Insightful

    XML is a crappy format

    That statement underlines most people's myopic vision of the XML family of technologies. XML is not a format; it is a family of technologies based around a common grammar.

    XML is not a bucket.
    It is not a passive container for data.
    It is a transformable semantic graph.

    The heart and soul of XML is XSLT; it serves as a common 'glue' that allows the transformation between the various standardized 'languages': XML, XHTML, XSLT, XSL-FO, SVG, RDF, RSS, etc.

    Example: the same XML document (let's say it represents rows in a database) can be transformed into a web page, PDF file, visual graph, RSS feed, directed graph, or [insert non-XML text-based output of choice]. More importantly, the transformation can take place on the client side of a transaction, effectively decoupling content and representation.

    That being said, I completely agree that XML is overkill for simple fixed-format message passing. But then again, simple fixed-format message passing isn't what XML was really designed for :-) XML was designed for situations where the representation needs of the client are unknown and/or dynamic.

    --
    If you don't know XSLT you don't know XML

    1. Re:XML is not a 'format'! by r3g3x · · Score: 3, Insightful

      XML is absolutely definitely a format -- eXtensible Markup Language.

      XML is a system of grammar that is used to create defined formats.

      You can't use XML to mark up data. You have to use a defined grammar to create a format. You might say that this is an issue of semantics, but that is the point. If your only use/understanding of XML is as a static data format then you're doing it [XML/XSLT/..] wrong.

      XML is a crappy tool for static storage. If the data is being read/written by the same program there are faster/simpler ways to encode that data. But that isn't what XML is meant for. To repeat my previous post: XML documents are abstracted semantic models that are designed to be transformed and dynamically interpreted.

      Here is a link to an example of how XML/XSLT can be used to extend and enhance an existing XML-based web service [Generating RSS with XSLT and Amazon ECS]. This is a perfect example of the agnostic client scenario that XML was designed for (i.e., the service couldn't care less how the data is represented or transformed).

  22. need something different by speedtux · · Score: 4, Insightful

    If Google had tried to build their system on relational databases, XDR, and NFS, they would have spent huge amounts of money and spent lots of time trying to shoehorn their software into those constraints. And it's not just Google that did this: Amazon did the same thing, with their SimpleDB, S3, and SQS.

    The actual mistakes were relational databases, XML, and distributed POSIX file systems; all of those were systems designed by people with too much time on their hands and no real-world, large-scale problems to solve. Finally, those mistakes are getting corrected, at least when it comes to high-end computing. At the low end, I suppose people will continue to tinker around with those toys.

  23. Re:An order of magnitude over XML? by vrmlguy · · Score: 5, Insightful

    Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.

    Actually, XDR (used for Sun's RPC) is very lightweight, arguably lighter than PB. (Yes, I foresee a Java implementation called PB&J.) XDR is potentially more compact, since it doesn't encode field identifiers, but it's also big-endian, which made it less attractive as little-endian computer architectures took over the world. Also, while XDR demands a fixed ordering of fields, field order in PB *isn't* specified; the field identifiers allow you to order the fields any way that you like.
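The ordering difference can be sketched in a few lines of Python. This is a toy comparison, not an implementation of either protocol: the two-field record, the `FIELD_NAMES` table, and all function names are hypothetical, and the tagged encoder assumes field numbers below 16 and values below 128 so that every key and value fits in one byte.

```python
import struct

# XDR-style: big-endian 32-bit integers, no tags. Position is the schema.
def xdr_pack(user_id, age):
    return struct.pack(">ii", user_id, age)

def xdr_unpack(buf):
    user_id, age = struct.unpack(">ii", buf)
    return {"user_id": user_id, "age": age}

# Tagged style: every value carries its field number, so order is free.
FIELD_NAMES = {1: "user_id", 2: "age"}  # hypothetical schema

def tagged_pack(pairs):
    out = bytearray()
    for number, value in pairs:
        out.append(number << 3)  # key byte: field number plus wire type 0
        out.append(value)        # one-byte varint, valid because value < 128
    return bytes(out)

def tagged_unpack(buf):
    return {FIELD_NAMES[buf[i] >> 3]: buf[i + 1] for i in range(0, len(buf), 2)}

print(xdr_unpack(xdr_pack(5, 30)))
print(tagged_unpack(tagged_pack([(1, 5), (2, 30)])))
print(tagged_unpack(tagged_pack([(2, 30), (1, 5)])))  # same record, other order
```

With values this small the tagged form happens to be shorter, because varints shrink small integers; for values that need all 32 bits, the extra key bytes make it the larger of the two, which is the compactness trade-off mentioned above.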

    Overall, I like it. It's obvious that the developers were familiar with the flaws of older protocols, and found ways to fix most of them. The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.

    --
    Nothing for 6-digit uids?
  24. Re:Why another encoding scheme? by CoughDropAddict · · Score: 4, Insightful

    You think it's a "mistake of the past" that Google wrote things like GFS and BigTable that run on commodity hardware, scale basically horizontally (e.g., you can just throw machines at the problem) and survive machine failures without human intervention?

    You don't "improve" on an existing tool like a relational database by adding a "feature" like fault tolerance. You have to redesign from the base up with those assumptions.

  25. Re:Why another encoding scheme? by SnowZero · · Score: 2, Insightful

    ~2000-2001 (I think the reference is in this video). Even if something newer is a bit better, we're not going to go back and port everything. Some future Google APIs will probably have an optional PB interface, because that's what the data was being converted to internally anyway, so everyone might as well benefit from the compact over-the-wire encoding.