Google Open Sources Its Data Interchange Format

Posted by kdawson on Tuesday July 8, 2008 @08:07AM from the it's-fast-that's-why dept.

A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast — at least an order of magnitude faster than XML."

Re:Likely story! by caerwyn · 2008-07-08 08:26 · Score: 3, Informative

Are you serious? XML is great for certain applications, but the one thing it *isn't* is fast. It's very believable that something like this could be an order of magnitude faster.

--
The ringing of the division bell has begun... -PF
compare to thrift ( from facebook) by Anonymous Coward · 2008-07-08 08:29 · Score: 5, Informative

Both really come from the same design sheet, but Thrift has been open source for over a year and has many more language bindings. It's been in use in several open source projects (thrudb comes to mind), and has far more existing articles and documentation.
http://developers.facebook.com/thrift/
Re:No PERL API ??!!?? by yknott · 2008-07-08 08:40 · Score: 5, Informative

According to the blog of Brad Fitzpatrick (of LiveJournal fame), he's working on Perl support.
Re:WTF am I missing by jandrese · 2008-07-08 08:47 · Score: 5, Informative

They open sourced the compiler (for C++, Java, and Python) that lets you actually use the data interchange format. If you follow the link you can download the code and start using it today. The code is open source.

--

I read the internet for the articles.
Re:JSON by Temporal · 2008-07-08 09:20 · Score: 4, Informative

Structurally Protocol Buffers are similar to JSON, yes. In fact, you could use the classes generated by the Protocol Buffer compiler together with some code that encodes and decodes them in JSON. This is something some Google projects do internally since it's useful for communicating with AJAX apps. Writing a custom encoding that operates on arbitrary protocol buffer classes is actually pretty easy since all protocol message objects have a reflection interface (even in C++).
The advantage of using the protocol buffer format instead of JSON is that it's smaller and faster, but you sacrifice human-readability.
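To make the reflection point concrete, here is a minimal sketch in Python. It does not use the Protocol Buffers library at all: a plain dataclass named Person is a hypothetical stand-in for a generated message class, and dataclasses.fields() plays the role of the reflection interface. The helper names to_json_dict and to_json are made up for illustration.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical stand-in for a class generated from a .proto file.
    id: int
    name: str
    email: Optional[str] = None


def to_json_dict(msg):
    """Walk a message's fields by reflection and build a JSON-ready dict."""
    out = {}
    for f in fields(msg):
        value = getattr(msg, f.name)
        if value is None:
            continue  # unset optional fields are simply left out
        if is_dataclass(value):
            value = to_json_dict(value)  # nested message
        elif isinstance(value, list):
            # repeated field: convert nested messages, pass scalars through
            value = [to_json_dict(v) if is_dataclass(v) else v for v in value]
        out[f.name] = value
    return out


def to_json(msg):
    """Encode any such message as JSON text without per-class code."""
    return json.dumps(to_json_dict(msg), sort_keys=True)
```

With this sketch, to_json(Person(id=1, name="Ada")) gives {"id": 1, "name": "Ada"}. The encoder never mentions Person by name, which is the property the comment describes: one generic routine works for every message type.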
Re:Why another encoding scheme? by Abcd1234 · 2008-07-08 09:33 · Score: 4, Informative

You think? Take BigTable. Wikipedia describes it as: '"a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases'. Sounds, to me, like a specialized solution to a very specialized problem, a problem that, I presume, didn't fit with any existing solution. Same goes with GFS. After all, do you really think they didn't evaluate existing solutions before embarking on building an entirely new distributed filesystem? Do you really think they're that stupid?
As for Protocol Buffers, given the existing solutions out there (such as ASN.1 and CORBA) are generally ugly and/or over-engineered, it sounds to me like they're simply addressing a gap in the industry... after all, XML and SOAP aren't the end-all and be-all of generic object-passing protocols.
ASN.1 encoded with BER/DER just needs tools by Animats · 2008-07-08 09:51 · Score: 3, Informative

ASN.1, from 1985, really is very similar. Here's a message defined in ASN.1 form:

Order ::= SEQUENCE {
    header  Order-header,
    items   SEQUENCE OF Order-line
}

Order-header ::= SEQUENCE {
    number   Order-number,
    date     Date,
    client   Client,
    payment  Payment-method
}

Order-number ::= NumericString (SIZE (12))

Date ::= NumericString (SIZE (8)) -- MMDDYYYY

Client ::= SEQUENCE {
    name      PrintableString (SIZE (1..20)),
    street    PrintableString (SIZE (1..50)) OPTIONAL,
    postcode  NumericString (SIZE (5)),
    town      PrintableString (SIZE (1..30)),
    country   PrintableString (SIZE (1..20)) DEFAULT default-country
}

default-country PrintableString ::= "France"

Payment-method ::= CHOICE {
    check        NumericString (SIZE (15)),
    credit-card  Credit-card,
    cash         NULL
}

Credit-card ::= SEQUENCE {
    type         Card-type,
    number       NumericString (SIZE (20)),
    expiry-date  NumericString (SIZE (6)) -- MMYYYY --
}

Card-type ::= ENUMERATED {
    cb(0), visa(1), eurocard(2), diners(3), american-express(4)
}

Note that this has almost exactly the same feature set as Google's representation. There are named, typed fields, which can be optional or repeated. It just looks more like Pascal, while Google's syntax looks more like C.
Re:JSON by pavon · 2008-07-08 09:57 · Score: 3, Informative

The major difference between this and something like JSON or YAML or even XML is that those formats all include the format information (variable names, nesting, etc) along with the data. This does not.

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
What you are looking at above is the Protocol Format (.proto file) for a single message, which is analogous to an XML schema. No data is stored in that file. The numbers you see are unique ids for the different fields, and they are used in the low-level representation of the data (not all fields have to be included in every instance of a message).
The actual data is serialized using a compact binary format, not ASCII like JSON/YAML/XML, which makes it much more efficient both to transfer over a network and to parse.
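As a rough illustration of how those field numbers end up in the binary form, here is a hand-rolled Python sketch of the encoding for the Person message above. It is not the generated code or the library API; encode_varint, encode_tag and encode_person are hypothetical helper names. It relies on two facts about the wire format: each field is prefixed by a key equal to (field_number << 3) | wire_type, and integers are written as base-128 varints.

```python
def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    n &= 0xFFFFFFFFFFFFFFFF  # negative ints are sent as their 64-bit two's complement
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number, wire_type):
    """Key that precedes every field: field number plus a 3-bit wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_person(id, name, email=None):
    """Serialize the Person message by hand (a sketch, not generated code)."""
    buf = encode_tag(1, 0) + encode_varint(id)  # wire type 0: varint
    data = name.encode("utf-8")
    buf += encode_tag(2, 2) + encode_varint(len(data)) + data  # type 2: length-delimited
    if email is not None:  # optional field: left out entirely when unset
        data = email.encode("utf-8")
        buf += encode_tag(3, 2) + encode_varint(len(data)) + data
    return buf
```

For example, encode_person(150, "Ada").hex() is 0896011203416461, eight bytes in total. The names "id" and "name" never appear in the output; only the field numbers 1 and 2 do, which is why both sides need the .proto definition to make sense of the bytes.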
Re:An order of magnitude over XML? by jd · 2008-07-08 10:12 · Score: 4, Informative

Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.
There have been all kinds of attempts to produce this sort of stuff. RPC, DCE, CORBA, DCOM, etc, are programmatic interfaces and handle function calls, synchronization, etc. OPeNDAP is probably the closest to Google's architecture in that it is ONLY data. It's more sophisticated, as it handles much more complex data types than mere structures, but it has its own overhead issues. It isn't designed to scale to terabyte databases, although it DOES scale extremely well and is definitely the preferred method of delivering high-volume structured scientific data - at least when compared to the RPC family of methods, or indeed the XML family. I wouldn't use it for the kind of volume of data Google handles, though; you'd kill the servers.
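On plucking specific data out of a stream: in the Protocol Buffers wire format every field is prefixed with a key holding its field number and wire type, so a reader can step over fields it does not care about without understanding them. The Python sketch below shows the idea under some assumptions: it handles only varint (type 0) and length-delimited (type 2) fields, and read_varint and find_field are hypothetical helpers, not part of Google's library.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def find_field(buf, wanted):
    """Return the first value stored under field number `wanted`, or None."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if field_number == wanted:
            return value
    return None
```

Given the bytes 08 96 01 12 03 41 64 61 (field 1 = 150, field 2 = "Ada"), find_field(data, 2) returns b"Ada" after skipping field 1, and find_field(data, 3) returns None because that field was never written.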

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
More XML? EXI, Efficient Xml Interchange! by refactored · 2008-07-08 12:12 · Score: 3, Informative

http://www.w3.org/XML/EXI/

The development of the Efficient XML Interchange (EXI) format was guided by five design principles, namely, the format had to be general, minimal, efficient, flexible, and interoperable. The format satisfies these prerequisites, achieving generality, flexibility, and performance while at the same time keeping complexity in check.
Many of the concepts employed by the EXI format are applicable to the encoding of arbitrary languages that can be described by a grammar. Even though EXI utilizes schema information to improve compactness and processing efficiency, it does not depend on accurate, complete or current schemas to work.
Re:Likely story! by cnettel · 2008-07-08 12:13 · Score: 4, Informative

The problem is that, in my experience, it is easy to write a 99 % XML-compliant parser that is 10 times faster. That last percent, though...
Re:An order of magnitude over XML? by vrmlguy · 2008-07-08 15:25 · Score: 3, Informative

The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.
A-ha! I found it! "Thus, the classes in this file allow protocol type definitions to be communicated efficiently between processes."
Why do you need this? Well, you may not. "Most users will not care about descriptors, because they will write code specific to certain protocol types and will simply use the classes generated by the protocol compiler directly. Advanced users who want to operate on arbitrary types (not known at compile time) may want to read descriptors in order to learn about the contents of a message."

--
Nothing for 6-digit uids?