Google Open Sources Its Data Interchange Format
A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast -- at least an order of magnitude faster than XML."
So is, well, just about anything.
Isn't xdr compact enough?
"Google's blogger claims, "And, yes, it is very fast -- at least an order of magnitude faster than XML."
That is just because they aren't using enough XML!
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
Is that like PHP's serialize?
I must say - I'm amazed.
C++
Python
Java
what about PERL ? :]
Just think of the kind of power it took to make millions of employees standardize on the same format for their data interchange. Humans just gravitate to power wielding forces. Wonder what format they require for their surprise blog posts.
Go out and write one, sonny!
That's the beauty of open source.
SunRPC is old and awkward. Always want something better.
...and we'll be happy.
It is by my will alone my thoughts acquire motion; it is by the juice of the coffee bean that the thoughts acquire speed
But here, in an unusual departure from the norm, the default values for these members are set to digits (for strings or literals) or values (for numerals) that define their place in a sequence -- where they fall within a record.
Wow! They've invented fixed position data files. What will they invent next, a cool new programming language called RPG?
I'm sure it won't take long for the module to show up on CPAN.
I read the internet for the articles.
It looks like Google has taken some of the good elements of CORBA and IIOP into its own interchange format.
While CORBA certainly is bloated in a lot of ways, the IIOP wire protocol it uses is vastly faster and more efficient than any XML out there, and yes, it is just as "open" (publicly documented and freely available for use in any open source application) as any XML schema out there. J2EE uses IIOP as well, and it is technically possible to interoperate (although the problem with CORBA is that different implementations never really interoperated as they were supposed to).
As a side note, I'd rather write IDL code than an XML schema any day of the week too, but that's another rant.
Both are really from the same design sheet, but Thrift has been open source for over a year and has many more language bindings. It's been in use in several open source projects (thrudb comes to mind), and has many more extant articles and much more documentation.
http://developers.facebook.com/thrift/
"And, yes, it is very fast â" at least an order of magnitude faster than XML."
Just wait for the XML zealots to come crashing and not believing that XML is not the fastest, best, solution to all the world's problems (including cancer) and of course people at Google are amateurs and id10ts and WHY DO YOU HATE XML kind of stuff.
Or, as Joel Spolsky once said: http://www.joelonsoftware.com/articles/fog0000000296.html
No, there is nothing wrong with XML per se, except for the fans...
how long until
And as a bonus, they help undermine opponents who use competing technologies by helping train the workforce away from their practices. Overall I think it's very intelligent and well done strategic move.
Oh honey look... How cute... an angry slashdotter!
Binary encodings, non-hierarchical string lists, and simple file serialization are all faster than XML. XML was created for flexibility, commonality, and human readability, not speed. XSL, XQuery, and XPath, along with the DOM or SAX, supply out-of-the-box query, transformation, and manipulation capability.
The point of this isn't so much that it's faster than XML (so is everything else), it's that Google took everything that a real person needs in an IDL and cut out everything else. Most IDLs have a serious case of second-system effect, where features are added that nobody uses but that seriously complicate the API. Even XML suffers from that (have you ever seen the kind of data structure you need to store a DOM, or what that does to library APIs for manipulating XML?).
I'd use it because 95% of the time all I need is something simple like this, and the other 5% of the time I should go back and rethink my design anyway.
That said, there is still a case for XML, especially the self documenting and human readable nature of the document, but there are a lot of cases where it is used today where it only adds unnecessary complexity and actually makes your code more difficult to maintain instead of simpler.
I read the internet for the articles.
So...when can we abandon these silly letters and decimal numbers to express ourselves in binary? It's like the elephant in the room. We all want a semantic web, but we all want it in English. At least Lojban has a start on a parsable language, but it still wants to be speakable.
That's OK, we have Storable.
According to the blog of Brad Fitzpatrick (of LiveJournal fame), he's working on Perl support.
I always told people that -- it's optimized for:
1. Easy parsing by parsers written by people who slept through their compiler classes.
2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" or "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")
3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.
4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.
None of the above even remotely applies to anything practical except UI/display formats -- this is why XHTML and ODF (and because of that, to some extent, XSL) are usable, SOAP is a load of crap, and for the rest of purposes XML is used as a glorified CSL with angle brackets. XML is widespread because a monumentally stupid standard is still better than no standard.
So here is your example of how superior ANY format can be that is not based on this stupid idea.
Contrary to the popular belief, there indeed is no God.
Microsoft has open-sourced some things upon abandonment. That's better than some companies, even. Companies can be good in some areas, and evil in others, however.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Yeah, and I'd like this for the .NET CLR and Mono as well. I looked at the code and the generators are not that complicated, maybe I'll give it a shot over the weekend. Does Google accept outside contribs for projects like these?
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
http://brad.livejournal.com/2387105.html
Nerd rage is the funniest rage.
They open sourced the compiler (for C++, Java, and Python) that lets you actually use the data interchange format. If you follow the link you can download the code and start using it today. The code is open source.
I read the internet for the articles.
Looks kinda like JSON to me.
Seems like you are missing the code they released that allows you to implement this in a number of languages from the 'get-go'.
You've also missed that they've just told the world how the majority of their systems talk, something most people would find interesting given how much Google does and the fact that one of Google's strong points is mangling huge amounts of data relatively quickly.
PS. Your format stinks and is horribly slow and unscalable when it comes to adding to the library. Genres are so unbelievably grey-defined that you might as well just sort them by the dominant color of the cover. Google would have done better.
How is this either implementationally or conceptually different from BER/DER encoding (commonly used and available all over the place)?
Looks to me like it is exactly the same thing, reimplemented. I am sure bearing a mark of Google is nice and all, but they are definitely reinventing the wheel here.
lol. Not that FAST is IDENTICAL, but it is essentially just a much more sophisticated implementation of the same basic idea...
"Malo periculosam, libertatem quam quietam servitutem." -- Jefferson
You are missing that you're an idiot. Cheers.
It's called "Perl".
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
You open access to the source code of the C++, Java and Python libraries that you use in your internal work.
Think global, act loco
I guess that XDR wasn't good enough, then, or ASN.1 (which supports multiple abstract encodings to boot).
XML, as an interchange format?
I suppose one could load source code into memory, and compile it every time, too. Even Java compiles to bytecode.
Bloated formats are fine for human interpretation (I rather like one kind of structure for my config files), or occasional parsing (which is why most of the stuff in /etc is human-readable, for small data sets; I do remember when "the internet" was one big /etc/hosts file), but for interchange? Just cause you're big-endian and I'm little-endian?
The trick to making non-human-readable formats acceptable is the prevalence of widespread encoding and decoding tools.
Yes, XML is self-describing, at least syntactically (and formally with an XSD), and specific encoding semantics can be tagged, but the same can be achieved with means for type encoding. The big thing with XDR and related formats is that types are implicit -- both ends need to know what is being serialized. For RPC, with well-defined interfaces, this is not a problem, but it does make type-checking a remote service a bit of a challenge.
However, types can be encoded as data, and serialized as well: this happens for variant types naturally. Thus, there is no reason to not have a type-encoding and type-exchange protocol to permit dynamic type-checking. The advantage over self-describing data serializations is that it can be done on an as-required basis, instead of with every damn serialization.
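To make the "types are implicit" point concrete, here is a minimal sketch using Python's struct module as a stand-in for an XDR-style encoding. The schema string, the sample values, and the describe/read_description helpers are all made up for illustration; real XDR has its own type system and padding rules.

```python
import struct

# Hypothetical schema agreed on out-of-band by both ends:
# a big-endian 32-bit int followed by a 64-bit float.
SCHEMA = ">id"

# 12 bytes on the wire, with no type information inside them.
payload = struct.pack(SCHEMA, 42, 1.5)

# A receiver that knows the schema recovers the values.
right = struct.unpack(SCHEMA, payload)

# A receiver with the wrong schema (three 32-bit ints, also 12 bytes)
# gets plausible-looking garbage and no error at all.
wrong = struct.unpack(">3i", payload)

# Type exchange "on an as-required basis": ship the schema itself as data,
# once, instead of tagging every serialized value.
def describe(schema: str) -> bytes:
    """Serialize a schema string with a 4-byte length prefix."""
    encoded = schema.encode("ascii")
    return struct.pack(">I", len(encoded)) + encoded

def read_description(blob: bytes) -> str:
    """Inverse of describe()."""
    (length,) = struct.unpack_from(">I", blob, 0)
    return blob[4:4 + length].decode("ascii")

negotiated = read_description(describe(SCHEMA))
checked = struct.unpack(negotiated, payload)

print(right, wrong, checked)
```

The schema travels once; every payload after that stays as compact as the untyped version.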
In Liberty, Rene
You... You have GOT to be new here.
True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
Obviously, those at Google felt XML didn't work well for them. They have the resources to invent a protocol and libraries to support it. And, they are big enough to be their own ecosystem, which means as long as everyone at Google is using their formats, interop is no biggie. Good for them, I don't begrudge that decision.
I'm actually a game developer, not a web developer, so I'll speak to XML's use as a file format in general. Here's a few points regarding our use of XML:
* We only use it as a source format for our tools. XML is far too inefficient and verbose to use in the final game - all our XML data is packed into our own proprietary binary data format.
* We also only use it as a meta-data format, not a primary container type. For instance, we store gameplay scripts, audio script, and cinematic meta-data in XML format. We're not foolish enough to store images, sounds, or maps in a highly-verbose, text-based format. XML's value to us is in how well it can glue large pieces of our game together.
* All our latest tools are written in C# and using the .NET platform (Windows is our development platform, of course). It's astoundingly easy to serialize data structures to XML using .NET libraries - just a few lines of code.
* Because it's a text-based format and human readable, if a file breaks in any way, we can just do a diff in source control to see what changed, and why it's breaking.
I'll make a concession that I've heard of some pretty awful uses of XML. But those who dismiss XML as a valuable tool in the toolchest are as foolish as those who believe it's the end-all and be-all of programming (I'm not saying that's true of you, just pointing out foolishness on both sides). Like any tool, it's most valuable when used in its optimal role, not when shoehorned into projects as a solution to everything.
Irony: Agile development has too much inertia to be abandoned now.
http://brad.livejournal.com/2387105.html
.. from things like YAML and JSON?
I think it makes me look an order of magnitude smarter, yes.
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
You're not really going to see the benefits of Perl in one month. It's not a very straightforward language like that.
Of course it's not new. It not only looks like ASN.1, it actually is very much like ASN.1. But to me it looks more like an extension of rpcgen, because ASN.1 came with a lot of other baggage. Of course, both rpcgen and asn.1 are just the best known implementations of ideas that were developed far earlier. Shannon's book on information theory explains just this sort of prefix code. These kinds of prefix codes have been in use since the 1960s, and code-generators have been around since the 1970s.
I think the reason that some people at google think it's new is because they are all young. Young people are constantly coming up with "new" ideas that are really two decades or more old. The idea seems new to the young person because he/she has not seen it before. That isn't a jab at google, or at young people. It is just a fact that everything seems new until after you have seen it before.
I have my own data format that is an alternative to XML as well. It works by normalizing the data into records which all contain the same number of fields, and placing an agreed-upon delimiter between each field. The end of the record is indicated by a newline.
I think this "delimited" format has a lot of potential.
Tired of FB/Google censorship? Visit UNCENSORED!
"then compile them to produce classes to represent those structures in the language of your choice"
That's not entirely true, but I digress. Anyway, can someone shed some light on how this is different than binary serialization I've been using to pass C# objects around for quite some time now? It is just a matter of giving the class a Serializable attribute and then using the BinaryFormatter class to serialize the object to a stream. XML serialization is available if needed to pass to non-M$ entities, but binary serialization has been around a while, no?
Our systems ( application servers, frameworks, data serialization and unserialization facilities etc ) understand/support XML-RPC, binary XML-RPC, JSON and PHP object notation (ref: PHP serialize() / unserialize() functions).
There is a set of primitives ( string, integers, floats, arrays, structures, timestamps, raw data ) which are the only datatypes developers utilize; the serialization process is transparent. That way, depending on the nature and capabilities of the systems involved in a data interchange procedure, the more efficient transport protocol is utilized ( for example JSON when interfacing with an application server from a Javascript application, XML-RPC when talking to remote XML-RPC servers etc ).
It's probably more important to decide upon a list of primitive constructs and always operate on that kind of datatypes as opposed to figuring out the ideal way to store and retrieve them. You can always come up with an even better way to encode that data later on, anyway.
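A rough sketch of that idea, using only codecs from the Python standard library: one record built from the shared primitives, pushed through two interchangeable transports. The record contents and the CODECS table are made up for illustration and are not the poster's actual framework.

```python
import json
import xmlrpc.client

# One value built only from the agreed primitives (string, int, float, array, struct).
record = {"name": "John Doe", "age": 42, "scores": [1.5, 2.0]}

def to_json(value):
    return json.dumps(value).encode("utf-8")

def from_json(data):
    return json.loads(data.decode("utf-8"))

def to_xmlrpc(value):
    # xmlrpc.client.dumps expects a tuple of parameters.
    return xmlrpc.client.dumps((value,), methodresponse=True).encode("utf-8")

def from_xmlrpc(data):
    params, _method = xmlrpc.client.loads(data.decode("utf-8"))
    return params[0]

# Hypothetical registry: the caller picks a transport by name and never
# touches the encoding details.
CODECS = {"json": (to_json, from_json), "xml-rpc": (to_xmlrpc, from_xmlrpc)}

for name, (encode, decode) in CODECS.items():
    wire = encode(record)
    assert decode(wire) == record   # same primitives survive either transport
    print(name, len(wire), "bytes")
```

Adding a denser codec later means adding one more entry to the table; the code that builds and reads the record does not change.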
Technology ramblings : Simple is Beautiful
XML is good because it promotes keeping two very different problems orthogonal: structuring data and encoding it. Tying these two problems together creates complications down the road, as entangling independent areas of information processing and algorithm sciences usually does.
Thankfully an alternative to XML.
If you didn't think XML was among the least efficient transport formats then you weren't really paying attention. Battery-conscious mobile devices do not really enjoy parsing an XML DTD and then the XML file itself.
It reminds me a little bit of AOL's SNAC message types.
We get something good for the industry from Google, after a rash of bad press, and it is actually NOT a beta.
Kriston
I agree, the "order of magnitude" note sounds more like a media-bite than anything, but here are a few points to consider.
1.) The example they give is for a small set of data, and percentages vary more dramatically as sample sizes decrease.
2.) If your usage generally involves many such small sets of data, the benefits of slight reductions in latency will multiply significantly.
3.) Even if the speed performance is identical to XML, the reduction in data size should not be ignored, especially in large-volume production environments.
4.) This is not a small internal tool Google is releasing -- this is a major component which they have heavily used. It has been real-world tested (albeit at just one company) and proven at ridiculous scales.
5.) They are giving this away. Source, documentation, examples -- the works. I know this isn't driven entirely by altruism, but neither is it an embrace-extend-extinguish maneuver. They just made a tool that meets a specific need better than what was currently available, and then made it available.
You might call Microsoft the Rails-To-Trails Conservancy of the software industry. Use it until it outlasts its usefulness and then release it to the rest of the world for no charge.
Sorta.
Kriston
Man, the BetaNews article is horrible. Practically everything — except for the direct quotes from the Google blog post — is incorrect. I somehow expect more from someone who goes by "Scott M. Fulton, III".
Nope, they're conceptually the same. The ".proto" files are like DTD or XSD. The actual document data is stored in a binary format (though there's also a text representation). The data manipulation API is similar to what you get from Castor or JAXB.
The "= number" at the end of a field definition is not a "default value". It is a numeric tag that identifies that field. That said, "= number" is quite unintuitive syntax; maybe something like "@number" would have been less confusing.
Looking at some of the documentation, I don't think the aforementioned numbers directly index the field's location in the record. They lay down the present fields one after another, probably putting each field's tag number before the field data. This also allows them to avoid sending fields that use the default value. So they still need to specify how long each record is — either with "fenceposts" between records or a "length" specifier before each record.
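For what it's worth, that guess matches the published encoding docs as I read them: each field is written as a key (field number shifted left three bits, OR'd with a wire type), and strings get a length prefix. Here is a minimal decoder sketch under that reading. It handles only wire types 0 (varint) and 2 (length-delimited), and the field numbers 1 and 2 for name and email are assumed, not taken from any real .proto file.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7

def parse_fields(buf: bytes):
    """Yield (field_number, value) pairs from a flat message."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:           # varint payload
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:         # length prefix, then raw bytes
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        yield field_number, value

# Hand-built bytes: field 1 = "John Doe", field 2 = "jdoe@example.com".
message = b"\x0a\x08John Doe\x12\x10jdoe@example.com"
fields = list(parse_fields(message))
print(fields)
```

Note that the parser just runs until the buffer ends. The message itself carries no overall length, so whatever contains it (a file record, an RPC frame) has to supply the "fenceposts".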
Granted that a format can't have a speed and neither can an Internet connection. An Internet connection can have an amount of bandwidth but things still go the same speed. The end result though is more data gets there FASTER. In the same vein if you mix in some data with a bunch of garbage then it will take longer to see the data. This is the point. This is why XML is slow. Stop with the semantics and see the forest for the trees.
Perl is to programming languages what English is to natural languages: easy to fool around with, hard to learn well, but when you do, the expressive power is incredible. And when you mess it up, nobody understands what you're trying to say.
The similarity between these things and NeXT's Property Lists (called "Old-School Property Lists" now that Apple/NeXT has standardized on XML) is incredible. Some things are changed, like having a specification instead of just assuming that the recipient will parse it and figure it out, but the likeness is there. I wonder if any of the proto people at Google had experience with plists, or if it's just a case of convergent design.
Everything old-school is new-school again, I guess.
In the document, google showed one case of XML
<person>
<name>John Doe</name>
<email>jdoe@example.com</email>
</person>
However, at the company where I used to work, we required such a file to be written something like
<person
name="John Doe"
email="jdoe@example.com"/>
In the Google protocol buffer format it will be
person {
name = "John Doe"
email = "jdoe@example.com"
}
I failed to see why the protocol buffer format is much smaller and faster.
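Worth noting: the brace syntax above is the human-readable text form, while the size claim is about the binary encoding. A quick sketch to compare byte counts, assuming name is field 1 and email is field 2 (those numbers are my assumption) and hand-rolling the encoding rather than using Google's library:

```python
def varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)   # more bytes follow
        else:
            out.append(low)
            return bytes(out)

def string_field(field_number: int, text: str) -> bytes:
    """Key (field number + wire type 2), length prefix, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return varint((field_number << 3) | 2) + varint(len(data)) + data

# The two XML spellings from the comment above, with whitespace stripped.
xml_elements = "<person><name>John Doe</name><email>jdoe@example.com</email></person>"
xml_attributes = '<person name="John Doe" email="jdoe@example.com"/>'
binary = string_field(1, "John Doe") + string_field(2, "jdoe@example.com")

sizes = {
    "xml (elements)": len(xml_elements),
    "xml (attributes)": len(xml_attributes),
    "binary": len(binary),
}
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

The field names never hit the wire in the binary form, only one-byte keys, so the overhead is four bytes on top of the 24 bytes of actual string data. The attribute style narrows the gap but still spells out every name in every record.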
There is a spark in every single flame bait point.
ASN.1, from 1985, really is very similar. Here's a message defined in ASN.1 form:
Note that this has almost exactly the same feature set as Google's representation. There are named, typed fields which can be optional or repeated. It just looks more like Pascal, while Google's syntax looks more like C.
All that work! How sad I am that we must reschedule the Web Services Choreography Working Group to consider studying XML's replacement, Protocol Buffers.
I looked at some of their binary interchange format. It looks like a valid Perl program to me! *rimshot*
Still waiting for Perl to make use of the Euro key as an operator...
Exactly my feeling, I'm so tired of seeing XML used in places YAML is perfect for.
The linked article does not really hit the nail on the head on what's so great about Protocol Buffers or why it should be faster. In an article linked from the linked article, there's a better explanation:
Instead, we developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple "get" and "set" methods, and once you're ready, serializing the whole thing to -- or parsing it from -- a byte array or an I/O stream just takes a single method call.
So as I read that, the methods of accessing a table are not generic to the database but are actually individually optimized to the data itself. That is, the accessors know the structure rather than having to discover it from the markup. Presumably the code that rides around with the objects is free to contain its own metadata, caches and pre-parsing of the records for optimization. Yet from the outside it's just a bunch of get-set methods to provide uniform encapsulation.
My guess is the meta-data all totalled is less than all the wasted space in the XML fenceposts, plus by encapsulating they are free to compress the actual data when it makes sense.
Anyhow, to all you XML folks: stop picking up the XML crescent wrench and trying to use it as a hammer. Reach for the YAML.
Some drink at the fountain of knowledge. Others just gargle.
Anyway, can someone shed some light on how this is different than binary serialization I've been using to pass C# objects around for quite some time now?
It's portable and language-independent?
YAML
Some drink at the fountain of knowledge. Others just gargle.
Whoa, Mr. AC there, that was extremely helpful.
BTW, it's Perl, not PERL.
Pray tell us, why should I heed someone's opinion on the language when he can't even spell its name correctly?
Funny, I'm tired of seeing YAML in places where XML would work fine.
Like serializing my Ruby objects, for example. When I don't care about performance, XML is best, because almost everything else will read and write it, including my text editor, and I know the syntax. When I *do* care about performance, I'm not going to use YAML either.
I don't see the niche YAML fits, frankly.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
XML is crappy format
That statement underlines most people's myopic vision of the XML family of technologies. XML is not a format; it is a family of technologies based around a common grammar.
:-) XML was designed for situations where the representation needs of the client are unknown and/or dynamic.
XML is not a bucket.
It is not a passive container for data.
It is a transformable semantic graph.
The heart and soul of XML is XSLT; it serves as a common 'glue' that allows the transformation between the various standardized 'languages': XML, XHTML, XSLT, XSL-FO, SVG, RDF, RSS, etc.
Example: the same XML document (let's say it represents rows in a database) can be transformed into a web page, PDF file, visual graph, RSS feed, directed graph, or [insert non-XML text based output of choice]. More importantly, the transformation can take place on the client side of a transaction, effectively decoupling content and representation.
That being said, I completely agree that XML is overkill for simple fixed message passing. But then again, simple fixed-format message passing isn't what XML was really designed for.
--
If you don't know XSLT you don't know XML
If Google had tried to build their system on relational databases, XDR, and NFS, they would have spent huge amounts of money and spent lots of time trying to shoehorn their software into those constraints. And it's not just Google that did this: Amazon did the same thing, with their SimpleDB, S3, and SQS.
The actual mistakes were relational databases, XML, and distributed POSIX file systems; all of those were systems designed by people with too much time on their hands and no real-world, large scale problems to solve. Finally, those mistakes are getting corrected, at least when it comes to high-end computing. At the low end, I suppose people will continue to tinker around with those toys.
Hi, take a look at http://code.google.com/p/protobuf/ and http://code.google.com/apis/protocolbuffers/docs/reference/overview.html for details about what it is that's being offered. It's not the format per se that's being released, it's the software that allows you to use it in your own applications.
Really? In that case, I am defining a format for specifying series of integers. Here's how it works: for every integer, you find the corresponding prime (eg. for the number 5 you find the 5th prime). Then for every pair of numbers, you multiply them together, and emit the product into the output file. To parse, all you have to do is find each product and factor it into its two primes.
According to you, my format cannot have a speed. It is a format; it has no speed. So please write me a parser that parses my format into the original integers and is comparably fast to other formats (on a byte-by-byte basis).
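Here is a small sketch of that prime-product format, to show where the cost lands. Two caveats: the description doesn't say how order within a pair survives (multiplication is commutative), so this decoder returns each pair smallest-first, and it assumes positive integers in even-length lists. All function names are made up.

```python
def nth_prime(n: int) -> int:
    """Return the n-th prime, 1-indexed (nth_prime(5) == 11), by trial division."""
    count, candidate = 0, 1
    while count < n:
        candidate += 1
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return candidate

def prime_index(p: int) -> int:
    """Inverse of nth_prime for a prime p (deliberately naive)."""
    n = 1
    while nth_prime(n) != p:
        n += 1
    return n

def encode(numbers):
    """Pair up the integers and emit one product of primes per pair."""
    pairs = zip(numbers[0::2], numbers[1::2])
    return [nth_prime(a) * nth_prime(b) for a, b in pairs]

def decode(products):
    """Factor each product back into two prime indices (order within a pair is lost)."""
    out = []
    for product in products:
        d = 2
        while product % d:           # smallest divisor is the smaller prime
            d += 1
        out.extend([prime_index(d), prime_index(product // d)])
    return out

encoded = encode([5, 3, 2, 7])
print(encoded, decode(encoded))
```

Encoding is a handful of multiplications; decoding is integer factorization. No amount of parser cleverness changes that for large inputs, which is the point: the format itself fixes a floor on how fast reading it can be.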
Uh, no. Google officially deems perl unmaintainable, and its internal use is completely verboten.
You're quite welcome to write your own if you want it, but it's not something we'd ever use ourselves.
I think this is actually the fastest, most compact way possible to encode information. It all depends on how good the compilers are. What they've done is replace a generalized system like XML with custom-written code that works only for the specific messages that are being passed. Nothing could be faster. The objection has always been that writing the custom code is hard. They have solved that issue.
The big difference is that a protocol buffer cannot be understood without the message format (.proto file). Now let's actually take a look at a real list, like say the developers for Apache (as a list of {name:,email:} objects):
protobuf: ~1654 bytes
json: 1915 bytes
protobuf.lzop: ~744 bytes
json.lzop: 809 bytes
What you see is precious little difference in the size of the data even though the json is self-describing. The lzop version is essentially identically sized, and compressing and decompressing with lzo is wicked fast. So size is not a reason to use proto buffers.
Maybe speed is? Instead of using lzo compression just create a JSON binary format. This is trivial, and provides essentially the same size and speed benefits as protocol buffers while still being JSON in nature.
The only advantage to protocol buffers then is that they generate access and verify classes for you in your favorite language (if that language is C++, Java, or Python). Big deal, again this is absolutely trivial.
To me what this demonstrates is premature optimization. Instead, first use a simple text format like JSON then if that is too large compress it. Then if that is too slow send it in binary.
Note: I approximated the size of the proto buffers based on the descriptions of the binary format since I haven't downloaded the code (it actually compresses less well since I did not vary the 'length' bytes in my test file).
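Anyone who wants to rerun this kind of comparison can do it in a few lines. The sketch below is not the poster's test: the records are invented (not the Apache developer list), zlib stands in for lzop because lzop isn't in the Python standard library, and the binary side is a hand-rolled approximation of the wire format with assumed field numbers. The numbers it prints say nothing about the figures quoted above.

```python
import json
import zlib

def varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)

def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Key with wire type 2, length prefix, then the payload."""
    return varint((field_number << 3) | 2) + varint(len(payload)) + payload

def encode_person(name: str, email: str) -> bytes:
    # Assumed field numbers: name = 1, email = 2.
    return (length_delimited(1, name.encode("utf-8"))
            + length_delimited(2, email.encode("utf-8")))

# Invented sample data, purely for illustration.
people = [{"name": f"Developer {i}", "email": f"dev{i}@example.org"} for i in range(50)]

as_json = json.dumps(people, separators=(",", ":")).encode("utf-8")
# Outer message: a repeated, length-delimited field 1 holding each person.
as_binary = b"".join(
    length_delimited(1, encode_person(p["name"], p["email"])) for p in people
)

report = {
    "json": len(as_json),
    "binary": len(as_binary),
    "json+zlib": len(zlib.compress(as_json)),
    "binary+zlib": len(zlib.compress(as_binary)),
}
for label, size in report.items():
    print(f"{label}: {size} bytes")
```

Swap in real data before drawing conclusions; synthetic names like these compress unrealistically well.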
XML still wins for typing by hand, I reckon. Tags are easier to type than holding down the spacebar or trying to get your editor to expand tabs to spaces for YAML files but not for every other file in your project.
this issue was solved by Barney Rubble about the time Jesus was still riding his dinosaur.
Come join us in the 21st century.
Some drink at the fountain of knowledge. Others just gargle.
Yeah, this sort of thing goes on at lots of places, as you imply, not just Google. I think there's also a "cool" factor--when you work at a cool company where half the employees are drinking the "we're all geniuses" kool-aid like Google (and there are lots of others, I don't mean to single out Google either), you may know about existing alternatives and decide you can do better, and the time investment to build and maintain that technology is justified.
Tagged "notinventedhere", as NIH syndrome is the name I first heard for this anti-pattern.
Here are 160 FOSS projects Google has released: http://code.google.com/hosting/projects.html. Heck, even Google's new big thing, AppEngine, is OSS.
That's a pretty good amount of code. What else do you want them to release? Their search engine? Gmail?
Pluralitas non est ponenda sine neccesitate
Oh, and let us not forget http://code.google.com/android/
Pluralitas non est ponenda sine neccesitate
Yet Another Interface Definition Language...
What's wrong with XDR?
3/4 of the companies I've worked for had some engineer who had to unroll his own RPC format with matching IDL for some "technical" reason that had no basis in reality. They were all pretty crummy implementations. All you hot shot engineers, please, just stop re-inventing the wheel.
It's easy to say XML is slow... no one ever planned it to be fast! The reason for XML's existence is to be human readable (especially by people who are used to reading HTML). That's it. People expecting it to be fast are using the wrong tool for the job.
BDAT format lives!
Have gnu, will travel.
The idea is not new, but the fact that Protocol Buffer takes a more C-like syntax as opposed to ASN.1 (more Pascal- or Fortran-like) appeals to software developers in this generation who started learning programming with C or Java. Besides, Protocol Buffer has great integration with C++, Java, and Python. When it comes to data serialization formats, it's really the implementation that counts rather than the idea, and they have a nice implementation.
I once had a signature.
OK, CORBA IDL and IIOP have some quirks, but they work very well. There are excellent open source implementations like JacORB, TAO or IIOP.NET that interoperate very well with each other or J2EE. Google could have been compatible with all this instead of going their own way.
It's obvious that the developers were familiar with the flaws of older protocols, and found ways to fix most of them.
Maybe, but no mention is made of ASN.1, which to me suggests a lack of historical awareness. I would have appreciated a comparison. A comment from Kenton Varda, attached to the announcement blog post, reads in part:
Sorry, I personally am not very familiar with ASN.1 DER. A brief look at some documentation suggests to me that it is more complicated than Protocol Buffers, which can be good or bad depending on whether you need that complication.
Plus, writing a decent XML parser is easy!
When the policeman of the tie, rule you violate, hello punishment of the kitty?
If it is pre-processed to produce C++ code, Java code, etc., it should be possible to do the same with XML without affecting size and speed. It is time to come up with an "XML to protocol buffers" converter (and back).
Hush! If you stay really really quiet you may just be lucky enough to spot another spelltard in its native environment.
Caesar si viveret, ad remum dareris.
Well now XML is wedged between YAML on the low end (e.g. config files, human readable data, ad hoc files) and ProBuff on the high end (massive structured data bases).
Individual PBs are meant to be fairly small and decode into memory in a single class/struct. The scale issue is mainly about having billions of these messages, and not wanting to overpay for storage in a less efficient format, or network bandwidth for moving them around.
Do you know of any decent open source ASN.1 code generators to compare with these google protobufs?
--jeffk++
ipv6 is my vpn
Yeah, and I'd like this for the .NET CLR and Mono as well. I looked at the code and the generators are not that complicated, maybe I'll give it a shot over the weekend.
Some people have already mentioned that on the Google Group, so it's probably a good idea to go there and compare ideas / combine efforts.
Does Google accept outside contribs for projects like these?
Yes, but I think the guy running it is encouraging separate projects that are then pushed upstream. Internally, this is very mature software, so the release cycle won't be as fast as some people will want, at least while it ramps up.
Parse error: I'm sorry, but your comment should have begun with "Crikey", not "Hush".
Censorship is telling a man he can't have a steak just because a baby can't chew it. --Mark Twain
But, but .. is it compliant with sections 2.1 and 2.12 of RFC1925?
-- Reality checks don't bounce.
Google elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more."
Christ, I hope I'm never in an elevator with someone who would consider THAT an elevator statement.
---"What did I say that sounded like 'Tell me about your day?'"---
Not surprising either. I have done similar things several times. Still, it is nice to find this augmented with a reasonable binary encoding and cross-platform (to some degree) library support.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
That being said, it is still a very nice scripting language to do data parsing and integration on the fly.
Sig it.
I always find it amusing when people use the Web to sulk about the faults of XML.
Anyway, this is what Google has to say:
However, protocol buffers are not always a better solution than XML - for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also - to some extent - self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).
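To make the "not self-describing" point concrete, here is a minimal sketch (not Google's library) that walks a protobuf-encoded byte string without any .proto file. It relies only on the documented wire format: each field starts with a varint key equal to (field_number << 3) | wire_type. The function names and the sample bytes are made up for illustration, and only wire types 0 (varint) and 2 (length-delimited) are handled.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry data
        if not byte & 0x80:               # high bit clear: this was the last byte
            return result, pos
        shift += 7


def dump_fields(buf):
    """List (field_number, wire_type, raw_value) tuples without a schema."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


# Hypothetical message: field 1 = 150 (varint), field 2 = b"hi" (length-delimited).
message = bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69])
print(dump_fields(message))  # [(1, 0, 150), (2, 2, b'hi')]
```

All you get back is field numbers and raw values. Whether field 1 is an age, an ID, or a price, and whether field 2 is text or a nested message, is only knowable from the message definition.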
Uh, no. Google officially deems perl unmaintainable, and its internal use is completely verboten.
That is so cool...
Bow-ties are cool.
The way Protocol Buffers encode data is similar to how I have designed protocols in the past. But, of course, there are various ways to encode data. Some examples:
- Encode your data in human-readable form, using separators. E.g. s-expressions, CSV, XML, JSON.
- Encode your data using type, length, value tuples. E.g. some encodings in ASN.1.
- Encode your data using type, value (and length, if necessary) tuples. E.g. some encodings of ASN.1.
Then there are choices as to adding indices to quickly jump to interesting parts of the data or skip over uninteresting parts, mechanisms for forward compatibility, etc.
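For anyone who hasn't built one of these, here is a rough sketch of the type-length-value idea. The layout (1-byte tag, 2-byte big-endian length, then the value) and the tag numbers are arbitrary choices for this example, not taken from ASN.1 or Protocol Buffers. The point is that the length prefix lets a reader skip records it doesn't care about, or doesn't recognise, without parsing them.

```python
import struct

HEADER = ">BH"                      # 1-byte tag, 2-byte big-endian length
HEADER_SIZE = struct.calcsize(HEADER)


def tlv_encode(records):
    """Encode an iterable of (tag, value_bytes) pairs as consecutive TLV records."""
    out = bytearray()
    for tag, value in records:
        out += struct.pack(HEADER, tag, len(value)) + value
    return bytes(out)


def tlv_find(buf, wanted_tag):
    """Return the value of the first record with wanted_tag, skipping all others."""
    pos = 0
    while pos < len(buf):
        tag, length = struct.unpack_from(HEADER, buf, pos)
        pos += HEADER_SIZE
        if tag == wanted_tag:
            return buf[pos:pos + length]
        pos += length               # jump over a record we are not interested in
    return None


encoded = tlv_encode([(1, b"alice"), (2, b"Oslo")])
print(len(encoded))                 # 15
print(tlv_find(encoded, 2))         # b'Oslo'
print(tlv_find(encoded, 9))         # None
```

A separator-based format has to scan every byte looking for delimiters (and deal with escaping), whereas here an old reader can step over a tag added by a newer writer, which is one simple mechanism for forward compatibility.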
I wonder if anyone has done a comparison of various techniques that can be used in data interchange formats and their impacts on message size, parsing performance, and other interesting quantities.
Please correct me if I got my facts wrong.
Binary encodings, non-hierarchical string lists, and simple file serialization are all faster than XML.
XML was created for flexibility, commonality and human readability, not speed. XSL, XQuery, and XPath, along with the DOM or SAX, supply out-of-the-box query, transformation, and manipulation capability.
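As a small illustration of the "out of the box" part, here is a sketch using Python's standard-library ElementTree, which supports a limited XPath subset. The document and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
doc = ET.fromstring(
    "<people>"
    "<person id='1'><name>Ada</name></person>"
    "<person id='2'><name>Linus</name></person>"
    "</people>"
)

# All names, via a simple path expression.
names = [p.findtext("name") for p in doc.findall("./person")]
print(names)        # ['Ada', 'Linus']

# One specific record, selected with an attribute predicate.
match = doc.find("./person[@id='2']/name")
print(match.text)   # Linus
```

No schema compiler and no generated classes are involved; the query works on any well-formed document, which is the trade-off being described: generality and tooling in exchange for speed and size.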
The thing about "human readability" is that, just like any other binary file format (ASCII text is a binary encoding too, remember) it is not intrinsically human-readable, rather it relies upon a proper set of tools to make it human-readable.
The counter-argument here is that, while that's true enough, just about every tool in the world can read ASCII files, right? From Blender to Emacs to a simple paginator like "more"...
But except in simple cases that's not sufficient to actually work with the data. In XML for instance one would ideally like a structural representation of the data, the ability to hide a block at a time to streamline the display, etc. Or if editing you'd at least want simple validation features, maybe the ability to match opening tags with closing ones, etc... In theory you can work the data over in any text editor but in practice you would use something more specialized.
If the same specialization is made available for a compact binary format, then it'll be every bit as "human-readable" as an ASCII-encoded one.
Bow-ties are cool.
If anything, I was trying to praise Perl. Guess the mods around here just have a twisted sense of humor.
With XML and .NET today, I can write XmlSerializer(typeof(MyClass)).Serialize(myObject), and get it all for free, both ways, with validation and versioning. It keeps my code short and clear. So long as it's not a performance bottleneck, why should I bother with anything else?
I noticed a posting in the discussion group questioning the use of little-endian "on the wire" rather than big-endian (network byte order). The response was that most of Google's computers were little-endian, which seems pretty short-sighted.
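For readers who haven't run into this, a quick sketch of what the byte-order choice means for a fixed-width 32-bit integer, using Python's struct module. The value 1 is just an example.

```python
import struct

value = 1

little = struct.pack("<I", value)   # little-endian: least significant byte first
big = struct.pack(">I", value)      # big-endian: most significant byte first
network = struct.pack("!I", value)  # "network byte order" is big-endian

print(little.hex())   # 01000000
print(big.hex())      # 00000001
assert network == big

# Reading with the wrong assumption silently gives a different number.
misread = struct.unpack("<I", big)[0]
print(misread)        # 16777216
```

Either order works as long as both ends agree. The cost being argued about is only the byte swap on machines whose native order differs from the wire order.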
XML's not really good at being human-readable, either.
About the only thing XML is really good at is "being widely supported by available tools", which is often quite important.
Wow, define a datatype, and generate code snippets to manage it. Fantastic! And that's news!?
It seems people are just too busy reading new acronyms to learn computer programming fundamentals. Go figure!
But... hey! That's from Google, it must be great somehow... Bullshit!
What's in a sig?
Cool, thanks.
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
Whoopee I can import a library too. The point is if you're never going to have to look at the serialized object then it makes no difference how you serialize it. But if you are ever going to look at it or hand edit it (e.g. a config file, document header, debug report, automatic mail parsing, .... ) then YAML is the right choice. Use the right tool.
Some drink at the fountain of knowledge. Others just gargle.
You missed my point. XML has very good libraries available for virtually every platform and language out there, and, in most cases, they come as part of the base library. YAML, on the other hand, only comes with Ruby.
In addition to C libraries, Bindings for YAML exist for the following languages:
* Perl
o YAML:: is a common interface to several YAML parsers.
o YAML::Tiny implements a useful subset of YAML; small, pure Perl, and faster than the full implementation.
o YAML::Syck Binding to SYCK C-library. Offers fast, highly featured YAML
o YAML::XS Binding to LibYaml. Better yaml 1.1 compatibility.
* PHP
o Spyc is a pure PHP implementation
o PHP-Syck (binding to SYCK library)
* Python
o PyYaml Highly featured. Pure Python or optionally uses LibYAML.
o PySyck Binding to SYCK C-Library
* Ruby (YAML included in standard library since 1.8. based on SYCK)
* Java
o jvyaml based on Syck, and patterned off ruby-yaml
o JYaml pure Java implementation
* R (programming language)
o CRAN YAML based on SYCK
* JavaScript
o native Java script emits but does not read YAML
o YAML-Javascript emitter and parser
* .NET Framework
o project page
* OCaml
o OCaml-Syck
* C++
o C++ wrapper for libYaml
* Objective-C
o Cocoa-Syck
* Lua
o Lua-Syck
* Haskell
o Haskell Reference wrappers
Some drink at the fountain of knowledge. Others just gargle.