New Binary Diffing Algorithm Announced By Google
bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."
can suck my diff!
courgette (kr-zht)
n. Chiefly British
A zucchini.
An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?
Google has to pay the cost for maintaining servers and handling bandwidth for all the OS updates they push out. The more efficient they are in this process, the more money the save.
The good news is that the same benefits could be applied to Red Hat, Ubuntu, openSUSE, etc. Lower costs helps the profitability of companies trying to make a profit on Linux.
The end users also see benefits in that their packages download quicker. I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
Would this algorithm be useful to the rsync team?
The cool thing is, one can easily extend this to other executable formats, as long as the assembler is readily available client-side: Windows users could relate to those pesky, resource-hogging Java updates, and .NET and Mono applications could similarly benefit.
This is, interestingly, the second binary diffing innovation that affects me in the past few months. Fedora just turned on delta updates with Fedora 11, a feature borrowed from the openSUSE folks.
Michel
Fedora Project Contribut
announced a new compression technique called Courgette geared towards distributing really small updates
I just RTFA, this has nothing to do with a compression technique.
What they developed is a technique to make small diffs from *executable binary files* and it doesn't look like it's portable to anything other than x86 because the patch engine has to embed an architecture specific assembler + disasembler.
I remember having this idea a while back and thinking that it would surely be out there already implemented. When it wasn't, I didn't follow through and try to invent it myself. I was focused on another project at the time. Never really came back to that original idea. Awesome that someone did, and that they are keeping it open so everyone can use it! Thanks, Google!
Any takers that Microsoft will release their own version of this, but compile the assembly before sending it over the wire?
Maybe they can call it "Compiled Assembly from Disassembly" (CAD).
That's potentially very misleading. I can compress any document, down to a single but, if my compression algorithm is sufficiently tailored to that document. For example:
if (compressed_data[0] == 0):
return = get_Magna_Carta_text()
else:
return unzip(compressed_data[1:])
What we need to know is the overall distribution of compression ratios, or at least the average compression ratio, over a large population of documents.
It seems to me that the moniker 'binary diff' doesn't really apply here, does it? My understanding of a binary diff algorithm/program has always been one that can take any arbitrary binary file, not knowing anything about what *type* of file it was (e.g. the file could be an image file, video file, executable, word document, zip file, CAD file, or whatever) and create a diff for that binary 'data'.
It sounds like this is very specific to x86 executables?
Still, whatever you call it, it's good to see progress being made. I just wonder why you can't create small, efficient diffs for any kind of binary file?
Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates today.
Better hurry! It won't work tomorrow!
Comment of the year
...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".
The idea is simple, and reminds me of something I learned in school regarding signals. Some operations are easy to perform in the frequency domain, so you do the Fourier transform, perform the operation, and then transform back.
This is really just the same idea applied to the problem of patches. They're small in source; but big in binary. It seems so obvious that you could apply a transform,patch,reverse process... but only when pointed out and demonstrated.
It's almost like my favorite invention: the phonograph.
The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.
Are the inventors that brilliant, or are we just that stupid.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
A better binary diffing algorithm is useful for source control. But for security? If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.
It sounds like Google tried "agile programming" on trusted code, and now has to deal with the consequences of debugging a pile of crap.
How dare people ask questions. Nothing good ever came from doing that!
If you have something that you dont want anyone to know, maybe you shouldnt be doing it in the first place -Eric Schmidt
There was a old, pre-X MacOS binary-diff program that diffed each element of the resource fork independently.
Back in those days, non-code elements usually were stored in unique "resources," one icon might be in ICON 1, another might be in ICON 2, etc.
Code was split among one or more resources.
If a code change only affected one section of code, the only 1 CODE resource would be affected.
Since each resource was limited to 32KB, a diff that only affected one resource would never be larger than twice that size.
If only a single byte changed, the diff was only the overhead to say what byte changed, the old value, and the new value, much like "diff" on text files only on a byte rather than line basis.
So, conceptually, this isn't all that new.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Sounds like a old horny midget woman.
Am I the only one who misread "Courgette"? :)
If you're not familiar with the process of binary diff (I wasn't) there's a paper linked from the article that explains some about bsdiff:
http://www.daemonology.net/papers/bsdiff.pdf
Wayback from 2007/07/09:
http://web.archive.org/web/20070709234208/http://www.daemonology.net/papers/bsdiff.pdf
Billy Brown rides on. Yolanda Green bypasses Gary White.
Please?
Utilizing the synergization of benchmark e-solutions to pre-workaround action items!
A bit related, Microsoft have provided at binary patch API for years that can take advantage of the internal structure of .EXE files to achieve better compression. It also does multi-way patching so you can specify n source files and one target file... the patch will contain enough information to patch any of those n source files up to the target file.
I mean, whats the need, really?
Cant we all be merry and unixize our minds and think: hey, all components of the sw are files, lets just separate the heavy parts that are less prone to any security issue (GUI), from all the libs and executables, and lets just send those through the wire as a package....
We do that in linux and it works(TM): we package everything, have dependencies and then only update what needs to be updated without weird binary patching.
Now, not to say that its not usefull to have a new bindiff algorithm: lets plug it into git and svn, a specialized switch (--Gelf-exec-compression) in rsync also comes to mind, and that should make them more powerfull and all...
NO SIG
This is diffs the dissembled version of the original against the update on the server, then does the opposite on the client. I couldn't help but think of this as similar to Gentoo's model ... download a compressed diff of the source and then recompile. Both have the same problem: too much client-side CPU usage (though Gentoo's is an extreme of this). Isn't Google Chrome OS primarily targeting netbooks? Can such things handle that level of extra client-side computation without leaving users frustrated?
I'd rather improve the distribution model. Since packages are all signed, SSL and friends aren't needed for the transfer, nor does it need to come from a trusted authority. Bittorrent comes to mind. I'm quite disappointed that the apt-torrent project never went anywhere. It's clearly the solution.
Use my userscript to add story images to Slashdot. There's no going back.
Torrents are problematic in that they are contrary to the way many, if not all, ISP's are set up. Nodes are intended to be consumers, not data providers. Shifting the downloaded data to the UPSTREAM is against the intent of the architects and produces problems.
Redesign the network, sure. Until then, torrents for updates are only taking your problem and shifting it to someone with fewer resources (without permission.)
The concept of a Branch-Call-Jump (BCJ) filter is well-known in the data compression community, and is a standard part of quite a few deployed compression products. Used as a front end to a conventional compression algorithm -- or, in this case, a binary compression algorithm -- does indeed give significant improvements. The application to binary diff is particularly interesting, since it means you can deal with branches and other references *over* the compressed region, so this is really rather clever.
This is diffs the dissembled version of the original against the update on the server, then does the opposite on the client. I couldn't help but think of this as similar to Gentoo's model ... download a compressed diff of the source and then recompile. Both have the same problem: too much client-side CPU usage (though Gentoo's is an extreme of this). Isn't Google Chrome OS primarily targeting netbooks? Can such things handle that level of extra client-side computation without leaving users frustrated?
I don't think this is really a problem in this case. In the time even a slow computer, by today's standards, has downloaded a kilobyte over a WAN link it has easily performed millions of CPU operations on it. The same would be true for any kind of compression really. Since Bandwidth through your pipe is just orders of magnitudes slower than anything that happens within your machine, this added level of complexity is clearly more beneficial than a direct approach. That's why it makes sense to compress files on the server (or the client uploading it to the server), transfer them and decompress them on the client, even if the client is quite slow.
I'd rather improve the distribution model. Since packages are all signed, SSL and friends aren't needed for the transfer, nor does it need to come from a trusted authority. Bittorrent comes to mind. I'm quite disappointed that the apt-torrent project never went anywhere. It's clearly the solution.
With patches between minor versions at about 80kB (as stated in TFA), I don't think that a distribution using bittorrent would really be the way to go here. Add to this the fact that google has quite a lot of bandwidth at their disposal and I don't see this happening anytime soon.
I aggree however that it may be a good idea to transfer large amounts of linux packeges that way. But with a lot of smaller packages the protocol overhead of bittorrent might become a limiting factor regarding its usefulness.
If you're going to do this in the long run, it would make more sense to distribute LLVM byte code and compile that to native code on the host... That way you have small diffs to the byte code, and also you get targeted optimization (and maybe even custom profile guided optimization).
This is a classic trade-off; you can compress better by sacrifizing platfom-indepence. It has been done before, e.g, the ExeDiff tool for the DEC Alpha. The only news here is that its being done by Google. Interested parties should refer to: G. Motta, J. Gustafson, and S. Chen. "Differential Compression of Executable Code," Proceedings of the Data Compression Conference (DCC 07), Mar. 2007, pp. 103-112 for an overview of the available tools and how they perform.
Regards,
Jacob Gorm Hansen, author of the EDelta linear-time executable differ.
its not 'binary diffing'.
does it work with gif files? jpg? png? dng? mov? mpg?
its meant for PROGRAM binary files. of a specific type of processor, at that.
that's fine.
but please say so, and don't imply its binary. its really specific and not helpful for image (pic, sound) formats.
--
"It is now safe to switch off your computer."
Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates today.
That's a pretty small market- I can't imagine that there are many teams that will release really small updates today.
Of course, if you just generate a patch file with the source changes, and zip that up, it will be even tinier.
Except for that upgrade of the underlying FooLib from 2.1 to 2.2 that's part of your hotfix. Well, all you need then is to just get a patch file for that too, and include that.
And then compile the whole sucker on the other end.
Everyone's got plenty of CPU, right? And we're just about all using trivially decompilable bytecode anyway, so if you make your patchfile based on the source changes after compilation and decompilation, you've got all the right transforms in place to come up with a pretty effectively minimal changeset.
Of course, we have plenty of bandwidth anyway, and programs are small. Media files are not, but I'm not going to get any space savings trying to disassemble that picture of your mom.
In case anyone has missed the reference of the name "Courgette", it's French for summer squash/zucchini type vegetables. So, Courgette as in squash, and squash as in make smaller.
This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure.
Is patch size usually a factor in determining whether to send out a patch or not?
In my experience, people update because a patch fixes some bugs or introduce some features they want. The size of the patch doesn't matter unless the user is severely bandwidth-limited.
For developers, patches are usually sent out on a schedule (e.g. Microsoft's patch Tuesday), when certain milestone is reached, or when a sufficiently dangerous bug is found and need to be fixed quickly. I am not any organization that sends out patches based on 'the size of the diffs accumulated so far', but please enlighten me if you know of one.
The sort of English people who need to pronounce "courgette" are the sort who still go to old fashioned food shops, not supermarkets, and so have to ask for things. The sort who tend to be better off, older and who have spent more time on holiday in France and Italy (which is where we learnt to eat them in the first place.)
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
Electronic Frontier Foundation
Since this will be released as open source, it should make distributing updates a lot easier for the open-source community.
We open source community don't have no F**KING business in binary distribution!
A similar approach for distributing updates to source packages has been around for years: The dynamic deltup server network. You can tell their servers which source archives you already have and which new version you want. The server then unpacks both archives and sends you a deltup diff that can be used to create a bit-by-bit copy of the desired source archive, using the deltup program.
An example use case for this are source based operating system distributions, like Gentoo GNU/Linux. The saved bandwidth is usually significant, often more than 90%.
http://linux01.gwdg.de/~nlissne/dynamic.html
Don't be so hard on yourself. I'm sure you were already modded down way before that sentence.
Hey, my favorite Compare2Crack.v.0.10.(S)1995 just got opensourced! I wonder how long will take for crackers to adopt distribution of "bugfixes" for commercial software as well.
So the OPEN SOURCE community will benefit from BINARY diffs.
uhuh. I think we're just fine with patch/diff.
oh, what's that? You meant the people that DISTRIBUTE BINARY version of OPEN SOURCE programs will benefit? Ahhh, now I see.
I think you mean "disassemble". Dissemble is not the opposite of assemble and means something quite different.
Defective design (design is defective, intentional or not)
Defective by design (intentionally defective design)
Bad/ineffective design (unintentionally defective)
The first one is ambiguous.
If Google's OS is to become open source - why the need for binary diffs at all?
Isn't it feasible that the client store all the source and the patch process just involves a diff to the original source code that gets recompiled?
Also, this dissassemble,edit,reassemble scheme isn't new. There are a few viruses out there that perform similar actions in order to create space in the executable for themselves. pretty ingenious actually.
I.O.U One Sig.
From my first pass over it this actually looks like a very strange way to describe a rather simple approach.
More generically they could achieve the same, if not more, by using a differential PPM or LZ method, which is very simple to design/implement, needs zeto knowledge of what data it is handling, and is in no way a new idea.
I suspect whomever designed this had a good idea but not that enough compression experience to know solutions were already on hand.
In effect the new binary should be compressed with a dictionary compresser pre-initialised with the contents of the old binary as an initial seed and everything needed is achieved - the initial compression time can be large for a large binary, however as this is a single compress, millions of decompress situation, it would not matter at all (and by large I mean minutes, not days).
Still, their more complex description/implementation sounds much more whizzy and gets more attention, no doubt.
As bandwidth was a tad limited in those days we too looked for an efficient way to distribute updates. The solution was to distribute the smaller bug fixes as patches, similar to debug scripts. The loader would run those debug scripts after loading the program. To apply a patch the customer would put the patch file in the same folder as the program, restart the program on the hot standby side of the cluster and provoke a switchover to the standby.
The patch was applied without any downtime. If the customer wanted to back-out the bug fix then all they had to do was delete the patch file and switch back to the unpatched side of the cluster.
Most patches were small and we only had a few hundred bytes to send out at a time. Afterwards the world upgraded to Windows and forgot such technology :-(
I may be missing something but I'm surprised that nobody raised the security issue of putting an man-in-the-middle able to disassemble the expected code and replace it by its own. What about signatures in this stuff?