New Binary Diffing Algorithm Announced By Google
bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."
Would this algorithm be useful to the rsync team?
The cool thing is, one can easily extend this to other executable formats, as long as the assembler is readily available client-side: Windows users could relate to those pesky, resource-hogging Java updates, and .NET and Mono applications could similarly benefit.
This is, interestingly, the second binary diffing innovation that affects me in the past few months. Fedora just turned on delta updates with Fedora 11, a feature borrowed from the openSUSE folks.
Michel
Fedora Project Contribut
There was a old, pre-X MacOS binary-diff program that diffed each element of the resource fork independently.
Back in those days, non-code elements usually were stored in unique "resources," one icon might be in ICON 1, another might be in ICON 2, etc.
Code was split among one or more resources.
If a code change only affected one section of code, the only 1 CODE resource would be affected.
Since each resource was limited to 32KB, a diff that only affected one resource would never be larger than twice that size.
If only a single byte changed, the diff was only the overhead to say what byte changed, the old value, and the new value, much like "diff" on text files only on a byte rather than line basis.
So, conceptually, this isn't all that new.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Fedora is already using binary diffs to speed up downloading updates - see yum-presto. With a better binary diff algorithm, the RPM updates can hopefully be made even smaller. As the Google developers note, making smaller update packages isn't just a 'nice to have' - it really makes a difference in getting vulnerabilities patched faster and cuts the bandwidth bill for the vendor and its mirror sites. Remembering my experiences downloading updates over a 56k modem, I am also strongly in favour of anything that makes updating faster for the user.
-- Ed Avis ed@membled.com
The concept of a Branch-Call-Jump (BCJ) filter is well-known in the data compression community, and is a standard part of quite a few deployed compression products. Used as a front end to a conventional compression algorithm -- or, in this case, a binary compression algorithm -- does indeed give significant improvements. The application to binary diff is particularly interesting, since it means you can deal with branches and other references *over* the compressed region, so this is really rather clever.
It would work fine, if you include Java/.NET specific disassemblers.
In fact, you can already compress Java JARs about ten times by using pack200 algorithm (it works essentially the same way).
It's pretty cool they do this. As opposed to iTunes "updates" of 70+MB--really no updates at all since they're just the monolithic install package.
An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?
I suspect that it would. I once heard Manuel De Icaza give a talk in the early days of the Mono project. He said the "bytecode" that came out of the C# compiler was analogous to the output that comes out of the front end of a C compiler. The JIT is the equivalent of the C compiler's back end, only it runs right before execution. I suspect that what Google's decompiler is doing is reverting the binary to something similar to the C compiler's internal representation -- and if so, this method would work pretty much the same for bytecode. But that's just a guess.
Breakfast served all day!
As bandwidth was a tad limited in those days we too looked for an efficient way to distribute updates. The solution was to distribute the smaller bug fixes as patches, similar to debug scripts. The loader would run those debug scripts after loading the program. To apply a patch the customer would put the patch file in the same folder as the program, restart the program on the hot standby side of the cluster and provoke a switchover to the standby.
The patch was applied without any downtime. If the customer wanted to back-out the bug fix then all they had to do was delete the patch file and switch back to the unpatched side of the cluster.
Most patches were small and we only had a few hundred bytes to send out at a time. Afterwards the world upgraded to Windows and forgot such technology :-(