New Binary Diffing Algorithm Announced By Google

← Back to Stories (view on slashdot.org)

New Binary Diffing Algorithm Announced By Google

Posted by timothy on Thursday July 16, 2009 @05:20AM from the smartness-and-light dept.

bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."

13 of 192 comments (clear)

Min score:

Reason:

Sort:

Rsync by Gerald · 2009-07-16 05:29 · Score: 1, Interesting

Would this algorithm be useful to the rsync team?
1. Re:Rsync by Anonymous Coward · 2009-07-16 05:51 · Score: 3, Interesting
  
  Umm, many of us use rsync like mad on binaries such as ISO images or repository trees full of RPMs which are full of compressed data.
  The rsync algorithm (see the cool thesis) is actually quite adept at processing "non-text" byte sequences. It's main innovation is having a heuristic to break the stream into smaller phrases by identifying canonical start/end boundaries by running the same heuristic on both source and destination files (without ever possessing both files in the same place). It transfers a description of the source file as a sequence of phrases identified by their hash. It transfers entire phrases when the recipient doesn't already have one with the same hash. It doesn't compute "diffs" in any normal sense, e.g. comparing before/after copies and creating a patch.
  An application of rsync could be immensely improved in any situation where it operates on structured data inside a serialization "container" by cracking open the container and operating on the internal structured data stream instead of the container representation. Then, the recipient would reencode the container after generating a full copy of the internal structured data. However, if the container encoding is not normalized and deterministic, you would not get back a byte-for-byte identical container on the other side.
  This google strategy is to open up the "assembled container" and do diffs on the source representation. It sounds more like an attempt to recover easy newline boundaries for doing diffs without a clever phrase boundary heuristic like used in rsync. I wonder whether their assembler is deterministic. Operating on source code in general could be great, but compilation with optimizers and linkers is not necessarily deterministic, so a different sort of integrity protection system would be needed to validate the results of the process.
2. Re:Rsync by bcrowell · 2009-07-16 06:04 · Score: 3, Interesting
  
  Simple answer: no statistically, I think rsync has very few binary files to deal with, at least the way I'm using it. also, their technique may make the diff data smaller, but it also makes the diffing/patching process a LOT slower, something many rsync users don't want because on a LAN you don't care much about bandwidth usage.
  
  Well, your use of rsync is not necessarily typical. I use unison for file synchronization, and unison uses rsync; I have quite a few binary files that I sync. You're right that it would be a tradeoff of cpu versus bandwidth, but actually that's the whole point of rsync.
  On the other hand, I can think of at least some other reasons that this algorithm would be less appropriate for rsync than for google's intended use. (1) Google knows that their file is an executable, and knows that it's an x86 executable. Rsync would have to detect this using the kind of heuristic algorithms used by the unix "file" utility. (2) It's kind of ugly to imagine building all this x86-specific stuff into a generic program like rsync, which people may be using on ARM machines, or which people may use in the future on architectures that haven't been invented yet. (3) Google is doing one-to-many file distribution (pushing out a single patch to millions of users), so that means that the tradeoff of cpu versus bandwidth is an excellent one for them. With rsync, the typical use is one-to-one (syncing machine A with machine B), so the tradeoff isn't as awesomely spectacular.
  BTW, a completely different application of this that Google may be interested in is for Google Native Client, which runs x86 code sandboxed in a browser. Suppose google has 10^8 users using a particular web app that runs x86 code in their browser via Native Client. As long as the program doesn't change, most of the users are probably just getting it from cache, so it loads fast and doesn't cost google any bandwidth. Then on a certain day, google updates the software. Potentially this causes a bottleneck where 10^8 users are all trying to download the same code (could be especially bad on a corporate network with lots of people simultaneously downloading the same stuff on a Monday morning). If Courgette is incorporated in the browser, then potentially it could be smart enough to realize that it already has version x cached, and what it's trying to download is version x+1, and work out the patch without having to download the whole thing again.
  
  --
  Find free books.
3. Re:Rsync by NoCowardsHere · 2009-07-16 06:12 · Score: 3, Interesting
  
  Not really. Where Google's algorithm really shines is in exactly the field they designed it for: efficiently trying to update a large number of identical binary files of known content (particularly those representing code) on many remote computers, by sending a compact list of the differences.
  Rsync actually has to solve a different problem: figuring out where differences exist between two files separated over a slow link when you DON'T precisely know the content of the remote file, but know it's likely similar to a local one. Its rolling-checksum algorithm is very good at doing this pretty efficiently for many types of files.
4. Re:Rsync by mzs · 2009-07-16 06:21 · Score: 2, Interesting
  
  That was a very good high level explanation of how rsync works, thanks. It shows though that this approach is uninteresting to rsync. The hashes and heuristics are so that you do not have to compare the two ends over a potentially slow link, that is the beauty of rsync. In google's case they have all the versions locally and can do the comparisons orders of magnitude faster. So even if there was some way to get binary 1-1 (which I bet they do in fact) it would be useless for rsync as you would have to read the whole file to do the comparison, so why not just send the whole file instead of changed to begin with?
  The other point is that when google says assembler they do not mean something like MASM or gas, but rather something that can take two binary blobs (one that you have and one that they send) and run them to generate a third binary blob that matches what they intended the result to be after encapsulating it again. I think you get that but I wanted to make that clear to other readers.
The cool thing is... by salimma · 2009-07-16 05:31 · Score: 3, Interesting

The cool thing is, one can easily extend this to other executable formats, as long as the assembler is readily available client-side: Windows users could relate to those pesky, resource-hogging Java updates, and .NET and Mono applications could similarly benefit.
This is, interestingly, the second binary diffing innovation that affects me in the past few months. Fedora just turned on delta updates with Fedora 11, a feature borrowed from the openSUSE folks.

--
Michel
Fedora Project Contribut
Nothing terribly new by davidwr · 2009-07-16 05:41 · Score: 2, Interesting

There was a old, pre-X MacOS binary-diff program that diffed each element of the resource fork independently.
Back in those days, non-code elements usually were stored in unique "resources," one icon might be in ICON 1, another might be in ICON 2, etc.
Code was split among one or more resources.
If a code change only affected one section of code, the only 1 CODE resource would be affected.
Since each resource was limited to 32KB, a diff that only affected one resource would never be larger than twice that size.
If only a single byte changed, the diff was only the overhead to say what byte changed, the old value, and the new value, much like "diff" on text files only on a byte rather than line basis.
So, conceptually, this isn't all that new.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Re:Also less overhead for Google by Ed+Avis · 2009-07-16 05:58 · Score: 2, Interesting

Fedora is already using binary diffs to speed up downloading updates - see yum-presto. With a better binary diff algorithm, the RPM updates can hopefully be made even smaller. As the Google developers note, making smaller update packages isn't just a 'nice to have' - it really makes a difference in getting vulnerabilities patched faster and cuts the bandwidth bill for the vendor and its mirror sites. Remembering my experiences downloading updates over a 56k modem, I am also strongly in favour of anything that makes updating faster for the user.

--
-- Ed Avis ed@membled.com
BCJ filter by hpa · 2009-07-16 06:11 · Score: 2, Interesting

The concept of a Branch-Call-Jump (BCJ) filter is well-known in the data compression community, and is a standard part of quite a few deployed compression products. Used as a front end to a conventional compression algorithm -- or, in this case, a binary compression algorithm -- does indeed give significant improvements. The application to binary diff is particularly interesting, since it means you can deal with branches and other references *over* the compressed region, so this is really rather clever.
Re:uses a primitive automatic disassembler by Cyberax · 2009-07-16 06:45 · Score: 3, Interesting

It would work fine, if you include Java/.NET specific disassemblers.
In fact, you can already compress Java JARs about ten times by using pack200 algorithm (it works essentially the same way).
Re:Also less overhead for Google by klui · 2009-07-16 07:00 · Score: 2, Interesting

It's pretty cool they do this. As opposed to iTunes "updates" of 70+MB--really no updates at all since they're just the monolithic install package.
Re:uses a primitive automatic disassembler by PCM2 · 2009-07-16 07:06 · Score: 2, Interesting

An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?
I suspect that it would. I once heard Manuel De Icaza give a talk in the early days of the Mono project. He said the "bytecode" that came out of the C# compiler was analogous to the output that comes out of the front end of a C compiler. The JIT is the equivalent of the C compiler's back end, only it runs right before execution. I suspect that what Google's decompiler is doing is reverting the binary to something similar to the C compiler's internal representation -- and if so, this method would work pretty much the same for bytecode. But that's just a guess.

--
Breakfast served all day!
many ways to do that by ei4anb · 2009-07-16 20:36 · Score: 2, Interesting

I worked on a project back in the 80's where we had half a million lines of code running on a high availability machine with 384 32bit CPUs.
As bandwidth was a tad limited in those days we too looked for an efficient way to distribute updates. The solution was to distribute the smaller bug fixes as patches, similar to debug scripts. The loader would run those debug scripts after loading the program. To apply a patch the customer would put the patch file in the same folder as the program, restart the program on the hot standby side of the cluster and provoke a switchover to the standby.
The patch was applied without any downtime. If the customer wanted to back-out the bug fix then all they had to do was delete the patch file and switch back to the unpatched side of the cluster.
Most patches were small and we only had a few hundred bytes to send out at a time. Afterwards the world upgraded to Windows and forgot such technology :-(