New Binary Diffing Algorithm Announced By Google

← Back to Stories (view on slashdot.org)

New Binary Diffing Algorithm Announced By Google

Posted by timothy on Thursday July 16, 2009 @05:20AM from the smartness-and-light dept.

bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."

9 of 192 comments (clear)

Min score:

Reason:

Sort:

uses a primitive automatic disassembler by flowsnake · 2009-07-16 05:25 · Score: 4, Insightful

An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?
1. Re:uses a primitive automatic disassembler by FunkyELF · 2009-07-16 06:38 · Score: 4, Insightful
  
  It's an interesting approach to a problem that should never have arisen. Why does everything have to be thrown into a 10MB dll? People are browsing on computers, not wristwatches, and there's no reason to abandon modular design to tangle things together for trivial performance gains.
  Just because the end result is a 10MB dll doesn't mean that the code and design isn't modular. Thats like saying Linux isn't modular because the live CDs / DVDs come in a single gigantic 4.7Gb .iso file.
Also less overhead for Google by Enderandrew · 2009-07-16 05:28 · Score: 4, Insightful

Google has to pay the cost for maintaining servers and handling bandwidth for all the OS updates they push out. The more efficient they are in this process, the more money the save.
The good news is that the same benefits could be applied to Red Hat, Ubuntu, openSUSE, etc. Lower costs helps the profitability of companies trying to make a profit on Linux.
The end users also see benefits in that their packages download quicker. I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.

--
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
Bad explanation by DoofusOfDeath · 2009-07-16 05:37 · Score: 4, Insightful

Courgette achieves smaller diffs (about 9x in one example)
That's potentially very misleading. I can compress any document, down to a single but, if my compression algorithm is sufficiently tailored to that document. For example:
if (compressed_data[0] == 0): return = get_Magna_Carta_text() else: return unzip(compressed_data[1:])
What we need to know is the overall distribution of compression ratios, or at least the average compression ratio, over a large population of documents.
Like many brilliant ideas... by istartedi · 2009-07-16 05:38 · Score: 4, Informative

...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".
The idea is simple, and reminds me of something I learned in school regarding signals. Some operations are easy to perform in the frequency domain, so you do the Fourier transform, perform the operation, and then transform back.
This is really just the same idea applied to the problem of patches. They're small in source; but big in binary. It seems so obvious that you could apply a transform,patch,reverse process... but only when pointed out and demonstrated.
It's almost like my favorite invention: the phonograph.
The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.
Are the inventors that brilliant, or are we just that stupid.

--
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
Re:wait a minute by flowsnake · 2009-07-16 05:45 · Score: 5, Informative

Not really a problem. Every n releases you push a complete patch - a bit like key frames in MPEG. People who keep their stuff reasonably up-to-date benefit from the smaller patches, those who don't just have to go back to the 'key frame' equivalent. And on the client - the latest version on the host is effectively the sum of all the diffs up to that point. OK so there is not enough information there to revert to an arbitrary earlier version, but usually we don't revert to older versions of executables. If we absolutely have to revert, maybe to undo a bad update, we can always just download a complete version of the required version.
Re:The cool thing is... by mzs · 2009-07-16 05:46 · Score: 5, Informative

You didn't RTFA before posting did you? When they say assembler they mean something of their own that takes a binary blob and one you have already and reassembles the original binary. It just so happens that the disassembler knows a lot about windows executables and the archive format that google uses breaks it up into some portions and assigns labels to addresses. Then it runs bsdiff on this smaller subset.
The code outlined in the blog post is really in these files:
win32_x86_generator.h
win32_x86_patcher.h
Notice these names? This is the disappointing aspect to all of this, it is one more new reason that Chrome is x86 and primarily Windows. You would need one for Mach-O and ELF to do this on other platforms and then if you were on another processor, say ARM or PPC, you would need something that understood that as well. Then there is the issue about 64-bit. In any case the assembler is not something like gas or MASM which is what you imagined.
Re:Can a layman get an explanation in English? by six · 2009-07-16 05:58 · Score: 5, Informative

Binary executable files contain a lot of addresses (variables, jump locations, ...) that are generated by the assembler at compile time.
Now consider you just add one 1-byte instruction somewhere in the middle of your program (let's say "nop"). When you compile it again, all the code that reference addresses beyond the insert point will have changed because the address has been incremented. So these 4 bytes added to your source code could mean addresses that get incremented in the compiled file in thousands of places.
What they do basically is take the binary file, disassemble it back to pseudo source code (not real asm I guess), and diff that against old version. The patch engine on the client end does the same disassembling, applies the patch, and reassembles the patched source code to an executable file.
This means diffs gets much smaller (4 bytes vs. 1000s in my extreme example), but also makes the diff/patch process much more complex, slower, and not portable.
Re:Can a layman get an explanation in English? by PCM2 · 2009-07-16 06:56 · Score: 4, Insightful
But, if the compilers are similar enough to create the same pseudocode/bytecode/ASM, or smart enough to save the source code, and use it for future comparisons, then wouldn't one patch be just as portable as the original source code?
It's a good theory and you're a smart person, but:
1. The compilers probably wouldn't be similar enough. Even developers who use GCC to compile something for Linux usually use Visual Studio to compile the same code for Windows. (The source code for Chrome, for example, shipped as a Visual Studio project.) Mac OS X likes to have everything written in Objective C, so that output would probably be very different.
2. Different operating systems rely on different shared libraries to do the same things. So a function call that opens a file in Linux might not look like a function call to do the same thing on Windows -- it might take a different number of arguments, for example, which means it would look rather different in machine language.
Portability doesn't appear to be Google's primary concern, though. They seem to be keen on the idea of delivering binaries over the wire (real binaries, not bytecode) -- see Google Native Client.
--
Breakfast served all day!