Slashdot Mirror


New Binary Diffing Algorithm Announced By Google

bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."

21 of 192 comments (clear)

  1. uses a primitive automatic disassembler by flowsnake · · Score: 4, Insightful

    An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?

    1. Re:uses a primitive automatic disassembler by hattig · · Score: 3, Insightful

      Why not just ship the affected .class files for Java? Disassemble the Jar that is being updated, replace/add/remove the class(es) as per the update instructions and rebuild the Jar.

      Chrome's problem is that a massive, bulky, chrome.dll file needs to be sent out with each update, and that it isn't easy to split it up into its constituent parts because it is lower level.

      It's not a new idea, but quite impressive that they've actually gone and worked it all out so that it is reliable. And nicer than patching via jumps (noop the old code, add a jump to new code, append new code to end).

    2. Re:uses a primitive automatic disassembler by FunkyELF · · Score: 4, Insightful

      It's an interesting approach to a problem that should never have arisen. Why does everything have to be thrown into a 10MB dll? People are browsing on computers, not wristwatches, and there's no reason to abandon modular design to tangle things together for trivial performance gains.

      Just because the end result is a 10MB dll doesn't mean that the code and design isn't modular. Thats like saying Linux isn't modular because the live CDs / DVDs come in a single gigantic 4.7Gb .iso file.

    3. Re:uses a primitive automatic disassembler by Cyberax · · Score: 3, Interesting

      It would work fine, if you include Java/.NET specific disassemblers.

      In fact, you can already compress Java JARs about ten times by using pack200 algorithm (it works essentially the same way).

  2. Also less overhead for Google by Enderandrew · · Score: 4, Insightful

    Google has to pay the cost for maintaining servers and handling bandwidth for all the OS updates they push out. The more efficient they are in this process, the more money the save.

    The good news is that the same benefits could be applied to Red Hat, Ubuntu, openSUSE, etc. Lower costs helps the profitability of companies trying to make a profit on Linux.

    The end users also see benefits in that their packages download quicker. I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.

    --
    http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
  3. The cool thing is... by salimma · · Score: 3, Interesting

    The cool thing is, one can easily extend this to other executable formats, as long as the assembler is readily available client-side: Windows users could relate to those pesky, resource-hogging Java updates, and .NET and Mono applications could similarly benefit.

    This is, interestingly, the second binary diffing innovation that affects me in the past few months. Fedora just turned on delta updates with Fedora 11, a feature borrowed from the openSUSE folks.

    --
    Michel
    Fedora Project Contribut
    1. Re:The cool thing is... by mzs · · Score: 5, Informative

      You didn't RTFA before posting did you? When they say assembler they mean something of their own that takes a binary blob and one you have already and reassembles the original binary. It just so happens that the disassembler knows a lot about windows executables and the archive format that google uses breaks it up into some portions and assigns labels to addresses. Then it runs bsdiff on this smaller subset.

      The code outlined in the blog post is really in these files:

      win32_x86_generator.h
      win32_x86_patcher.h

      Notice these names? This is the disappointing aspect to all of this, it is one more new reason that Chrome is x86 and primarily Windows. You would need one for Mach-O and ELF to do this on other platforms and then if you were on another processor, say ARM or PPC, you would need something that understood that as well. Then there is the issue about 64-bit. In any case the assembler is not something like gas or MASM which is what you imagined.

  4. wait a minute by six · · Score: 3, Insightful

    announced a new compression technique called Courgette geared towards distributing really small updates

    I just RTFA, this has nothing to do with a compression technique.

    What they developed is a technique to make small diffs from *executable binary files* and it doesn't look like it's portable to anything other than x86 because the patch engine has to embed an architecture specific assembler + disasembler.

    1. Re:wait a minute by flowsnake · · Score: 5, Informative

      Not really a problem. Every n releases you push a complete patch - a bit like key frames in MPEG. People who keep their stuff reasonably up-to-date benefit from the smaller patches, those who don't just have to go back to the 'key frame' equivalent. And on the client - the latest version on the host is effectively the sum of all the diffs up to that point. OK so there is not enough information there to revert to an arbitrary earlier version, but usually we don't revert to older versions of executables. If we absolutely have to revert, maybe to undo a bad update, we can always just download a complete version of the required version.

  5. Bad explanation by DoofusOfDeath · · Score: 4, Insightful

    Courgette achieves smaller diffs (about 9x in one example)

    That's potentially very misleading. I can compress any document, down to a single but, if my compression algorithm is sufficiently tailored to that document. For example:

    if (compressed_data[0] == 0):
          return = get_Magna_Carta_text()
    else:
          return unzip(compressed_data[1:])

    What we need to know is the overall distribution of compression ratios, or at least the average compression ratio, over a large population of documents.

  6. Today? by Blakey+Rat · · Score: 3, Funny

    Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates today.

    Better hurry! It won't work tomorrow!

  7. Like many brilliant ideas... by istartedi · · Score: 4, Informative

    ...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".

    The idea is simple, and reminds me of something I learned in school regarding signals. Some operations are easy to perform in the frequency domain, so you do the Fourier transform, perform the operation, and then transform back.

    This is really just the same idea applied to the problem of patches. They're small in source; but big in binary. It seems so obvious that you could apply a transform,patch,reverse process... but only when pointed out and demonstrated.

    It's almost like my favorite invention: the phonograph.

    The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.

    Are the inventors that brilliant, or are we just that stupid.

    --
    For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
    1. Re:Like many brilliant ideas... by merreborn · · Score: 3, Insightful

      Like many brilliant ideas... it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".

      Probably because the old solution was:

      A) Simple
      B) Good enough for most purposes.

      Sure, you can shave 80% off your patch filesize... but unless you're as big as google, patch bandwidth probably isn't a major priority -- you've likely got much more important things to dedicate engineering resources to.

      You know how they say "Necessity is the mother of invention"? Well, when an invention isn't really necessary for most folks, it tends to show up a little later than it might otherwise have.

    2. Re:Like many brilliant ideas... by xenocide2 · · Score: 3, Informative

      I've been reviewing various proposals like this, and basically, it's a tradeoff mirrors don't want. This sorta stuff has been proposed for ages. I listened to a recording of the author of rsync give an introduction to the algorithm and program, and among the questions was "is this suitable for .deb?". The answer was "Not unless the archive is completely recompressed with gzip patched to be rsync compatible".

      Eventually that patch landed, and you could conceivably do this. Except it expands the entire archive by 1-2 percent. And there's a lot of CPU overhead where there was none before.

      Then someone cooked up zsync. It's the same thing as rsync except with a precalculated set of results for the server side. This looked like a winner. But someone else really wants LZMA compressed packages to fit more onto pressed CDs. So now we're at a fundamental impass: optimize for the distribution of install media to new users, or optimize for the distribution of updates to existing users.

      The best resolution I've seen is to use LZMA compression on the entire CD volume, but that requires the kernel to get their ass in gear and allow yet another compression in the kernel. That may have finally happened, I haven't checked recently. But generally LZMA requires more RAM to operate, so that could raise the minimum requirements on installs.

      In short, it's a balancing act of effort, bandwidth, CPU and RAM. What works for some may not work for all.

      --
      I Browse at +4 Flamebait

      Open Source Sysadmin

  8. Re:Rsync by Anonymous Coward · · Score: 3, Interesting

    Umm, many of us use rsync like mad on binaries such as ISO images or repository trees full of RPMs which are full of compressed data.

    The rsync algorithm (see the cool thesis) is actually quite adept at processing "non-text" byte sequences. It's main innovation is having a heuristic to break the stream into smaller phrases by identifying canonical start/end boundaries by running the same heuristic on both source and destination files (without ever possessing both files in the same place). It transfers a description of the source file as a sequence of phrases identified by their hash. It transfers entire phrases when the recipient doesn't already have one with the same hash. It doesn't compute "diffs" in any normal sense, e.g. comparing before/after copies and creating a patch.

    An application of rsync could be immensely improved in any situation where it operates on structured data inside a serialization "container" by cracking open the container and operating on the internal structured data stream instead of the container representation. Then, the recipient would reencode the container after generating a full copy of the internal structured data. However, if the container encoding is not normalized and deterministic, you would not get back a byte-for-byte identical container on the other side.

    This google strategy is to open up the "assembled container" and do diffs on the source representation. It sounds more like an attempt to recover easy newline boundaries for doing diffs without a clever phrase boundary heuristic like used in rsync. I wonder whether their assembler is deterministic. Operating on source code in general could be great, but compilation with optimizers and linkers is not necessarily deterministic, so a different sort of integrity protection system would be needed to validate the results of the process.

  9. Re:Solving the wrong problem by Joe+Random · · Score: 3, Insightful

    If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.

    No one is saying that the bandwidth is a problem. They're just saying that the bandwidth is unnecessary. FSM-forbid that anyone try to optimize something.

    Plus, as the article points out, smaller updates mean more people can receive the update per unit-bandwidth, which means faster distribution of security updates when something critical is fixed.

  10. Re:Can a layman get an explanation in English? by six · · Score: 5, Informative

    Binary executable files contain a lot of addresses (variables, jump locations, ...) that are generated by the assembler at compile time.

    Now consider you just add one 1-byte instruction somewhere in the middle of your program (let's say "nop"). When you compile it again, all the code that reference addresses beyond the insert point will have changed because the address has been incremented. So these 4 bytes added to your source code could mean addresses that get incremented in the compiled file in thousands of places.

    What they do basically is take the binary file, disassemble it back to pseudo source code (not real asm I guess), and diff that against old version. The patch engine on the client end does the same disassembling, applies the patch, and reassembles the patched source code to an executable file.

    This means diffs gets much smaller (4 bytes vs. 1000s in my extreme example), but also makes the diff/patch process much more complex, slower, and not portable.

  11. Re:Can a layman get an explanation in English? by chris_eineke · · Score: 3, Informative

    A compiler takes source codes and turns them into assembler code. That's lines of human-readable machine instruction mnemonics (for example, "Copy from here to here." "Is that bigger than zero?"). The assembler takes those lines and turns them into machine instructions, a sequence of binary numbers.

    Finding the difference between two huge gobs of binary numbers is difficult. Instead, they turn the binary numbers back into lines of mnemonics and use a algorithm that finds the difference between two huge listings of mnemonics.

    That method is easier because the listings of a program that has been changed slightly can be very similar to the listing of a unmodified program. That has to do with how compilers work.

    Capiche? ;)

    --
    "All you have to do is be fragile and grateful. So stay the underdog." Chuck Palahniuk, Choke
  12. Re:Rsync by bcrowell · · Score: 3, Interesting

    Simple answer: no statistically, I think rsync has very few binary files to deal with, at least the way I'm using it. also, their technique may make the diff data smaller, but it also makes the diffing/patching process a LOT slower, something many rsync users don't want because on a LAN you don't care much about bandwidth usage.

    Well, your use of rsync is not necessarily typical. I use unison for file synchronization, and unison uses rsync; I have quite a few binary files that I sync. You're right that it would be a tradeoff of cpu versus bandwidth, but actually that's the whole point of rsync.

    On the other hand, I can think of at least some other reasons that this algorithm would be less appropriate for rsync than for google's intended use. (1) Google knows that their file is an executable, and knows that it's an x86 executable. Rsync would have to detect this using the kind of heuristic algorithms used by the unix "file" utility. (2) It's kind of ugly to imagine building all this x86-specific stuff into a generic program like rsync, which people may be using on ARM machines, or which people may use in the future on architectures that haven't been invented yet. (3) Google is doing one-to-many file distribution (pushing out a single patch to millions of users), so that means that the tradeoff of cpu versus bandwidth is an excellent one for them. With rsync, the typical use is one-to-one (syncing machine A with machine B), so the tradeoff isn't as awesomely spectacular.

    BTW, a completely different application of this that Google may be interested in is for Google Native Client, which runs x86 code sandboxed in a browser. Suppose google has 10^8 users using a particular web app that runs x86 code in their browser via Native Client. As long as the program doesn't change, most of the users are probably just getting it from cache, so it loads fast and doesn't cost google any bandwidth. Then on a certain day, google updates the software. Potentially this causes a bottleneck where 10^8 users are all trying to download the same code (could be especially bad on a corporate network with lots of people simultaneously downloading the same stuff on a Monday morning). If Courgette is incorporated in the browser, then potentially it could be smart enough to realize that it already has version x cached, and what it's trying to download is version x+1, and work out the patch without having to download the whole thing again.

  13. Re:Rsync by NoCowardsHere · · Score: 3, Interesting

    Not really. Where Google's algorithm really shines is in exactly the field they designed it for: efficiently trying to update a large number of identical binary files of known content (particularly those representing code) on many remote computers, by sending a compact list of the differences.

    Rsync actually has to solve a different problem: figuring out where differences exist between two files separated over a slow link when you DON'T precisely know the content of the remote file, but know it's likely similar to a local one. Its rolling-checksum algorithm is very good at doing this pretty efficiently for many types of files.

  14. Re:Can a layman get an explanation in English? by PCM2 · · Score: 4, Insightful

    But, if the compilers are similar enough to create the same pseudocode/bytecode/ASM, or smart enough to save the source code, and use it for future comparisons, then wouldn't one patch be just as portable as the original source code?

    It's a good theory and you're a smart person, but:

    1. The compilers probably wouldn't be similar enough. Even developers who use GCC to compile something for Linux usually use Visual Studio to compile the same code for Windows. (The source code for Chrome, for example, shipped as a Visual Studio project.) Mac OS X likes to have everything written in Objective C, so that output would probably be very different.
    2. Different operating systems rely on different shared libraries to do the same things. So a function call that opens a file in Linux might not look like a function call to do the same thing on Windows -- it might take a different number of arguments, for example, which means it would look rather different in machine language.

    Portability doesn't appear to be Google's primary concern, though. They seem to be keen on the idea of delivering binaries over the wire (real binaries, not bytecode) -- see Google Native Client.

    --
    Breakfast served all day!