Delta Compression for Linux Security Patches?
cperciva asks: "For people without fast internet connections, it is often impractical to download large security patches. In order to reduce patch sizes, some operating systems -- starting with FreeBSD over a year ago, and recently followed by Mac OS X and Windows XP SP2 -- have started to use delta compression (also known as binary diffs, which constitutes a portion of my doctoral thesis), and can often reduce patch sizes by over a factor of 50. In light of the obvious benefits, I have to ask: When will Linux vendors follow suit?"
As soon as binary diffs get hacked into RPM, it might happen. Binary diffs of one RPM against a later version won't really work, as binary diffs are only small when they are produced from uncompressed, unencrypted data. The real issue is that Linux doesn't really need binary diffs. Linux distros already have fine-grained packages (lots of little packages, not a few big ones). Security updates usually require just one or very few packages to be updated. Binary diffs only really make sense when you have huge packages that require a whole new package for an upgrade. I bet the average RPM is about the same size as the minimum binary diff from MS.
Now with broadband being so popular, and still on the rise, is this really an issue?
Yes, it is. I just switched to broadband less than two months ago. A lot of my friends are still on dialup. Also, do not forget rural areas which do not have access to broadband. You would be surprised how many people still have dialup; I believe the number of broadband users just recently surpassed the number of dialup users. This means, obviously, that nearly half of all internet users are still on dialup.
In light of the obvious benefits, I have to ask: When will Linux vendors follow suit?
You make it sound like it's a sweeping trend and Linux is in the dark ages for not doing it. This is the first I've heard of this!
Also, wouldn't normal source patches be compressed quite a bit more anyway b/c of the nature of redundancy in text? This is a benefit for binary-only systems as you say. Are there really a lot of users hurting because they just can't download all the new patched binaries?
OK, before I get berated by the karma (whoring) police: I do realize these are not binary diffs. But, seriously, Linux has been using diffs as a way to save bandwidth since before Windows even offered 'updates'. Another example of Windows 'innovation', I guess.
Yes, I see how it is neat that there is a binary version of this process with Windows, but Linux is primarily a source-based operating system. It is that way because the software is designed to be compiled for a variety of systems and setups and work with all of them.
I do understand the author's question, but it really should be reworded. Linux is not an OS in the sense that Windows is an OS. He should perhaps more correctly be asking when one of the 'binary' distributions of Linux (or of a Linux 'based' OS, to be exact) plans on offering this. Binary packages are really only offered on a per-distribution basis, with the binaries not being very compatible between distros and systems (although some basic compatibility is generally there). As to that question, who knows and who cares; I use Gentoo, and after trying almost every one of the binary distros
"Take that Lisa's beliefs!" - Homer Simpson
Don't programs like yum get rid of the dependency issues?
They definitely should have done whatever was necessary to keep the name as just "bzip".
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
And how is this different from source code patches? It seems to me that they'll only provide patches from version to version, like they do with GNU Emacs. If you need to update multiple versions then you have to make a decision about going through 10 patches, or doing a full download of the desired version.
Used to do this back in ye olde DOS shareware days. I think RTPatch was the most common of the commercial ones.
Opportunity knocks. Karma hunts you down.
It would help a lot if tar would do it if you just provided -z instead of having to remember to provide -j. Come to think of it, it would be nice if tar just detected compression and you did not have to give it -z either! Can this be done?
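For what it's worth, detecting the format is cheap, because gzip and bzip2 files start with distinctive magic bytes. Below is a minimal sketch in Python rather than in tar itself: `sniff_compression`, `make_tarball` and `list_members` are made-up helper names for illustration, while `tarfile.open` with mode `"r:*"` is the standard library's way of reading an archive with transparent compression detection.

```python
import io
import tarfile

# Leading "magic" bytes of the two formats discussed in this thread.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(header: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if header.startswith(GZIP_MAGIC):
        return "gz"
    if header.startswith(BZIP2_MAGIC):
        return "bz2"
    return "none"


def make_tarball(mode: str) -> bytes:
    """Build a tiny in-memory tarball; mode is e.g. 'w:gz' or 'w:bz2'."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        payload = b"hello\n"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_members(blob: bytes) -> list:
    """List archive members without being told how the archive is compressed."""
    # Mode "r:*" asks tarfile to work out the compression on its own.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        return tar.getnames()
```

So a reader only needs the first three bytes to pick a decompressor, and the caller never has to remember which flag goes with which suffix.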
Firstly, Linux programs tend to be smaller than Windows programs (do one thing, and do it well). Even a huge beast like tetex is 'only' 14.4MB -- compare to SP2... This has reduced the demand for delta compression.
Secondly, in the Windows world people release rarely. The opposite is true in the Linux world -- projects with daily releases are not unheard of, and weekly releases are fairly common. This means enumerating patches (v. 3.4 -> v. 3.7) is infeasible in Linux where it is feasible in Windows.
More sophisticated algorithms than delta checksums do exist (as I guess you know if your thesis is on them) -- rolling checksums have been used in several projects I know of. However, there is a widespread rumour that these techniques are patented. I have never seen any evidence, but it puts a damper on any implementations.
There is a semi-vapourware project implementing all of this (part of the apache project IIRC). However the project fizzled away several years ago.
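To make the rolling checksum idea mentioned above concrete, here is a minimal sketch of an rsync-style weak checksum in Python. It assumes that this is the kind of rolling checksum meant; the function names are made up for illustration, and a real tool would pair the weak checksum with a strong hash instead of comparing the bytes directly.

```python
M = 1 << 16  # both sums are kept modulo 2**16


def weak_checksum(block: bytes):
    """Compute the pair (a, b) for one window from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte gets weight n, the last byte gets weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int):
    """Slide a window of length n one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_block(data: bytes, target: bytes) -> int:
    """Return the first offset in data where target occurs, or -1."""
    n = len(target)
    if n == 0 or len(data) < n:
        return -1
    want = weak_checksum(target)
    a, b = weak_checksum(data[:n])
    for start in range(len(data) - n + 1):
        # The cheap checksum filters candidates; the byte comparison confirms.
        if (a, b) == want and data[start:start + n] == target:
            return start
        if start + n < len(data):
            a, b = roll(a, b, data[start], data[start + n], n)
    return -1
```

The point of the rolling form is that a known block can be located at any byte offset in the new file without recomputing a checksum from scratch at every position.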
While it's more difficult to set up a system with Gentoo than Windows 2000, it's easier to maintain.
This is probably because of portage. Precompiled packages coming from all different sources can be a bitch to maintain. Mandrake is my example for this, if you ever want to update a package they don't have RPMs for. And as for compile time, I'd rather let the computer sit for an hour or two overnight compiling a huge package than have to deal with the dependencies myself.
What's so new about it? I remember working with InstallShield, RTPatch, and others, way back in the Windows 3.11 days... New? <yawn>
I always for example grab the "regular" tar.gz version of the kernel for two reasons,
1) I always forget the j option to tar, since bz2 packages are not that common. It should autodetect it.
2) I have the perception that the combined download time and unpacking is longer for bz2
Point two was subjective up until now, but just for the hell of it I decided to measure it. I used the time command to measure how long it took to download the kernels and how long it took to unpack them:
time to download linux-2.6.8.tar.bz2 1m4.414s
time to download linux-2.6.8.tar.gz 1m9.706s
time to unpack linux-2.6.8.tar.bz2 2m05.457s
time to unpack linux-2.6.8.tar.gz 0m26.309s
This is on a P4C 3.2GHz, 1GB RAM, 8Mbit connection. So there you have it, with a fast enough connection the difference is significantly in favor of the old gz format. The size difference between the bz2 and gz kernel, about 8.8 MB, is not nearly good enough to merit the slower unpacking. If you have a slower machine but also a slower connection the result is likely in the same ballpark.
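Adding up the four measurements above makes the gap explicit: roughly 3m10s in total for bz2 against roughly 1m36s for gz. A small Python sketch of that arithmetic follows; `to_seconds` is just an illustrative helper for parsing the output format of `time`.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '2m05.457s' to seconds."""
    minutes, rest = stamp.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


# The four measurements quoted above.
timings = {
    "bz2": {"download": "1m4.414s", "unpack": "2m05.457s"},
    "gz": {"download": "1m9.706s", "unpack": "0m26.309s"},
}

# Total wall-clock time per format: download plus unpack.
totals = {
    fmt: sum(to_seconds(stamp) for stamp in steps.values())
    for fmt, steps in timings.items()
}

for fmt, total in totals.items():
    print(f"{fmt}: {total:.3f} s")
```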
This goes to show that if you want to provide faster (subjective) update times to users, especially in the future with faster connections, you have to study the problem in detail and not just blindly try to optimize some aspect of the process (size in this case), since the overall performance might in fact be worse. Premature optimization and all that... What's the time for patching using delta compression anyway? If a 600KB RPM update can be delta compressed to 10KB, but the patching process takes longer than 15 seconds, I'm likely to see a slowdown in system update time.
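Whether the delta wins depends on the link speed, which a rough break-even calculation shows. The sketch below uses the 600KB, 10KB and 15 second figures from the paragraph above; the 56 kbit/s dialup rate is an assumed nominal value, 1KB is taken as 1000 bytes, and protocol overhead is ignored. The function names are made up for illustration.

```python
def transfer_seconds(size_kb: float, link_kbit_per_s: float) -> float:
    """Time to move size_kb kilobytes over a link, ignoring overhead."""
    return size_kb * 8 / link_kbit_per_s


def delta_wins(full_kb: float, delta_kb: float,
               patch_seconds: float, link_kbit_per_s: float) -> bool:
    """True if the download time saved exceeds the time spent patching."""
    saved = transfer_seconds(full_kb - delta_kb, link_kbit_per_s)
    return saved > patch_seconds


# Assumed link speeds: nominal 56k dialup and the 8 Mbit line from the post.
for name, speed in [("dialup", 56), ("8 Mbit", 8000)]:
    saved = transfer_seconds(600 - 10, speed)
    print(f"{name}: saves {saved:.1f} s of download, "
          f"delta wins: {delta_wins(600, 10, 15, speed)}")
```

Under these assumptions the delta saves well over a minute on dialup, so a 15 second patch step is still a net win there, while on the 8 Mbit line it saves less than a second and the patch step dominates.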
It's like deja vu all over again.