Delta Compression for Linux Security Patches?
cperciva asks: "For people without fast internet connections, it is often impractical to download large security patches. In order to avoid to reduce patch sizes, some operating systems -- starting with FreeBSD over a year ago, and recently followed by Mac OS X and Windows XP SP2 -- have started to use delta compression (also known as binary diffs, which constitutes a portion of my doctoral thesis), and can often reduce patch sizes by over a factor of 50. In light of the obvious benefits, I have to ask: When will Linux vendors follow suit?"
I think Suse does this.
Patchers and crackxors have been using binary diffs for some time, to get rid of copyright protection, which is often just NOPing out a couple bytes. Linux is behind the times...
I.O.U One Sig.
Certainly for your primary commercial auto-updated Linux distributions it does, but for anything else it usually doesn't. What makes more sense (because it's easier) is breaking up media and programs, and distributing them separately so you don't have to update one when you update the other. Some projects do this already, and even package their sources this way.
Personally I'd prefer to see binary distributions move to a model of using something like cvs, so you can just do a cvs up (or equivalent) and update everything. Some files would have to be marked to always be overwritten, while config files would be merged. This solves both your differential update problem (if the right system is used - I'm thinking that's pretty much not CVS but I don't know if there's a way to make it do all of that - CVS doesn't handle binaries amazingly intelligently from what I understand) and your updates in general. Plus, you can use it both for source and binary updates.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
You go over to a friend's house that has broadband and a CD-Writer (both are very popular these days), download the patches onto a CD-R, and take it home?
The following statement is false.
The previous statement is true.
Welcome to my world.
... their biggest customers start using dialup.
You mean to tell me that beast I downloaded was just a diff? Jesus H. Christ!
As soon as binary diffs get hacked into RPM, it might happen. Binary diffs of one RPM against a later version won't really work, as binary diffs are only small when they are produced on uncompressed, unencrypted data. The real issue is that Linux doesn't really need binary diffs. Linux distros already have fine-grained packages (lots of little packages, not a few big ones). Security updates usually require just one or very few packages to be updated. Binary diffs only really make sense when you have huge packages that require a whole new package for an upgrade. I bet the average RPM is about the same size as the minimum binary diff from MS.
The folks at mindvision made an installer/installer creation tool that allowed one to scan two different sets of files and directories to find differences between them (binary differences), and it would just package up those differences in the installer archive. In fact you could use it to diff and package the deltas between several versions at once. When the user ran the installer (really an updater) it would apply the binary patch to the file set as needed.
I was using this tool over 7 years ago on Mac OS, so I don't see what is so new about this concept... but I am glad it looks like it is starting to be used more.
Now with broadband being so popular, and still on the rise, is this really an issue?
Yes, it is. I just switched to broadband less than two months ago. A lot of my friends are still on dialup. Also, do not forget rural areas which do not have access to broadband. You would be surprised how many people still have dialup. I believe the number of broadband users just recently surpassed the number of dialup users. This means, obviously, that nearly half of all internet users are still on dialup.
In light of the obvious benefits, I have to ask: When will Linux vendors follow suit?
You make it sound like it's a sweeping trend and Linux is in the dark ages for not doing it. This is the first I've heard of this!
Also, wouldn't normal source patches be compressed quite a bit more anyway b/c of the nature of redundancy in text? This is a benefit for binary-only systems as you say. Are there really a lot of users hurting because they just can't download all the new patched binaries?
Linux makes it very easy to install new packages and upgrade packages from sources farther away from the vendor. If a vendor tried to release a patch using delta versioning, it could totally wreck a system. Since neither RPM nor DPKG is designed to check md5sum hashes against each file and make sure the patch can be installed safely, it will have to wait until this feature is incorporated into either system.
This signature was left intentionally blank.
SUSE already does this.
RPM in general, however, doesn't nicely support this feature. Either RPM needs to be extended/modified, or a new format needs to be made. While I favor a new format for many reasons other than this, modifying RPM is probably the best solution in order to provide backwards compatibility.
On that topic, why does almost everybody distribute source code as gzipped tars instead of bzip2'ed tars (just about everybody that does use bzip2 also distributed gzips)? Sure, in the beginning gzip made more sense for people on slow machines, but nowadays the difference in the time it takes to decompress is trivial, whereas the compression benefits of bzip2 on text are phenomenal in my experience.
Delta-based patch distribution on the Linux platform is quite easy. Just use rsync to sync the application files against the source. I have used this technique (i.e. rsync) to provide updates/patches to an in-house built application. Works very nicely.
Consensus is good, but informed dictatorship is better
Perhaps, right after they get a good package management system...
I can't even imagine the mess that would be caused if someone tried to uninstall a binary-diff RPM/DEB.
There are some rsync servers out there, which provide essentially the same service, and then some.
Also, if download size is your #1 concern, why not download the source patches, and compile? A whole 10K may need to be downloaded...
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Imagine my surprise when my servers, patched scrupulously on a rigorous schedule, suddenly were listed as needing 114 patches this morning!
Yup, for the last month or so up2date has been failing to install packages - it downloads 'em, and makes lots of pretty hash marks like it is installing, then DOESN'T INSTALL THEM!
Red Hat fixed this LESS THAN 24 HOURS AGO
CHECK YOUR SERVERS PEOPLE
OR BE ROOTED!
Yes - not everyone has broadband infrastructure like in America/Canada/Korea/Japan. Even in Australia, where I live, broadband uptake is slow and most accounts are capped - that's another story. Most of the people I know, friends included, are still on dialup due to cost.
I have reason to believe that Daniel Lyons, author of the Forbes article is in fact the pen name of one John Doe, a 12 year old trailer park dwelling asexual dwarf who suffers from a severe dehabilitating mental illness known as "stupidity".
Hope that helps.
If the update is just a patch to the source, there's sometimes a minor revision made and an updated gentoo ebuild file and source code patch added to the portage tree, which is of course done via rsync. All in all, it's decently efficient. This mostly(I think) happens with unstable package versions, where a security update may make it into portage before the official project bumps their release, but that's not the case with stable stuff.
I think for basic systems, compile time complaints are slightly exaggerated. My -original- celeron 450 isn't shabby at all at compiling most of the more basic system packages and server apps. Even glibc and gcc build with relative ease, and when I set up distcc amongst my three systems, it became even less of a hassle. Even without distcc, the time to clear out 50 packages of updates on a mail server is surprisingly low on a low-powered system.
Please help metamoderate.
It makes a lot more sense for non open source operating systems because when you have the source, it means that there are going to be more people who compiled programs by themselves and thus are unable to use binary diffs. I'm sure the fact that FreeBSD has it is more because it's a neat hack than because a lot of people find it necessary (although I'm sure a few people do). So it's not really a necessity for linux distros to have this, but I bet that in the near future Debian and Redhat (maybe a few others) will set up functionality for binary diff patching just because.
Yeah, I guess you are right. Hey wait a second, what about the direct quote from those people who really did pick Windows over Linux due to the lower TCO?
Ok, before I get berated by the karma (whoring) police: I do realize these are not binary diffs. But, seriously, linux has been using diffs as a way to save bandwidth since before Windows even offered 'updates'. Another example of Windows 'innovation' I guess.
Yes, I see how it is neat that there is a binary version of this process with Windows, but linux is primarily a source based operating system. It is that way because the software is designed to be compiled for a variety of systems and setups and work with all of them.
I do understand the author's question though, but it really should be reworded. Linux is not an OS in the sense that Windows is an OS. He should perhaps more correctly be asking when one of the 'binary' distributions of Linux (or of a Linux 'based' OS to be exact) will plan on offering this. Binary packages are really only offered on a per distribution basis with the binaries not being very compatible between distros and systems (although some basic compatibility is generally there). As to that question, who knows and who cares? I use Gentoo, and after trying almost every one of the binary distros
"Take that Lisa's beliefs!" - Homer Simpson
Delta compression requires the vendor to create a delta for each older version that you can upgrade from. So if a package has had ten updates, the next update will need to have eleven deltas. I don't think so. Unless you want to do something like Windows Update where an agent scans your binaries and compares the difference with the update and then downloads individual files ... but that's a lot more complicated and isn't justified by the bandwidth savings.
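To put rough numbers on that, here is a small Python sketch. The function names are made up for illustration, and the "chained" alternative (one delta per release, applied in sequence by the client) is my own assumption about how a vendor might cut the count, not something any distro in this thread is confirmed to do.

```python
def direct_deltas_for_release(older_versions: int) -> int:
    """Deltas the vendor must build for ONE new release so that every
    older version can jump straight to it."""
    return older_versions


def total_direct_deltas(releases: int) -> int:
    """All deltas ever built if every release supports a direct upgrade
    from every earlier one: 0 + 1 + ... + (releases - 1)."""
    return releases * (releases - 1) // 2


def total_chained_deltas(releases: int) -> int:
    """Deltas built if each release only ships a delta from its immediate
    predecessor (clients apply them one after another)."""
    return max(releases - 1, 0)


# Original package plus ten updates = 11 versions in the field.
versions_in_field = 11
print(direct_deltas_for_release(versions_in_field))
print(total_direct_deltas(versions_in_field + 1))
print(total_chained_deltas(versions_in_field + 1))
```

The trade-off is that chaining keeps the vendor's work linear, but a user who skipped several releases has to download and apply several deltas.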
Yes, because one of the nice things about Linux is that it has relatively low hardware requirements.
This was supposed to be the last word.
Can someone explain, for those people who have no idea, what delta compression is and how it differs from something like zip/rar/gz/7z etc.?
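A rough sketch of the idea, in Python: zip/gz/7z can only exploit redundancy inside the file being sent, while a delta encoder also gets to look at the old version the recipient already has, so it only ships the bytes that are actually new. The names `make_delta`, `apply_delta` and `delta_payload_size` are made up for this example, and `difflib` is far cruder than what bsdiff or xdelta really do.

```python
import difflib
import zlib


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies out of `old` plus literal inserts."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # recipient already has these bytes
        elif j2 > j1:
            ops.append(("insert", new[j1:j2]))  # only these bytes have to travel
        # pure deletions need no op: those bytes are simply never copied
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def delta_payload_size(ops: list) -> int:
    """Literal bytes carried by the delta (ignores the small per-op overhead)."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


old = b"".join(b"function_%03d: original body\n" % i for i in range(60))
new = old.replace(b"function_030: original body", b"function_030: patched body!!")

ops = make_delta(old, new)
print("full file, zlib-compressed:", len(zlib.compress(new)), "bytes")
print("delta literal payload:     ", delta_payload_size(ops), "bytes")
```

The catch is the one others in this thread point out: the delta is only useful if the recipient has exactly the `old` bytes it was built against.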
--ajs
...toggle their diffs in from the front panel.
Sheesh, evil *and* a jerk. -- Jade
It makes a lot more sense for non open source operating systems
Yes. Sun does this with Solaris, just as SGI did with IRIX.
To-do List: Receive telemarketing call during a tornado warning. Check.
Maybe compiling source for every little update is fine for a hobbyist running Debian Unstable, but people just trying to run a server and get work done really would just like a quick little download to be done with it.
XDelta3 recently reached its first public release.
http://xdelta.org/xdelta3.html
XDelta3 is a library which is designed to foster exactly this kind of functionality. If distributions integrate the xdelta functionality into their package management framework, we would be well on our way to what the poster is looking for.
Suse already does this, but partially. Suse sends out patches for ascii configurations etc. and probably some non-ascii stuff as well. But, for things like kernel upgrade, KDE version upgrade etc. (where the diff is really needed due to the large size) - this is not done. I do not know why it is not done. It is either that it is not really feasible to do a binary diff across kernel version or KDE version changes or the technology is just not in place yet. Osho
Used to do this back in ye olde DOS shareware days. I think RTPatch was the most common of the commercial ones.
Opportunity knocks. Karma hunts you down.
Is another useful technique. Downloads only the changed portions of ISO images.
Specifix' Linux distribution, based on Conary, does similar things when transferring updates from the repository to the client machine. It only transfers files that have actually changed. It doesn't do xdelta-style binary diffs for various reasons, but that functionality could theoretically be implemented.
http://wiki.specifixinc.com/
All the others are proprietary, and don't distribute source. Linux distributes source, and is totally free-as-in-beer, so a) source=text=good compression, b) not monolithic, so you can update just those parts you need, and c) everyone can get CDs from their friends without any worries.
What is the big deal? Download the stuff at work and take it home. Everybody does it! I barely use the Internet at home; I download everything I need at work. My employer is Stanford University, and I had broadband 12 years ago when I started work here. I have never thought of getting broadband at home. I still use dialup; I don't need higher speeds at home.
It works beautifully but I can't help but think it is a waste of bandwidth.
Maybe compiling source for every little update is fine for a hobbyist running Debian Unstable,
I don't see how a Debian user would do that. Maybe you meant to say Gentoo. But Debian provides binary packages by default (although each is paired with its source code on the package server).
Is it really possible to instruct Debian to compile all updates locally? You'd need a decent quantity of custom scripting to get it going.
as one might usually think. Example: each compiler uses some sort of optimization to create executable code. I guess that has a similar effect on the overall binary like modifying one byte and then gzipping the file: nothing will be the same...
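If you want to try the gzip half of that analogy yourself, here is a small Python sketch (my own illustration, not from any tool mentioned here). It flips one bit in an uncompressed buffer, compresses both versions with zlib, and counts how many byte positions differ before and after compression. The sample data and the `differing_positions` helper are invented for the demo, and how far the change spreads depends on the data and the compressor settings.

```python
import zlib

# Made-up sample data standing in for an uncompressed package payload.
original = b"".join(
    b"record %05d: payload for the package body\n" % i for i in range(2000)
)
changed = bytearray(original)
changed[7] ^= 0x01          # flip one bit in a single byte near the start
modified = bytes(changed)


def differing_positions(x: bytes, y: bytes) -> int:
    """Count byte positions that differ, plus any difference in length."""
    return sum(p != q for p, q in zip(x, y)) + abs(len(x) - len(y))


raw_diff = differing_positions(original, modified)
packed_a = zlib.compress(original, 9)
packed_b = zlib.compress(modified, 9)
packed_diff = differing_positions(packed_a, packed_b)

print("uncompressed: %d of %d bytes differ" % (raw_diff, len(original)))
print("compressed:   %d of %d bytes differ" % (packed_diff, len(packed_a)))
```

This is why a diff taken between two compressed packages tends to be much less useful than one taken between their uncompressed contents.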
I believe the number of broadband users just recently surpassed the number of dialup users. This means, obviously, that nearly half of all internet users are still on dialup.
FFS, not everyone lives in the USA -- and those figures were for the USA only. Do some more research. (I'd guess it's more like 75% worldwide still on dialup, but who knows?)
What makes you think you need to take a server out of production while code compiles on it? I never have.
Firstly, linux programs tend to be smaller than windows programs (do one thing, and do it well). Even a huge beast like tetex is 'only' 14.4MB -- compare to SP2... This has reduced the demand for delta compression.
Secondly, in the windows world people release rarely. However, the opposite is true in the linux world -- projects with daily releases are not unheard of, and weekly releases are fairly common. This means enumerating patches (v 3.4 -> v. 3.7) is infeasible in Linux where it is feasible in Windows.
More sophisticated algorithms than delta checksums do exist (as I guess you know if your thesis is on them) -- rolling checksums have been used in several projects I know of. However, there is a widespread rumour that these techniques are patented. I have never seen any evidence, but it puts a damper on any implementations.
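For anyone wondering what a rolling checksum buys you: the checksum of a window can be updated in constant time when the window slides by one byte, so every offset of a file can be tested against known block checksums cheaply. Below is a minimal Python sketch in the style of the weak checksum used by rsync; the modulus, function names and the `find_block` helper are illustrative assumptions, not rsync's actual code.

```python
MOD = 1 << 16


def weak_checksum(block: bytes) -> tuple:
    """Compute the two running sums (a, b) over a whole block."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % MOD
        b = (b + (n - i) * byte) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, window: int) -> tuple:
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - window * out_byte + a) % MOD
    return a, b


def find_block(data: bytes, block: bytes) -> int:
    """Return the first offset where `block` occurs in `data`, or -1."""
    n = len(block)
    if n == 0 or n > len(data):
        return -1
    target = weak_checksum(block)
    a, b = weak_checksum(data[:n])
    for offset in range(len(data) - n + 1):
        # The weak checksum can collide, so confirm with a real comparison.
        if (a, b) == target and data[offset:offset + n] == block:
            return offset
        if offset + n < len(data):
            a, b = roll(a, b, data[offset], data[offset + n], n)
    return -1


sample = b"the quick brown fox jumps over the lazy dog" * 3
print(find_block(sample, b"lazy dog"))
```

A real tool would pair the weak checksum with a strong hash instead of a byte-for-byte comparison, since the receiver does not have the sender's data to compare against.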
There is a semi-vapourware project implementing all of this (part of the apache project IIRC). However the project fizzled away several years ago.
http://www.daemonology.net/bsdiff/
bsdiff and bspatch are tools for building and applying patches to binary files. By using suffix sorting (specifically, Larsson and Sadakane's qsufsort) and taking advantage of how executable files change, bsdiff routinely produces binary patches 50-80% smaller than those produced by Xdelta, and 15% smaller than those produced by
http://sourceforge.net/projects/diffball
A general delta compression/differencing suite for any platform that supports autoconf/automake, written in c, w/ builtin support for reading,writing, converting between multiple file formats, and an easy framework to drop in new algorithms.
I would think that someone working on their doctoral thesis would be able to find answers all by themselves.
when will Gentoo get this? ;)
Your CPU is not doing anything else, at least do something.
I run (too) many servers and in fact there is no distribution that ships the configurations we require, even excluding cases where we are running modified codebases. patching and rebuilding is trivial, supporting idiot users is not.
While it's more difficult to set up a system with Gentoo than Windows 2000, it's easier to maintain.
This is probably because of portage. Precompiled packages coming from all different sources can be a bitch to maintain. Mandrake is my example for this if you ever want to update a package they don't have RPMs for. And as for compile time, I'd rather let the computer sit for an hour or two overnight compiling a huge package than having to deal with the dependencies myself.
What patches???
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
With xdelta the computer science part is already done. Just throw an uncompressed version of the old and new packages at it, and add that difference file to ftp://update.$DISTRIBUTION.org's overburdened harddisk. Apt-emerge now only has to reconstitute the already installed package (maybe tricky), or at least insist on the package from the install disk. Rub the two files together with xdelta, maybe run a checksum, and you have your new package.
In order to avoid to reduce patch sizes
Maybe you should've done a doctoral thesis on reading over what you write before submitting it to Slashdot.
Also, don't forget what you have to use (dial-up) when your cable does go out. It once took a full week after my cable went out for a cable repair guy to come out and fix the problem.
That's not the only problem with the name. Some x86 kernels have a `make bzimage` build option to build a special big zipped image that does tricks to get around BIOS memory limits. But someone working on the m68k port, which doesn't have the limitation, thought that bzimage meant bzip'ed image, and so for a while the m68k kernel had the option of using a superior compression scheme in its bootloader. I think they finally removed it, though.
SuSE has patch.RPMs
There is one source for Windows security updates, it's called Windows Update, and it takes care of dependencies too. And it installs right when it's done downloading, no need for compiling, and it downloads quickly due to the technology described in the original article. No need to leave your production servers up with known, exposed security holes while compiling the update that would fix it; as soon as the patch is out, you can install it and you're good to go.
Especially since the license of bsdiff is not even close to a BSD license (don't let the name of BSD Protection License fool you). Unless the license is changed to something like BSD, BSDiff is not going to be implemented anywhere except in closed source software. Debian cannot even package this software because it is non-free.
I guess the bottom line is if you want to have something accepted in open source *and* in proprietary software, you want to license under BSD. If you want to cater to one group (closed source in this case), you will lose the other.
While windows update is generally good, one can see with SP2 that's part of windows update that it can break your computer. If this happened in linux, you might have a chance of fixing it. If this happens in windows, your chances of fixing it are less because of the lower free support. In effect, your system becomes a hostage.
Just use a simple gui tool like Ark or probably file roller to do it. Or write a simple shell script and call it "extract" :)
#!/bin/bash
# Pick the extractor based on what file(1) reports for the archive.
TYPE=$(file "$1")
if echo "$TYPE" | grep bzip &> /dev/null && echo "$TYPE" | grep tar &> /dev/null; then
    tar xvjf "$1"
# else if.... (the remaining archive types were elided in the post)
fi
Simple stuff
Someone correct me if I'm wrong. But isn't the reason RPMs are so particular about dependencies that whoever does the packaging doesn't research whether their app will actually work with an _older_ version of a distribution? If it did work, they could define a broader set of other packages it would work with in the spec file.
Other than the RPMs needlessly not installing in older environments, applications like urpmi, yum, yast and redcarpet take care of other dependencies painlessly.
Sun just sends out tarballs with funky headers, for both installation packages and patches thereof.
They only replace individual files, never binary diffs.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
The current Fedora Core 2 DVD ISO is about a 5GB beast. It would sure be nice to only have to download a 500MB Fedora Core 3 DVD ISO patch when FC3 does become available later this year.
Having used Redhat AS, etc., Fedora, and SuSE, and even a while back BSD... I am so incredibly impressed w/ Gentoo
Why don't you tell your boss that you have left systems up in production use for days at a time with known security exploits on them?
Or even better, rzip.
You can set it to have a buffer of up to 900 megs, as opposed to bzip2's 900k. So instead of looking for redundant information in small blocks of 900k, it looks for it in everything you compress (up to 900 megs).
And surprisingly, I haven't found it to be noticeably slower than bzip2, even on my ancient hardware (the only thing is that if you want to use it to its full potential, you need a lot of ram, but it'll work anyway without that.. just slower).
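The window-size effect is easy to see with a little Python. Big caveat: the standard library has no rzip binding, so this sketch uses `lzma` purely as a stand-in for "a compressor that can look further back than 900k"; it says nothing about rzip's actual ratios or speed. The test data is a block of random bytes repeated once, so the only redundancy is a repeat 1 MiB away.

```python
import bz2
import lzma
import os

MIB = 1024 * 1024

chunk = os.urandom(MIB)   # incompressible on its own
data = chunk + chunk      # the only redundancy is a repeat 1 MiB back

bz2_size = len(bz2.compress(data, 9))   # level 9 means 900k blocks
lzma_size = len(lzma.compress(data))    # default preset has a multi-megabyte dictionary

print("input:", len(data), "bytes")
print("bz2:  ", bz2_size, "bytes")
print("lzma: ", lzma_size, "bytes")
```

bzip2 never sees both copies inside one block, so it cannot exploit the repeat, while the larger dictionary can.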
Treehugger? Treehugger... Treehugger!
Because it's retarded.
Think about it. If you have to download binary diffs, then each update and each patch needs to be applied incrementally. And even if it isn't, I still end up using lots of third party software that is available through things like apt-get but not necessarily through official channels.
For instance I use Fedora with Dag's RPM add-ons. By updating my OS against both repositories I get the best of both worlds. I get easy reliable software from Fedora and all the extras that you don't normally get, like libdvdcss, from Dag.
Now, as part of the update using apt-get against Dag, I updated APT itself.
Now imagine that Fedora releases a binary diff against the official apt, and it gets applied to my apt-get from Dag.
I'd be SCREWED!!!
When you update a Linux distro you're not just doing system files.
When Windows does an update it's only against the core OS. That makes diffs manageable, but if you're running Apache on W2k no Microsoft patch for it will ever exist.
But Linux updates both core system AND all applications!
The way of using Apt-get, Yum, Portage is MUCH superior to what any vendor out there can possibly provide using closed source software.
With RPMs I can download them automatically using Apt/Yum. I can burn them to a CD. I can skip revisions when updating. I can choose what to update and what not to, I can update against 3rd parties, I can set up my own mirror with a simple ftp site for a business campus.
Whatever.
These diffs may sound like a good idea, but would make things an unmanageable mess. The developer's time and effort is worth a lot more than shaving 10 minutes off of a download.
Especially with Linux we actually have a REAL MULTITASKING OS. Which MS still doesn't have.
I can update my OS + play games + browse the internet + compile code + whatever without ever having to reboot (except kernel updates) or even stop what I am doing. Hell, I can have it running in the background and not even give a shit about when it updates, really.
What we have is not nearly as time-critical as the hell you have to go through when patching a MS OS.
FreeBSD may do it, but they are a much more tightly controlled environment than Linux is.
The FreeBSD link looks like some dude's pet project. Cool, but it is not the official method for distributing patches.
What's so new about it? I remember working with InstallShield, RTPatch, and others, way back in the Windows 3.11 days... New? <yawn>
I see some people talking about linux isos being so huge, and patches and all;
so it brought me to think on something I saw at a .net conference...
the guy had an app check for an updated dll on a server each time it was used, so that it automatically patched itself...
so in taking this one step further, take an office suite for example: i never use most of the features (and i know ms office has that install on first use option) why not extend this idea to all software, but have it dl itself... this would rid users of the typical bloat for unused features while keeping downloads and updates very modular
think of it like a page table or something, you run word.exe... is it on the computer? no (miss) check the online location and download... now inside word user clicks insert word art.. is it on computer? no (miss) download from remote path and cache it
user clicks spell check .. spell check on computer? yes! check version with online version, is there update? no! keep using this one
that kinda thing. sounds fun! and complicated to implement :D
but really a tech support department's dream since all you have to say for a user to patch is: connect to the internet :|
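The page-table analogy maps onto a small cache with a version check. Here is a toy Python sketch of it; the `FeatureCache` class, the feature names and the dict standing in for the remote server are all hypothetical, and a real implementation would need networking, signatures and error handling.

```python
class FeatureCache:
    """Fetch program features on first use; re-fetch when the server is newer."""

    def __init__(self, server: dict):
        self.server = server   # feature name -> (version, payload); stands in for a remote site
        self.local = {}        # same shape: what is installed on this machine
        self.downloads = []    # log of fetches, kept only for illustration

    def use(self, name: str) -> bytes:
        remote_version, payload = self.server[name]
        cached = self.local.get(name)
        if cached is None or cached[0] < remote_version:
            # miss, or a stale local copy: download it and cache it
            self.local[name] = (remote_version, payload)
            self.downloads.append((name, remote_version))
        return self.local[name][1]


server = {"spellcheck": (1, b"spellcheck v1"), "wordart": (1, b"wordart v1")}
cache = FeatureCache(server)

cache.use("spellcheck")                      # miss: downloaded
cache.use("spellcheck")                      # hit: nothing fetched
server["spellcheck"] = (2, b"spellcheck v2")
cache.use("spellcheck")                      # stale: fetched again
print(cache.downloads)
```

Features that are never used ("wordart" here) are never downloaded at all, which is the bloat saving the post is after.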
-judging another only defines yourself
no
text
Patches are by definition deltas (== a difference between two things, comes from physics).
Vendors only mis-used the term for whole-file downloads/updates.
I've downloaded huge files including SP2 over a shitty rural modem connection (right now I'm at 28.8. Sometimes, when the planets are in the correct alignment and Satan has accepted my sacrifice, I even connect at 41.1. Sadly, it actually does feel noticeably faster). The key is patience! A modem is a very good teacher of patience.
Smoking is an expensive, slow, and unreliable method of suicide.
I don't think it will happen on Linux for the reason that it is "too free".
o Gentoo - builds from sources so you can't ship binary diffs
o RPM based - symlinking and nature of open source (lots of individuality between systems running the same version of OS; such as workarounds and such)
o APT-GET - similar to RPM
o Others - wouldn't know but it just doesn't sound feasible
Some may call this insignificant, but when you have to patch a kernel for a vulnerability then every minute could be important. Downloading a 30MB RPM to hundreds of systems, I don't like that... Well, binary diffs are definitely a better idea.
On Windows this comes in extremely useful with anti-virus definitions - I know Norton used to have huge downloads, whereas Sophos used binary diffs that would significantly shorten exposure to new viruses (especially for big corporations with hundreds and thousands of desktops).
On Linux it's certainly possible but because of the way it is, it may take years before that can be done reliably...
That's a standardization advantage that closed and semi-closed OS'es enjoy.
What in god's name are you talking about? If Mandrake does not have a package, you build one.
That is the proper way to maintain an rpm distribution. And it would serve you well to learn the power of urpmi.
By the way, when was the last time that you failed to find a package for Mandrake? Put together contrib, plf, elsac, just to name a few, and you have thousands of packages available.
Do you need to berate other distributions to feel better about whatever it is you run, Gentoo in this case?
Pragmatism as an ideology is not particularly pragmatic in the long term. Keep it in mind when you dismiss Free Software
Some distros do already. I hope the expectations are better than a factor of fifty, cause all largish systems are patching out of control.
Enter "deltup", a tool that looks at two tarballs and gives you a diff between the two that you can use to "transform the old tarball to an exact copy of the new tarball"; it even preserves MD5 checksum compatibility. Now some enterprising Gentoo user has created a "dynamic deltup server" that automates the creation of these delta files, and people can reuse the delta files that other people used.
Using this technique in combination with Gentoo portage, people can reduce their traffic by 75% on average.
Have a look at the following URLs for more information:
http://forums.gentoo.org/viewtopic.php?t=215262
m e.html
http://linux01.gwdg.de/~nlissne/deltup-status.ati
Rigolo
Well you know *snort* us gentoo users are just the most 3733t mother fuckers on the planet.. we just have to say gentoo rules everything so we can remind you stupid fuckers we took days to compile all of our software from the source! Because it is faster and you are slow.
urpmi is stupid emerge rulez. urpmi only installs binaries emerge can take days to make a binary file
get with the program get with gentoo, but by that time i will move to hurd, just to be a bad ass like i am now saying that i run gentoo...
is going to present Microsoft a lot of problems in the future I think. All it takes is for some virus/malware to get funky with a file that is targeted by a patch and the patch application will fail.
Think about it:
step 1. a worm gets hold starts taking over blaster style - and changes the file used to break into the system in random ways
step 2. microsoft releases a binary difference patch to correct the issue
step 3. patch fails because target file does not match original hash
step 4. virus continues unabated
step 5. ???
step 6. profit?
I have already played around a lot with the tech MS is using to do this - it is built into Windows Installer - and it is EXTREMELY finicky.
SP2 supposedly includes Windows Installer 3.0 - maybe things are different there I haven't checked yet.
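For what it's worth, the failure in step 3 is usually deliberate: a binary patch only makes sense against the exact bytes it was built from, so a careful patcher checks a hash first and refuses otherwise. A minimal Python sketch of that guard follows. `apply_binary_patch` is a hypothetical placeholder, the patch layout is invented, and the fall-back-to-full-file behaviour is my assumption of a sensible design; none of this reflects how Windows Installer actually works.

```python
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def apply_binary_patch(old: bytes, patch: dict) -> bytes:
    """Hypothetical stand-in for a real patcher: overwrite byte ranges."""
    out = bytearray(old)
    for offset, replacement in patch["writes"]:
        out[offset:offset + len(replacement)] = replacement
    return bytes(out)


def safe_update(installed: bytes, patch: dict, full_file: bytes) -> tuple:
    """Apply the delta only if the installed file is the expected one;
    otherwise fall back to the complete replacement file."""
    if sha256(installed) == patch["expected_old_sha256"]:
        patched = apply_binary_patch(installed, patch)
        if sha256(patched) == patch["expected_new_sha256"]:
            return "delta", patched
    return "full", full_file


old = b"\x90" * 16 + b"VULNERABLE" + b"\x00" * 16
new = old.replace(b"VULNERABLE", b"FIXED-CODE")
patch = {
    "expected_old_sha256": sha256(old),
    "expected_new_sha256": sha256(new),
    "writes": [(16, b"FIXED-CODE")],
}
tampered = b"\xcc" + old[1:]   # a worm has modified the installed file

print(safe_update(old, patch, new)[0])
print(safe_update(tampered, patch, new)[0])
```

With a full-file fallback the worm scenario costs bandwidth rather than leaving the hole open, though that of course assumes the updater itself still runs.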
According to the link http://www.daemonology.net/bsdiff/ you get binaries 50%-80% smaller - not 50 times smaller. Obviously 50% is still good ... :)
Here's a snippet from a patch description used by the online update, to give you some idea of what it does:
This signature is not in the public domain.
MPEG basically makes use of this concept, only encoding the differences from frame to frame. I always wondered if this couldn't be used on a series of similar photos. Many model photoshoots tend to be a huge number of very similar photos. Could they not all be encoded as a series of diffs? Start with photo one, then store a diff between photo one and 2, then the diff between photo 2 and 3, etc. I would think this could have some really dramatic space savings. You maybe could even have the software automatically sort the pics, choosing the most similar photos to diff, but I assume this would require diffing every pic with every other, a nasty O(n^2) problem.
Smoking is an expensive, slow, and unreliable method of suicide.
Apparently, most distributions have you download the entire package for each update, although there are efforts underway to break up sections a bit more (if I'm wrong, I apologize - I use BSD).
It sounds like what's really needed is to build packages of just the updated files. The install manifest would just specify the files in the archive, so there shouldn't be any complaints about missing files. Or does that show my ignorance?
Actually, if you wanted a more general scheme, the update server would build packages on the fly. The updater would send a list of files in a package to the server, which would return a set of files that need updating. You could use this to upgrade any system, regardless of distribution. You would just have to update the database for whatever to show that it was at the latest version.
Yes, this would take up a lot more CPU time, and be pretty slow on response time, but the savings in bandwidth should be worth it. All the time the servers were waiting for the network card could be used compressing files.
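The "client sends a file list, server answers with what changed" scheme can be sketched in a few lines of Python. Everything here is hypothetical: the `manifest` and `files_to_send` names, the paths, and the choice of SHA-256 as the fingerprint are assumptions for illustration, not any distribution's protocol.

```python
import hashlib


def manifest(files: dict) -> dict:
    """Map each path to a hash of its contents (what the client uploads)."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def files_to_send(client_manifest: dict, server_files: dict) -> dict:
    """Server side: return only files the client lacks or has stale copies of."""
    latest = manifest(server_files)
    return {
        path: server_files[path]
        for path, digest in latest.items()
        if client_manifest.get(path) != digest
    }


client_files = {
    "/bin/ls": b"ls build 1",
    "/lib/libc.so": b"libc build 1",
}
server_files = {
    "/bin/ls": b"ls build 1",             # unchanged
    "/lib/libc.so": b"libc build 2",      # security fix
    "/usr/bin/newtool": b"newtool build 1",
}

update = files_to_send(manifest(client_files), server_files)
print(sorted(update))
```

Only the manifest goes up the wire, and only changed files come back, which is where the bandwidth saving comes from; the server pays for it in hashing and packaging work per request.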
Java: the bastard demon spawn of C++ and Ada
Sure it is always nice to have faster downloads. But is it worth the extra work involved in setting this up both at the distribution point and on the client side?
I am not being rhetorical. I am just wondering.
WTF are you compiling on a server that takes days to finish? Odds are it'll be on a timescale of minutes/10s of minutes, not hours. Unless it's X or KDE or something, shit doesn't take that long to compile on a decent spec system.
PAR is good for protecting your files, and as far as I understand PAR2, it could be used in the same way to update older file versions to new ones, right?
http://parchive.sourceforge.net/
That would save some bandwidth...
More to the point... As this is an issue generally for just 56k users, the linux kernel would need (proper) support for modems (read:winmodems) before this would be as effective as intended.
I like the sound of using delta compression for making an "upgrade" iso. Say I wanted to upgrade from fedora core 2 to fedora core 3, I could just download an upgrade iso at a fraction of the size etc.
I'm quite sure Novell has been doing it in the past. At least with the older versions of Netware (3.x and 4.x versions).
You had the whole Novell NOS plus a couple of services in, let's say, 100 MB or so, and you needed to update tons of NLMs. You just needed to download a quite small patch file (over a POTS line) that usually could fit on a floppy or so, and then it updated the loose NLMs.
Nothing new I guess..
Yep. Gentoo can't use binary diffs, because they'll be guaranteed to fuck up the system.
;)
There are several reasons for this:
1. CFLAGS. The users can adjust the optimizations of every binary compiled. Which means that *many* systems will not have identical binaries.
2. Varying versions of gcc/binutils. GCC will produce different binaries depending on which revision you have used. So unless *everyone* uses the same version of GCC, you can be damn sure things will get stuffed.
3. USE-flags. Every user can adjust the dependencies of every single package by adjusting the USE-flag. Which results in differences in the binary.
Now, *source* patches would make a great difference. But at the moment, portage isn't intelligent enough to do this. I know there has been some discussion about this, and I was actually asked to submit this as a bug to bugzilla. I haven't done this yet (*blush*).
I will submit it later today, unless it's already been done
Of course not direct binary patches, since that'd be virtually impossible to implement on gentoo (Every system has different binaries).
( downloads only a patch to the sources you already have)
But *do* read this: http://forums.gentoo.org/viewtopic.php?t=215262
Even if it did take "days at a time" to compile a simple security fix... (which I've never seen take more than 30 minutes)
It still beats the hell out of waiting months for Microsoft to get around to deciding to do something about the gaping security holes they have in their offering, and praying that they didn't create a bigger problem with the fix than the original problem.
While implementing this is nothing for the faint of heart and would suggest some huge resources on the server side at first sight, it would make xdelta updates possible, while taking care of all the different package combinations on all the possible systems to update.
Now, would SuSE/Novell, RedHat or whoever please hire me in order to do this? ;)
Any comments?
As somebody pointed out before in this article, there is rsync, which minimizes transmitted data using some xdelta-like algorithm. This is not really new, and some sites offer anonymous rsync downloads for exactly this reason.
(Rumours were that some people actually use rsync in the following way to get the latest Debian ISOs from a collection of old, already downloaded packages: They cat'ed all their packages together into one huge binary file and then ran rsync against the remote ISO image and that local file. Since most data was already in that file, only transmission of a few megabytes of new data and some data arranging had to happen....)
Here I uttered a few quick&dirty thoughts (which most certainly somebody else has had before, as usual) on how rsync could help in mass patching, don't know if they are worth reading for you... .)
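To show why the cat-the-packages trick can work, here is a toy sketch of block matching in the spirit of rsync. It is a simplification, not rsync itself: real rsync uses a cheap rolling checksum plus a strong hash and much larger blocks, while this version rehashes every window with SHA-256. `BLOCK`, `block_index`, `make_delta` and `apply_delta` are names invented for the example:

```python
import hashlib

BLOCK = 8  # tiny block size for the demo only

def block_index(local):
    """Map the hash of every aligned block of the local file to its offset."""
    index = {}
    for off in range(0, len(local) - BLOCK + 1, BLOCK):
        index.setdefault(hashlib.sha256(local[off:off + BLOCK]).digest(), off)
    return index

def make_delta(remote, index):
    """Describe `remote` as ('copy', offset) and ('data', bytes) instructions."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(remote):
        chunk = remote[pos:pos + BLOCK]
        off = index.get(hashlib.sha256(chunk).digest()) if len(chunk) == BLOCK else None
        if off is not None:
            if literal:  # flush pending new bytes before the copy
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", off))
            pos += BLOCK
        else:
            literal.append(remote[pos])
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta

def apply_delta(local, delta):
    """Rebuild the remote file from local blocks plus the transmitted bytes."""
    out = bytearray()
    for op, arg in delta:
        out += local[arg:arg + BLOCK] if op == "copy" else arg
    return bytes(out)

local = b"AAAAAAAABBBBBBBBCCCCCCCC"      # old packages, concatenated
remote = b"BBBBBBBBnewCCCCCCCCAAAAAAAA"  # target image: same blocks reordered, plus new data
delta = make_delta(remote, block_index(local))
sent = sum(len(arg) for op, arg in delta if op == "data")
```

The order of the blocks in the local file doesn't matter, which is why a pile of packages can stand in for the old ISO; only the bytes that match nothing locally have to cross the wire.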
I did my final year project / dissertation on delta compression and created a java web service & GUI to allow the distribution of delta files that users could download and apply.
It still requires a fair bit of work to make it very usable (hardly the best software engineering development ever), the Swing code is awful because I had to learn it in a week and it could do with some object serialisers on the data it returns. It worked ok though.
If anyone's interested they can read my report [PDF] (2.3MB). The point of doing that project was that it is a technology that was massively underexploited. It is quite limited for some things however, especially compressed archives and to a certain extent compiled binary files. However if you want to compress tars of source code it's brilliant and massively improves over zip technology.
The package I used at the time was a java port of the xdelta project, javaxdelta. It had some bugs in it at the time however which meant that it didn't always work; I think from the discussion on their mailing list that they've been fixed recently. I don't think it's as fast as the normal C++ xdelta implementation and xdelta as an algorithm isn't as good as some others, notably vcdiff and zdelta (see Suel & Memon, Algorithms for Delta Compression and Remote File Synchronization (2002)).
I'm happy though, there may be some money making opportunities for my project =)
I did something like this for 10.0 (I wanted the patch indexes to be incremental) but my lone voice wasn't enough. If other people join Cooker, link to the article, and express themselves it may come to pass for 10.2 (10.1 is too far down the pipe already).
It's not as simple as it seems, 'coz those RPMs are bzip2-compressed internally, so a simple binary diff isn't likely to help much; they'll need to do a special stream of RPMs that are binary diffed before compression, which will probably require (more) surgery to RPM as well.
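A quick way to see the problem for yourself, as a sketch with a made-up payload (not an actual RPM): change a few bytes in the uncompressed data, compress both versions with Python's `bz2` module, and count how many byte positions differ in each form.

```python
import bz2

# Hypothetical package payload: 2000 numbered lines of text.
old = b"".join(b"line %d of the package payload\n" % i for i in range(2000))
# Change three bytes in one line; the length stays the same.
new = old.replace(b"line 1000 ", b"line 1OOO ", 1)

def differing(a, b):
    """Count positions where two byte strings differ, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

raw_diff = differing(old, new)
packed_diff = differing(bz2.compress(old), bz2.compress(new))

print("uncompressed bytes changed:", raw_diff)
print("compressed bytes changed:  ", packed_diff)
```

I'd expect the compressed count to come out well above the uncompressed one, since a small change upstream ripples through the rest of the compressed stream, but run it rather than take my word for it. That ripple is why the diff wants to happen before compression.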
Got time? Spend some of it coding or testing
Not all diffs are alike. The simplest diffs are literal - "find string of bytes to replace and replace them with newbytes."
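For what it's worth, the literal kind fits in a few lines. This is a sketch with a hypothetical helper name, `apply_literal_patch`, and it refuses to patch unless the old bytes occur exactly once, so it can't silently hit the wrong spot:

```python
def apply_literal_patch(data, old, new):
    """Replace the single occurrence of `old` with `new`; refuse if absent or ambiguous."""
    count = data.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return data.replace(old, new, 1)

# Made-up file contents and patch bytes.
binary = b"header\x01\x02\x03payload\x0a\x0btrailer"
patched = apply_literal_patch(binary, b"\x0a\x0b", b"\x0c\x0d")
```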
Back in 1985, Apple did something a bit more sophisticated for their System Update 2.0 patch.
Their binary files were structured. The particular structure was called a "resource fork" and had a collection of hundreds or thousands of usually-tiny "resources" which could be individually modified. As a made-up example, replace String ID 50 in the file "System" to "Version 2.0" where it previously was "Version 1.0" or replace one linked-in graphic with another.
The patch program updated, deleted, and added individual resources.
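A sketch of what a structured patch of that kind might look like, modelling the file as a map from (type, ID) to resource data. This is an illustration of the idea only, not Apple's actual format or tool; the resource keys, the `apply_resource_patch` name and the patch layout are all assumptions:

```python
def apply_resource_patch(resources, patch):
    """Return a new resource map with each (op, key, value) step applied in order."""
    result = dict(resources)
    for op, key, value in patch:
        if op == "update":
            if key not in result:
                raise KeyError(f"cannot update missing resource {key}")
            result[key] = value
        elif op == "add":
            if key in result:
                raise KeyError(f"resource {key} already exists")
            result[key] = value
        elif op == "delete":
            del result[key]
        else:
            raise ValueError(f"unknown operation {op!r}")
    return result

# Made-up contents of a "System" file.
system_file = {
    ("STR ", 50): b"Version 1.0",
    ("STR ", 51): b"obsolete",
    ("ICON", 128): b"<old icon>",
}
patch = [
    ("update", ("STR ", 50), b"Version 2.0"),
    ("delete", ("STR ", 51), None),
    ("add", ("ICON", 129), b"<new icon>"),
]
updated = apply_resource_patch(system_file, patch)
```

The patch only has to carry the resources that changed, however big the rest of the file is.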
This is important for historical reasons:
If Microsoft or anyone else gets the funny idea to patent "replacing parts of a structured file in an update mechanism" as a broad-scope patent and the USPTO grants the patent [which they probably will, out of ignorance], the patent will need to be challenged and narrowed significantly.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Senator Frist, your Psot would be more "Fristy" if you weren't so damn slow.
I'm not berating Mandrake. I've used both Mandrake and Gentoo and I'm giving my personal opinions on their packaging systems. I had times with Mandrake when I couldn't find precompiled binaries for packages and I had to go back to hunting down dependencies and dealing with "./config" errors.
I've found that portage offers a lot more packages that are more up to date.
My concern is people hear mandrake is the easiest distro, try it out, and go back to windows prematurely because they figured if they had trouble with mandrake, they'd surely have more trouble with other distros. I just want people to experiment around. I did, and I feel my opinions on the matter could help people out if they were frustrated with mandrake as I was.
RPM can support this. You need to package the RPM with rsync-friendly gzip, then on the target box assemble the bits you have on disk from the original and rsync the two. That's CPU-costly for the server end, unfortunately.
Rusty did a talk on the same things with dpkg a few years ago using rsync friendly dpkg formats to cut down international costs for Debian mirrors in AU
One reason could be that gzip can be made to produce rsync friendlier archives.
Often, changing even a small part of an archive will change it sufficiently that rsync is forced to retransfer the entire lot. However, there is a patch for gzip (which I believe is in Fedora Core 2 and is scheduled for mainline gzip inclusion too) that produces rsync-friendly archives at the cost of slightly larger archives.
I already do this. The utility I use to apply the patches is called "make".
Or that gentoo-emerge thingie... But isn't that only source?
Redhat's updater does that too... and I know it's binary...
But the point is that concept of upgrading 'parts' is already here..
---- Booth was a patriot ----
The method I came up with was to use essentially the rsync algorithm, but I reversed it so that the computationally expensive parts were performed on the client side. The results of each computation were stored on the server side as a "patch" so that the computation was performed only once. This provided a patch system that was dynamic but without generating large server load.
The advantages are:
1. Patches are generated dynamically so the files can be in any state (truncated, too big, filled with garbage, missing entirely, etc).
2. The heavy computation is performed on the client side so that patch generation does not drive up server load.
3. Computed patches are stored and reused, so a database of patches is quickly built up.
4. Patches are efficient (based on binary diff).
A detailed example follows; knowledge of rsync is required for understanding.
For example, let's suppose you are releasing a new content upgrade. A particular file's signature has changed from F4A3 to 26B1. (For brevity I am using a 16-bit signature, in practice it is much much larger.) When the first client connects to the patch server it receives the updated list of file signatures. The client notices that the file is now old so it requests a patch for F4A3->26B1. The patch does not exist yet, so the reverse-rsync algorithm activates and the client calculates the F4A3->26B1 patch. When that patch has been generated it is returned to the patch server and all future clients can just download the patch and skip the reverse-rsync. After applying a patch, the result is checked to make sure that you actually ended up with 26B1. If you did not, extra rounds of patching are performed.
Some Notes: These extra rounds are consequences of rsync and a security check as well to prevent bad patches from being uploaded to the server. And, normally the release maintainer would "pre-seed" the patch server by patching a clean current version to the new version just before release.
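The caching part of the scheme above can be sketched as follows. Two loud assumptions: the delta step here is deflate with the old file as a preset dictionary (via `zlib`), standing in for the reverse-rsync computation, and a failed check simply raises instead of running extra rounds. `PatchServer`, `update`, `make_patch` and `apply_patch` are hypothetical names, and the signature is a truncated SHA-256 for brevity:

```python
import hashlib
import zlib

def signature(data):
    """Short file signature (truncated for the demo; use the full digest in practice)."""
    return hashlib.sha256(data).hexdigest()[:8]

def make_patch(old, new):
    """Stand-in delta: compress `new` using `old` as a preset dictionary.
    zlib only looks at the last 32 KB of the dictionary."""
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()

def apply_patch(old, patch):
    """Inverse of make_patch; fails if `old` is not the file the patch was built from."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(patch) + decomp.flush()

class PatchServer:
    """Holds the new release and a cache of patches keyed by (old_sig, new_sig)."""
    def __init__(self, new_file):
        self.new_file = new_file
        self.new_sig = signature(new_file)
        self.patches = {}
        self.misses = 0  # how many times a client had to compute a patch

def update(client_file, server):
    """Client side: reuse a cached patch, or compute and upload one on a miss."""
    old_sig = signature(client_file)
    if old_sig == server.new_sig:
        return client_file
    key = (old_sig, server.new_sig)
    patch = server.patches.get(key)
    if patch is None:
        # The expensive step runs on the client, then the result is shared.
        patch = make_patch(client_file, server.new_file)
        server.patches[key] = patch
        server.misses += 1
    result = apply_patch(client_file, patch)
    if signature(result) != server.new_sig:
        raise ValueError("patch did not produce the expected signature")
    return result

old_release = b"content pack 1.0: " + b"lorem ipsum dolor sit amet " * 40
new_release = old_release.replace(b"1.0", b"1.1", 1) + b"plus one new level"
server = PatchServer(new_release)
first = update(old_release, server)   # cache miss: this client computes the patch
second = update(old_release, server)  # cache hit: this client just downloads it
```

Because the cache key includes the old signature, a truncated or garbage file just shows up as a different key and gets its own patch the first time it is seen.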
Rather than giving the computer the data, or even
a delta patch from old data to new data, just tell
it what 2k data blocks change to what new values,
where the set of data block values is indexed by a
hash and registration index. Store a map between
and data blocks in
DNS.
OK, so I'm only half-joking.
-I like my women like I like my tea: green-
SuSE provides .patch.rpm files, which are RPMs containing only the changed files. It's not a binary diff though.
If it's a security patch, it's probably a good idea, since compiling can take a while and you don't want to leave your box on the net while there's a known vulnerability. On the other hand, you can just cross compile on another server (or better yet, your personal workstation, where the CPU time probably isn't needed and you can work interactively with the files quickly and offline), and upload the change later. Still doesn't eliminate the need to take the box down to keep out the script kiddies, but it does avoid the problem of loading down the server.