Subversion as Automatic Software Upgrade Service?


Posted by Cliff on Friday September 16, 2005 @09:20AM from the thinking-out-of-the-box dept.

angel'o'sphere asks: "I'm working on a contract where the customer wants an automated, Internet-based check-for-updates, update and install system. So far we've considered a Subversion-based solution. The numbers are: a typical upgrade is about 10MB in size. Usually it's about 30 to 50 new files (which have an average size of about 200kB) and 2 database files (which can be anywhere from 500MB to 2GB) that change regularly. Upgrades are released about every 3 months, and this will probably become more frequent as the system matures. The big files are the problem as we estimate about 100-300 changes in every file. The total user base is currently 2000 users, creeping up to probably 5000 over the next year, and might finally end up at some 30,000 users. Any suggestions from the crowd about setting up a meaningful test environment? How about calculating the estimated throughput of our server farm? Does anyone know of projects that have tried something similar using an RCS or a configuration management system?" "We want to support as many concurrent users as possible (bandwidth is not an issue). We use an Apache front end as a load balancer and as many Subversion servers as necessary on the backend. My largest worry, from my calculations, is disk access on the Subversion server. We could not run meaningful tests, because a typical PC kills itself if you try to run more than 4 or 5 parallel Subversion clients doing an upgrade (due to insanely high disk IO, and high seek times)."

41 comments

rsync by ¡ · 2005-09-16 09:31 · Score: 4, Insightful

Why not use rsync instead of Subversion? Subversion wasn't really designed for this, whereas rsync is used for mirroring and syncing large repositories all over the place all the time.
1. Re:rsync by davecb · 2005-09-16 10:17 · Score: 1
  
  Actually use both: distribute binaries as binaries, configuration and xml files as subversion files, both via rsync.
  On the customer site, run a script that applies a visual file merge to any config files that have changed both places. The customer will have a good chance of recognizing changes they've made, and if there are clashes will tend to call you on the phone and ask what to do next.
  --dave
  
  --
  davecb@spamcop.net
2. Re:rsync by photon317 · 2005-09-16 10:36 · Score: 1
  
  I tend to agree with the parent. You might want to do version control on your software releases with subversion, but ultimately you should check out the new stable copy you want everyone upgraded to and then distribute it via other means, like rsync. rsync is a particularly good choice because it will only send the minimum amount of data necessary to get the job done efficiently.
  
  --
  11*43+456^2
3. Re:rsync by commanderfoxtrot · 2005-09-16 12:10 · Score: 3, Informative
  
  Subversion uses binary diffs in a similar way to rsync. The original poster pointed out bandwidth was not an issue- therefore any bandwidth advantages rsync gives (and yes, there are plenty) are meaningless.
  
  Subversion gives excellent control (tags anyone?) of binary installations. We use it for things way beyond the usual source code storage.
  
  I have also found disk IO is the main killer. I would suggest looking into caching. The subversion client sends straightforward HTTP commands to the server. I have a custom PostgreSQL backend which does some caching; in his place, I would have a Squid set up to cache some basic data fetches. Obviously, you need to be careful to not cache old data but that's not hard.
  
  So yes, Subversion is excellent for this, and with a little thought, the heavy disk IO can be reduced. Cache, cache, cache.
  
  --
  http://blog.grcm.net/
4. Re:rsync by mcclungsr · 2005-09-16 13:37 · Score: 1
  
  If it's economically feasible in this case, I would suggest a better disk subsystem. The more spindles, the better. Something fibre channel, if possible. A memory size large enough to get to a supercached state will certainly help, but disks are cheap in quantity and using more of them in a RAID configuration is an orthodox solution to high service times.
When all you have is a hammer.... by wowbagger · 2005-09-16 09:35 · Score: 1, Redundant

This sounds to me a bit like "All I Have Is A Hammer, So Everything Is A Nail".

You want to update large files over the 'net, files which have changes in the middle of the file.

Why use Subversion? Why not use rsync?

--
www.eFax.com are spammers
Transfer file to compare, then change file by Hey,+Retard... · 2005-09-16 09:43 · Score: 2

Sounds like twice the work for thrice the price.
1. Re:Transfer file to compare, then change file by ElGameR · 2005-09-19 05:52 · Score: 1
  
  Not really. To compare, all you need is a hash of the file, which is much smaller than the average file. And if the hashes match, there's no need to send the whole file.
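  A minimal sketch of that comparison, in Python. It assumes the server publishes a SHA-256 digest per file; the function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_transfer(local_path, remote_digest):
    """True when the local file is missing or differs from the server's digest."""
    try:
        return file_digest(local_path) != remote_digest
    except FileNotFoundError:
        return True
```

  Reading in chunks keeps memory use flat even for the 2GB database files; only the 64-character digest has to cross the wire before deciding whether to transfer anything.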
Rsync? by Karora · 2005-09-16 09:45 · Score: 3, Informative

Wouldn't Rsync be better for what you want? Why do you need to be able to choose different versions to fetch?
If the files contain parts that are constant along with parts that vary then rsync will in many cases only transfer the partial file. With Subversion that won't apply for binary files, but rsync will still recognise partial matches even on those.

--

...heellpppp! I've been captured by little green penguins!
1. Re:Rsync? by bran880 · 2005-09-16 10:14 · Score: 1
  
  also, in my experience svn is slow with large files.
2. Re:Rsync? by halfnerd · 2005-09-16 23:31 · Score: 1
  
  Subversion does employ a binary delta algorithm, xdelta. Older versions used some different algorithm, but that was also capable of binary deltas.
3. Re:Rsync? by Anonymous Coward · 2005-09-18 06:13 · Score: 0
  
  Subversion uses a binary delta algorithm (another poster mentioned xdelta); the issue you're thinking of applies to CVS, which only deltas text (using diff, essentially).
  
  I agree that rsync is a better solution here, but in theory, Subversion could actually beat rsync. rsync needs to compute what portions of the file are the same on every invocation, which it does using hashes, while Subversion can rely on previous revision information to send the same information based on information stored on the servers only.
  
  This post actually hits most of the points I raised:
  http://ask.slashdot.org/comments.pl?sid=162475&cid=13580773
  
  So while Subversion wasn't designed for this task, you could certainly improve on rsync by adding revision history, at least on the "push" side.
  
  Incidentally, another problem with Subversion is that it stores local copies of the working files (to allow local revert), essentially doubling storage requirements. For a huge database, that's a really bad thing. :)
times two by Lord+Bitman · 2005-09-16 09:48 · Score: 3, Informative

remember that svn always uses more than double the actual space required to hold the files for a "working copy". For "one-way" updates, svn is _NOT_ the answer.

--
-- 'The' Lord and Master Bitman On High, Master Of All
1. Re:times two by abartlett_219 · 2005-09-16 10:27 · Score: 1
  
  double the actual space required to hold the files for a "working copy"
  
  True. However, using an export (svn export), you can just get a non-working copy of the code.
  
  rsync is probably a better solution anyway. If you want to track what went into each release, maybe a subversion backend, with a cronjob to update everything to a rsync server.
2. Re:times two by saurik · 2005-09-16 22:08 · Score: 2, Insightful
  
  By non-working it should be noted that you also mean non-upgradable. Once you do an export, you can't do an update, which makes that feature useless for this purpose.
3. Re:times two by commanderfoxtrot · 2005-09-17 04:49 · Score: 1
  
  They're also looking at using compression in upcoming versions for the local "hidden" originals.
  
  --
  http://blog.grcm.net/
4. Re:times two by angel'o'sphere · 2005-09-17 06:41 · Score: 1
  
  We know that and we accept that.
  
  Even worse, we make (as client configuration option) a third copy to allow a local rollback to reverse changes without need for accessing the upgrade server via the internet.
  angel'o'sphere
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
If this was in java... by hexghost · 2005-09-16 10:03 · Score: 3, Insightful

You would use java web start. Maybe you should consider writing something like it for this project?
Apt? by cortana · 2005-09-16 10:07 · Score: 2, Funny

Can the clients run dpkg and apt? A daily apt-get update && apt-get upgrade is very convenient. Server-side, you don't need anything more complicated than a web server.
1. Re:Apt? by Jussi+K.+Kojootti · 2005-09-18 19:59 · Score: 1
  
  Why is this funny?
  It's not very convenient though: apt doesn't do binary diffs as far as I know, so the 2GB file would have to be downloaded every time it's changed... With 30000 users that would be 60 terabytes per update.
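  The arithmetic, spelled out as a quick Python check. The user count, file size and typical upgrade size come from the story; decimal units (1 TB = 1000 GB) are an assumption made here.

```python
# Back-of-the-envelope transfer volume per release (decimal units).
USERS = 30_000            # projected user base from the story
FULL_FILE_GB = 2          # worst-case database file, sent whole
TYPICAL_UPGRADE_MB = 10   # typical upgrade size from the story

full_download_tb = USERS * FULL_FILE_GB / 1000
delta_download_gb = USERS * TYPICAL_UPGRADE_MB / 1000

print(f"whole file: {full_download_tb:.0f} TB per update")
print(f"deltas only: {delta_download_gb:.0f} GB per update")
```

  So shipping only the changes would be roughly 300 GB per update instead of 60 TB, which is why a binary-diff step matters even when bandwidth is "not an issue".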
Not Subversion by the+eric+conspiracy · 2005-09-16 10:11 · Score: 2, Insightful

rsync is excellent at this, and rdist can have benefits too if you are updating a bunch of servers at once.
How about bsdiff/patch and some scripts? by Fweeky · 2005-09-16 10:14 · Score: 5, Interesting

This is the technique used by portsnap; basically you generate binary diffs from a known starting point, and the client keeps track of what new patches it needs to keep in sync. Since you're just serving static files, scaling it should be as easy and cheap as it gets.

rsync is highly general purpose; your servers will end up generating hashes for every n-bytes of every file for every client, which is a lot more heavyweight than just serving patches you generate once. Subversion may be more efficient since it should know something about the files it's checked out previously, but it's still going to end up dynamically generating diffs between whatever versions each client has and the latest; this likely gets worse if your clients aren't tracking HEAD.

Also note that a custom solution can likely get away with a single tag file detailing the latest patches; rsync and svn are going to be scanning their directory trees religiously. Both you and your users will probably appreciate a single GET to a small file on a webserver more than a load of CPU use and disk thrashing.
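A rough Python sketch of the client side of such a scheme. The tag file layout (a mapping from each version to the next version and its patch file) and the file names are assumptions for illustration; this is not how portsnap actually stores its metadata, and applying each patch (e.g. with bspatch) is left out.

```python
def plan_patches(installed, latest, patches):
    """Return the patch file names that take `installed` to `latest`, in order.

    `patches` maps a version to (next_version, patch_file), as it might be
    parsed from a small tag file fetched with a single GET.
    """
    plan = []
    version = installed
    while version != latest:
        if version not in patches:
            raise LookupError(f"no patch chain from version {version!r}")
        version, patch_file = patches[version]
        plan.append(patch_file)
        if len(plan) > len(patches):
            raise LookupError("patch chain loops back on itself")
    return plan


# Hypothetical tag file contents after parsing.
TAG_FILE = {
    "1.0": ("1.1", "db-1.0-1.1.bsdiff"),
    "1.1": ("1.2", "db-1.1-1.2.bsdiff"),
}
```

Because every client on the same version fetches the same static patch files, the server does no per-client work beyond serving them.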
1. Re:How about bsdiff/patch and some scripts? by cperciva · 2005-09-16 12:23 · Score: 1
  
  Yes, this might be the best approach; but it's hard to say without knowing more details.
  
  I think the right solution for the submitter is "talk to someone with experience in this area" -- ideally, me. I'm no longer looking for a job, but I'd still be happy to hear details about a problem and offer my opinion on how best to attack it.
  
  --
  Tarsnap: Online backups for the truly paranoid
Agreed, rsync rocks by Anonymous Coward · 2005-09-16 10:27 · Score: 2, Informative

I have several apps like this. One is deployed to more than a dozen locations around the country, each having roughly 5000 users. It's a mod_perl app on BSD.

My general routine: I have a "development server", and a staging farm (set up exactly like one of the customer's locations, right down to the network hardware). After changes are made and unit-tested, the changes are pushed to the staging servers using rsync. When all the various remaining tests pass, the software is pushed out to a customer's location (if they need to review the changes), or out to all locations.

Note that I use rsync to PUSH changes on a regular schedule. The apps do not ever "phone home".

My rsync script basically copies all the files except for unit tests, photoshop files, data, all that stuff, just the stuff it needs for run-time. It depends on an SSH key (which exists only on two machines and has a passphrase, so a key agent is required). It has a "fan-out" setting which allows up to N machines to be done in parallel.

Also, my app is completely relocatable and cross-platform. I can check it out in any directory on any Mac, BSD, or Linux box and get to work. I can then push my changes directly from that development area to the staging server if needed. I use CVS and Darcs but that's not important, except to note that the rsync script needs to skip those "CVS" or "_darcs" files.

Works great, very powerful. Of course I am leaving out details like choosing CVS tags, database schema migration, restarting/upgrading/installing daemons (hint, if you don't use daemontools, your apps will never be reliable), handling 3rd-party open source packages, pulling in changes that were made on the customer's machine (in an emergency for instance) etc., etc. But rsync is the core of it.
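Roughly what a push script with a fan-out setting could look like, sketched in Python. The exclude patterns, host names and paths are placeholders, not the poster's actual configuration, and it assumes an SSH key agent is already running.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder patterns: version-control metadata plus files not needed at run time.
EXCLUDES = ["CVS/", "_darcs/", "*.psd", "t/"]


def rsync_command(src_dir, host, dest_dir, excludes=EXCLUDES):
    """Build the rsync argument list for pushing src_dir to host:dest_dir over ssh."""
    cmd = ["rsync", "-az", "-e", "ssh"]
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    # The trailing slash makes rsync copy the directory's contents, not the directory itself.
    cmd += [src_dir.rstrip("/") + "/", f"{host}:{dest_dir}"]
    return cmd


def push_all(src_dir, hosts, dest_dir, fan_out=4, run=subprocess.run):
    """Push to every host, at most `fan_out` at a time; return {host: exit code}."""
    def push(host):
        return host, run(rsync_command(src_dir, host, dest_dir)).returncode

    with ThreadPoolExecutor(max_workers=fan_out) as pool:
        return dict(pool.map(push, hosts))
```

Something like `push_all("/srv/app", ["stage1", "stage2"], "/opt/app", fan_out=2)` would then report a non-zero exit code for any host where rsync failed.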
CVS by alexpach · 2005-09-16 10:27 · Score: 2, Insightful

I have been using CVS to manage many different websites and/or projects on various servers. It doesn't store more than it needs (just the CVS folders) and it adds, updates, patches and removes the files according to your repository.

Additionally you can use branches and sticky tags to keep track of files that don't need to be updated, or files that vary from client to client.

It is also easy to trigger an update over ssh or cron.

One downside compared to SVN is the lack of a binary diff mechanism, but I have been able to get by fine without it managing projects up to a GB in size.

Alex
1. Re:CVS by matheny · 2005-09-16 16:10 · Score: 1
  
  CVS updates are not atomic, unlike subversion. If integrity of data is important to your customers, don't consider CVS. As far as using Subversion is concerned, I would be wary of giving customers that type of access to my systems.
2. Re:CVS by angel'o'sphere · 2005-09-17 06:45 · Score: 1
  
  CVS lacks in our eyes easy access via HTTP and by that easy circumvention of firewalls on the client site.
  
  Second drawback is user management on the server.
  
  Regarding binaries, CVS might not be able to merge binaries, and probably its default configuration does not even DIFF them, but: it can do binary diffs!
  
  Also, we can't work without diffs. If everything else failed us, we would likely diff the big files manually and distribute them as a "new release" of a patch file.
  
  angel'o'sphere
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
3. Re:CVS by NuShrike · 2005-09-17 17:29 · Score: 1
  
  CVS is better than SVN here because SVN lacks the 'obliterate', or 'admin -o' ability that Perforce and CVS have.
  
  This is important because you DON'T need to be storing 100 large revisions of your software release in the repo with no way to ever remove it.
  
  Of course CVS sucks when tagging a huge repo, and removing releases is a PITA, but you got no such options in SVN.
MOD PARENT UP! by Anonymous Coward · 2005-09-16 10:31 · Score: 0

Subversion needs a local working-copy, in other words: every file is present 2 times on the client.
Disk Accesses by Anonymous Coward · 2005-09-16 11:13 · Score: 2, Informative

My largest worry, from my calculations, is disk access on the Subversion server.

Put enough ram in your server, and the changed portion will likely fit in cache. If that's not an option, use RAID to speed up disk accesses.

Others have mentioned rsync. You might also consider xdelta.
Disk I/O by pete-classic · 2005-09-16 12:15 · Score: 2, Insightful

Let's see. You have a ceiling of 2.01GB worth of updates. You have disk I/O problems.

Your problem is either that you don't have enough RAM in the system, or you have an OS that doesn't do a rational job of caching disk.

Or both.

-Peter
perhaps by /dev/trash · 2005-09-16 12:47 · Score: 2, Informative

rdiff-backup
cfengine by Anonymous Coward · 2005-09-17 01:25 · Score: 1, Informative

First of all, it's obvious you are not using enough RAM on the servers. Get 8 GB. Don't do the balancing with Apache. If you are using Linux, resort to IPVS instead. For the large database files you'll want to use rsync. After the transfer, though, most likely you'll still need to perform the actual update. That's where cfengine comes in. You set it up to run rsync every N hours, then perform operations (restarting programs, cleaning up, whatever) when there's new data. You can also use it to restart dead instances of your application, etc.
Some clarifications, especially about rsync by angel'o'sphere · 2005-09-17 01:43 · Score: 2, Informative

First of all, thanks for so many replies!

First I'd like to clarify a bit, probably my original question was not clear enough!

The clients of the system are customers. They have Windows PCs as the software runs on Windows. On the server side we need to be able to authenticate every client as there are several region and user level restrictions about who may access which file.

You can assume there are simply 5 to 10 user levels, where a user on level 10 may access everything and a user on level 5 only a subset.

So far SVN looks good:

* authentication via the Apache front end, probably via a LDAP server

* structuring the "download area" into directories with content appropriate to each user level
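A toy Python version of that directory-per-level idea. The `levelN/` naming is invented for illustration, region restrictions are ignored, and in a real setup the check would more likely live in the Apache/LDAP configuration than in application code.

```python
def required_level(path):
    """Extract the minimum user level from a path such as 'level5/maps/eu.db'."""
    top = path.strip("/").split("/", 1)[0]
    if not top.startswith("level"):
        raise ValueError(f"path is not inside a level directory: {path!r}")
    return int(top[len("level"):])


def may_access(user_level, path):
    """A user may read a file when their level is at least the directory's level."""
    return user_level >= required_level(path)
```

With this layout a level 10 user can read everything, while a level 5 user sees only `level1/` through `level5/`.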

Regarding, rsync:

* first of all, I did not know about it :D

* my first investigation indicates several drawbacks

It seems not to run on Windows (without Cygwin); users need to be Unix/Linux users on the server; building a distribution seems "more complicated" than making a tag/version with SVN.

Please consider: from the point of view of the service provider the system is just the same as hosting a huge pile of source code. The starting distribution probably has 3000 files and is about 2.5 GB big.

The users need to have the ability to fall back on an earlier revision in case of errors during distribution.

Users need to be able to upgrade to the latest HEAD (there is only one main trunk anyway).

Regarding performance of SVN, yes we are clear we need to put a lot of RAM into the servers. But we can't get rid of the disk IO it seems, as SVN does not cache requests (in this case all clients always want the same release to upgrade to, and most of the time they either have the previous or the second oldest release installed).

However: alternatives to SVN are very welcome! I only wanted to make clear why we considered SVN in the first place.

angel'o'sphere

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
1. Re:Some clarifications, especially about rsync by NuShrike · 2005-09-17 17:39 · Score: 2, Interesting
  
  Here's a combination of available strategies:
  
  o DON'T use SVN (imo)
  o check out your latest rev to a staging 'folder'
  o rename your previous release 'folder' to backup name
  o rsync the data from your staging 'folder' to all your clients one by one.
  
  If you have issues with the release, just roll back to the previous release 'folder'.
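  The rename-and-roll-back part of those steps might be sketched like this in Python. The folder names are placeholders, and the checkout and rsync steps are deliberately left out.

```python
import os
import shutil


def promote(staging, current, backup):
    """Make `staging` the current release, keeping the old release as `backup`."""
    if os.path.exists(backup):
        shutil.rmtree(backup)        # keep only one previous release
    if os.path.exists(current):
        os.rename(current, backup)
    os.rename(staging, current)


def rollback(current, backup):
    """Discard the broken current release and restore the previous one."""
    shutil.rmtree(current)
    os.rename(backup, current)
```

  Renames within one filesystem are cheap, so the switch itself takes almost no I/O; the cost is keeping a second full copy of the release on disk.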
  
  The other thought is to rsync a .torrent file and use something like bittornado to distribute from your 'staging' folder.
  
  All this should let you get by with a 1GB or less ram master file server, and crappy i/o too.
  
  You figure out a security-scheme to wrap around this.
2. Re:Some clarifications, especially about rsync by angel'o'sphere · 2005-09-18 03:45 · Score: 1
  
  I can't rsync to my clients.
  If at all, the clients can rsync from me, and as rsync does not run natively on Windows, we can't rely on rsync, imho.
  
  Strange, did I use the wrong term? Does none of you have a program with an automated check-for-updates-from-vendor option?
  
  That's what we want to do. A client, over the internet, not via LAN, has to be able to use HTTP!!! and needs to be authenticated, and it's pull and not push distribution.
  
  BitTorrent is completely out of the question, as we have several different access rights and most customers have a firewall which very likely blocks torrents.
  
  But probably we could figure out, like you suggest, a security scheme around this :D
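  For what it's worth, the client half of such a pull-over-HTTP check could be as small as this Python sketch. The `/latest` endpoint, the `have` parameter and the use of HTTP Basic authentication are assumptions for illustration; nothing in this thread specifies them.

```python
import base64
import urllib.parse
import urllib.request


def parse_version(text):
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def update_available(installed, latest):
    """True when the server's latest version is newer than the installed one."""
    return parse_version(latest) > parse_version(installed)


def build_check_request(base_url, installed, user, password):
    """Build (but do not send) an authenticated GET asking for the latest version."""
    query = urllib.parse.urlencode({"have": installed})
    request = urllib.request.Request(f"{base_url.rstrip('/')}/latest?{query}")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request
```

  The request would be sent with `urllib.request.urlopen(request)`; since Basic authentication sends the password almost in the clear, this only makes sense over HTTPS.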
  
  angel'o'sphere
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
3. Re:Some clarifications, especially about rsync by jrockway · 2005-09-18 06:53 · Score: 3, Insightful
  
  > Regarding performance of SVN, yes we are clear we need to put a lot of RAM into the servers. But we can't get rid of the disk IO it seems, as SVN does not cache requests (in this case all clients always want the same release to upgrade to, and most of the time they either have the previous or the second oldest release installed)
  
  Subversion doesn't need to cache requests -- the OS* does this itself. With plenty of RAM, whatever isn't being used by processes is used for cache. If you don't trust the disk caching algorithm, just make a 2.5G ramdisk and copy your files over to that when you want to release them. Then the disk won't be a problem.
  
  * Assuming you're using a Real OS, and not Windows. Don't use Windows for anything that requires speed or reliability.
  
  --
  My other car is first.
4. Re:Some clarifications, especially about rsync by Anonymous Coward · 2005-09-18 12:17 · Score: 0
  
  You may want to look into CVSup. It is used in FreeBSD for users to download the source and ports trees, which many, many users do on a regular basis. It is very efficient (it uses the rsync algorithm internally), and it understands CVS tags, so it would solve the multiple revision problem for you.
  
  The downside for you is that it has not been ported to Windows. You may be able to get it to compile using cygwin (keep in mind that a user does NOT have to have cygwin installed to run a cygwin program -- you just have to distribute the cygwin DLL file).
5. Re:Some clarifications, especially about rsync by eklitzke · 2005-09-18 19:06 · Score: 3, Informative
  
  You may be interested in the Unison project. More info can be found here: http://www.cis.upenn.edu/~bcpierce/unison/
  
  --
  #include ".signature"
rsync on Windows by Kaseijin · 2005-09-18 15:17 · Score: 1

If at all, the clients can rsync from me, and as rsync does not run natively on Windows, we can't rely on rsync, imho.
All one needs to run a Cygwin binary in general is the cygwin1.dll library. rsync in particular requires cygpopt-0.dll from the libpopt0 package. It can be daemonized with srvany.exe and instsrv.exe from the Windows 2003 Resource Kit. You might have to adjust the timestamp window to account for client time zones or the two-second resolution of FAT32, but it doesn't require exceptional wizardry.
Consider CFEngine by garyebickford · 2005-09-20 09:28 · Score: 1

A previous poster mentioned cfengine briefly. If I understand cfengine correctly, it may be just what you're looking for.

Also, if you're the sort who can/does go to conferences, the LISA '05 conference (Dec. 4-9 2005) features several sessions on cfengine by Mark Burgess. (LISA is the "Large Installation System Administration Conference", put on by USENIX and SAGE.) There's also a conference blog, and the tech program info is posted online.

--
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/