Subversion as Automatic Software Upgrade Service?
angel'o'sphere asks: "I'm working on a contract where the customer wants an automated, Internet-based check-for-updates, update and install system. So far we've considered a Subversion-based solution. The numbers are: a typical upgrade is about 10MB in size. Usually it's about 30 to 50 new files (which have an average size of about 200kB) and 2 database files (which can be anywhere from 500MB to 2GB) that change regularly. Upgrades are released about every 3 months, and this will probably become more frequent as the system matures. The big files are the problem as we estimate about 100-300 changes in every file.
The total user base is currently 2000 users, creeping up to probably 5000 over the next year, and might finally end up at some 30,000 users.
Any suggestions from the crowd about setting up a meaningful test environment? How about calculating the estimated throughput of our server farm? Does anyone know of projects that have tried something similar using an RCS or a configuration management system?"
"We want to support as many concurrent users as possible (bandwidth is not an issue). We use an Apache front end as a load balancer and as many Subversion servers as necessary on the backend.
My largest worry, from my calculations, is disk access on the Subversion server. We could not run meaningful tests, because a typical PC kills itself if you try to run more than 4 or 5 parallel Subversion clients doing an upgrade (due to insanely high disk IO, and high seek times)."
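As a rough starting point for the throughput question, the figures in the post can be turned into a back-of-envelope estimate. This is only a sketch: the 10MB update size and the user counts come from the post, but the 24-hour rollout window is an assumption made for illustration, and it says nothing about disk seeks.

```python
# Figures taken from the post.
UPDATE_MB = 10
USER_COUNTS = [2000, 5000, 30000]

# Assumption (not from the post): every user fetches the update within 24 hours.
WINDOW_HOURS = 24


def release_load(users, update_mb=UPDATE_MB, window_hours=WINDOW_HOURS):
    """Return (total GB sent per release, average MB/s over the window)."""
    total_gb = users * update_mb / 1000
    avg_mb_per_s = users * update_mb / (window_hours * 3600)
    return total_gb, avg_mb_per_s


for users in USER_COUNTS:
    total_gb, rate = release_load(users)
    print(f"{users:>6} users: {total_gb:.0f} GB per release, {rate:.2f} MB/s average")
```

The average hides peaks: if most clients check in during the first hour, the same total has to be served 24 times faster.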
Why not use rsync instead of Subversion? Subversion wasn't really designed for this, whereas rsync is used for mirroring and syncing large repositories all over the place all the time.
This sounds to me a bit like "All I Have Is A Hammer, So Everything Is A Nail".
You want to update large files over the 'net, files which have changes in the middle of the file.
Why use Subversion? Why not use rsync?
Sounds like twice the work for thrice the price.
Wouldn't Rsync be better for what you want? Why do you need to be able to choose different versions to fetch?
If the files contain parts that are constant along with parts that vary then rsync will in many cases only transfer the partial file. With Subversion that won't apply for binary files, but rsync will still recognise partial matches even on those.
Remember that svn always uses more than double the actual space required to hold the files for a "working copy". For "one-way" updates, svn is _NOT_ the answer.
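To illustrate the partial-match idea, here is a much simplified sketch of block-based delta transfer. It is not rsync's actual algorithm (rsync uses a rolling checksum plus a stronger hash, and much larger blocks); the function names and the tiny block size are made up for the example.

```python
import hashlib

BLOCK = 4  # tiny block size so the demo is readable; real tools use far larger blocks


def block_index(old: bytes) -> dict:
    """Map the hash of each aligned block in the old file to its offset."""
    index = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        digest = hashlib.sha256(old[off:off + BLOCK]).digest()
        index.setdefault(digest, off)
    return index


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies of old blocks plus literal bytes."""
    index = block_index(old)
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + BLOCK]
        off = index.get(hashlib.sha256(chunk).digest()) if len(chunk) == BLOCK else None
        if off is not None:
            if literal:  # flush pending literal bytes before the copy
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", off))
            pos += BLOCK
        else:
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, value in delta:
        out += old[value:value + BLOCK] if kind == "copy" else value
    return bytes(out)


old = b"AAAABBBBCCCCDDDD"
new = b"AAAABBBBxyCCCCDDDD"  # two bytes inserted in the middle
delta = make_delta(old, new)
print(delta)
print(apply_delta(old, delta) == new)
```

Only the two inserted bytes travel as literal data; everything else is expressed as references to blocks the client already has, which is why an insertion in the middle of a large binary file does not force a full re-download.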
-- 'The' Lord and Master Bitman On High, Master Of All
You would use java web start. Maybe you should consider writing something like it for this project?
Can the clients run dpkg and apt? A daily apt-get update && apt-get upgrade is very convenient. Server-side, you don't need anything more complicated than a web server.
rsync is excellent at this, and rdist can have benefits too if you are updating a bunch of servers at once.
This is the technique used by portsnap; basically you generate binary diffs from a known starting point, and the client keeps track of what new patches it needs to keep in sync. Since you're just serving static files, scaling it should be as easy and cheap as it gets.
rsync is highly general purpose; your servers will end up generating hashes for every n-bytes of every file for every client, which is a lot more heavyweight than just serving patches you generate once. Subversion may be more efficient since it should know something about the files it's checked out previously, but it's still going to end up dynamically generating diffs between whatever versions each client has and the latest; this likely gets worse if your clients aren't tracking HEAD.
Also note that a custom solution can likely get away with a single tag file detailing the latest patches; rsync and svn are going to be scanning their directory trees religiously. Both you and your users will probably appreciate a single GET to a small file on a webserver more than a load of CPU use and disk thrashing.
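A minimal sketch of the client side of such a scheme, assuming the server publishes one small static manifest. The JSON layout, the field names and the patch file names are all hypothetical; portsnap's real formats differ.

```python
import json

# Hypothetical manifest: the single small "tag file" the client fetches with one GET.
MANIFEST_JSON = """
{
  "latest": 4,
  "patches": [
    {"from": 1, "to": 2, "file": "patch-1-2.bin"},
    {"from": 2, "to": 3, "file": "patch-2-3.bin"},
    {"from": 3, "to": 4, "file": "patch-3-4.bin"}
  ]
}
"""


def patches_needed(manifest: dict, installed: int) -> list:
    """Return the pre-built patch files that take `installed` up to the latest version."""
    by_from = {p["from"]: p for p in manifest["patches"]}
    chain = []
    version = installed
    while version != manifest["latest"]:
        step = by_from.get(version)
        if step is None:
            # No known path from this version: fall back to a full download.
            raise LookupError(f"no patch from version {version}")
        chain.append(step["file"])
        version = step["to"]
    return chain


manifest = json.loads(MANIFEST_JSON)
print(patches_needed(manifest, 2))
```

The server does no per-client work here: it generates each patch once and serves only static files, and the client decides what to fetch.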
I have several apps like this. One is deployed to more than a dozen locations around the country, each having roughly 5000 users. It's a mod_perl app on BSD.
My general routine: I have a "development server", and a staging farm (set up exactly like one of the customer's locations, right down to the network hardware). After changes are made and unit-tested, the changes are pushed to the staging servers using rsync. When all the various remaining tests pass, the software is pushed out to a customer's location (if they need to review the changes), or out to all locations.
Note that I use rsync to PUSH changes on a regular schedule. The apps do not ever "phone home".
My rsync script basically copies all the files except for unit tests, photoshop files, data, all that stuff, just the stuff it needs for run-time. It depends on an SSH key (which exists only on two machines and has a passphrase, so a key agent is required). It has a "fan-out" setting which allows up to N machines to be done in parallel.
Also, my app is completely relocatable and cross-platform. I can check it out in any directory on any Mac, BSD, or Linux box and get to work. I can then push my changes directly from that development area to the staging server if needed. I use CVS and Darcs but that's not important, except to note that the rsync script needs to skip those "CVS" or "_darcs" files.
Works great, very powerful. Of course I am leaving out details like choosing CVS tags, database schema migration, restarting/upgrading/installing daemons (hint, if you don't use daemontools, your apps will never be reliable), handling 3rd-party open source packages, pulling in changes that were made on the customer's machine (in an emergency for instance) etc., etc. But rsync is the core of it.
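For readers who want a starting point, a push script along these lines could be sketched as below. This is not the poster's script: the host names, paths and exclude patterns are placeholders, and it defaults to a dry run that only prints the commands. It assumes rsync and SSH key access are already set up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder values for illustration only.
EXCLUDES = ["CVS/", "_darcs/", "*.psd", "t/", "data/"]
HOSTS = ["staging1.example.com", "staging2.example.com"]


def rsync_command(src: str, host: str, dest: str) -> list:
    """Build the rsync invocation for one target host."""
    cmd = ["rsync", "-az", "-e", "ssh"]
    cmd += [f"--exclude={pattern}" for pattern in EXCLUDES]
    cmd += [src, f"{host}:{dest}"]
    return cmd


def push_all(src: str, dest: str, hosts: list, fan_out: int = 4, dry_run: bool = True) -> list:
    """Push to every host, at most `fan_out` at a time; return (host, exit code, command)."""
    def push(host):
        cmd = rsync_command(src, host, dest)
        if dry_run:
            return host, 0, " ".join(cmd)
        return host, subprocess.run(cmd).returncode, " ".join(cmd)

    with ThreadPoolExecutor(max_workers=fan_out) as pool:
        return list(pool.map(push, hosts))


for host, code, cmd in push_all("./app/", "/srv/app/", HOSTS):
    print(host, code, cmd)
```

Passing `dry_run=False` would actually invoke rsync; a non-zero exit code in the result marks a host that needs attention.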
I have been using CVS to manage many different websites and/or projects on various servers. It doesn't store more than it needs (just the CVS folders) and it adds, updates, patches and removes the files according to your repository.
Additionally you can use branches and sticky tags to keep track of files that don't need to be updated, or files that vary from client to client.
It is also easy to trigger an update over ssh or cron.
One downside compared to SVN is the lack of a binary diff mechanism, but I have been able to get by fine without it managing projects up to a GB in size.
Alex
Subversion needs a local working-copy, in other words: every file is present 2 times on the client.
My largest worry, from my calculations, is disk access on the Subversion server.
Put enough RAM in your server, and the changed portion will likely fit in cache. If that's not an option, use RAID to speed up disk accesses.
Others have mentioned rsync. You might also consider xdelta.
Let's see. You have a ceiling of 2.01GB worth of updates. You have disk I/O problems.
Your problem is either that you don't have enough RAM in the system, or you have an OS that doesn't do a rational job of caching disk.
Or both.
-Peter
rdiff-backup
First of all, it's obvious you are not using enough RAM on the servers. Get 8 GB. Don't do the balancing with Apache. If you are using Linux, resort to IPVS instead. For the large database files you'll want to use rsync. After the transfer, though, most likely you'll still need to perform the actual update. That's where cfengine comes in. You set it up to run rsync every N hours, then perform operations (restarting programs, cleaning up, whatever) when there's new data. You can also use it to restart dead instances of your application, etc.
First of all, thanks for so many replies!
:D
First I'd like to clarify a bit, as my original question was probably not clear enough!
The clients of the system are customers. They have Windows PCs as the software runs on windows. On the server side we need to be able to authenticate every client as there are several region and user level restrictions about who may access which file.
You can assume there are simply 5 to 10 user levels, where a user on level 10 may access everything and a user on level 5 only a subset.
So far SVN looks good:
* authentication via the Apache front end, probably via a LDAP server
* structuring the "download area" into directories with user level appropriated content
Regarding rsync:
* first of all, I did not know about it
* my first investigation indicates several drawbacks
It seems not to run on Windows (without Cygwin), users need to be unix/linux users on the server, building a distribution seems "more complicated" than making a tag/version with SVN.
Please consider: from the point of view of the service provider the system is just the same as hosting a huge pile of source code. The starting distribution probably has 3000 files and is about 2.5 GB big.
The users need to have the ability to fall back on a later revision in case of errors during distribution.
Users need to be able to upgrade to the latest HEAD (there is only one main trunk anyway).
Regarding performance of SVN, yes, we are clear that we need to put a lot of RAM into the servers. But it seems we can't get rid of the disk IO, as SVN does not cache requests (in this case all clients always want the same release to upgrade to, and most of the time they either have the previous or the second oldest release installed).
However: alternatives to SVN are very welcome! I only wanted to make clear why we considered SVN in the first place.
angel'o'sphere
A previous poster mentioned cfengine briefly. If I understand cfengine correctly, it may be just what you're looking for.
Also, if you're the sort who can/does go to conferences, the LISA '05 conference (Dec. 4-9 2005) features several sessions on cfengine by Mark Burgess. (LISA is the "Large Installation System Administration Conference", put on by USENIX and SAGE.) There's also a conference blog and a page with the tech program info.