Performance Tuning Subversion
BlueVoodoo writes "Subversion is one of the few version control systems that can store binary files using a delta algorithm. In this article, senior developer David Bell explains why Subversion's performance suffers when handling binaries and suggests several ways to work around the problem."
I know it can handle binaries, but I cannot think why I would want to. Can anyone help?
Have a look at soylentnews.org for a different view
Not quite - but thanks for your contribution to an intelligent discussion....
Subversion fails to follow symbolic links that point to code that other projects share for the sake of a minority that still develops using Windows (which doesn't have real symbolic links).
CVS (http://en.wikipedia.org/wiki/Concurrent_Versions_System) has proven itself to be superior and far more intuitive.
You have code that many projects share, like multi-platform-compatibility-layers? Just use symbolic links and CVS will follow them.
In SVN you have to create a repository for these shared source files and write config files by hand to make it include these files in your repository.
I hardly see SVN reach the point of flexibility CVS has. They support Windows (which doesn't have symbolic links) and give up usability.
Except for this difference, SVN and CVS are the same. There are marginal differences in features, but these affect no real-world use. So if you want a version control system where you don't need to write config files by hand, you choose CVS. If you want the latest hype, you choose SVN.
There wasn't really a need for SVN.
for me performance is (currently) the least of my problems with subversion.. more that you lose changes without any warning whatsoever during merging.. http://subversion.tigris.org/servlets/ReadMsg?listName=users&msgNo=65992 .. and no one seems to be too bothered..
(don't get me wrong, i love subversion .. and i use it for my open source projects.. but currently CVS is way better.. just because of the tools and a few less unnecessary annoyances)
Find me at http://herbert.poul.at
In short: Use git-svn
Long version: The fractional speedup described in the article is blown away by the several orders of magnitude you get by using git. Then there are all the other goodies, like real branches and merges, git-bisect, and visualization with gitk. Subversion is just for people who are forced to use it, or those not exploring all their options these days.
Plus if the master connection is set to compress data (-C), then you get transparent compression.
Now if only I could expand all this to fit 2 pages....Profit!!!
for the last time people, I am "frodo from middle eaRTH", not "middle eaST".
A really great way to optimize your SCM is to upgrade to git.
-- bartman
Well if you use a customized version of GForge or the Advanced Server edition linked to some content engine or even something more basic like NCftp at a professional level.
CVS and Subversion both work with GForge according to their web site.
Customized editions are available as far as I know. They have a roster of high end clientele, Cisco, MIT and others, so they must work pretty smart and well for people.
--
http://www.aisnota.com/slashdot/ Welcome to Logic and the Future
why use subversion only as import/export? That's the complaint here right? (the slow import/export speeds?) I thought the point in using revision control is to checkout then do commit/update commands???
It is still the wave of the future. I've worked in it extensively, and it is still the best version control system I've ever used. Because of its other strengths, it is continuing to expand its user base and gain popularity. You can tell this because Microsoft is now actively attempting to copy Subversion's concepts and ways of doing things. Ever used Team Foundation Server? It is just like Subversion, only buggier (and without a good way to roll back a changeset... you have to download and install Team Foundation Power Tools to do it). I'm a new employee at my company (which uses Microsoft technology), and yet I've been explaining how the TFS system works to seasoned .NET architecture veterans. The reason I can do this? I worked extensively with Subversion, read the Subversion book a few times (the O'Reilly book maintained by the Subversion team), and worked on a project for my previous company that basically had the goal of making versions of the TFS wizards for Subversion on the Eclipse platform. It only took me about one day of using TFS to be able to predict how it would respond, what its quirks would be, etc, because its technical underpinnings are just like Subversion. So even with performance issues, if even Microsoft is abandoning its years of efforts on SourceSafe and jumping all over this, you can know that its strengths still make it worth adopting over the other alternatives. After all, if Microsoft was going to dump SourceSafe, it had its pick of other systems to copy, as well as the option of trying to make something new. What did it pick? Subversion.
Beware of bugs in the above code; I have only proved it correct, not tried it.
My version control system is so fucking slow. It pisses me off to no end. I mean I'm all like trying to check stuff in and it takes forever. Thank god someone took the time to speed these bitches up.
I've been using Subversion for 2 years on game related projects. Most of our assets are binary (photoshop files, images, 3D models, etc), plus all the text based code. I love subversion. Best thing out there that doesn't cost $800/seat.
What I don't like about this article is that it implies I should have to restructure my development environment to deal with a flaw in my version control. The binary issue is huge with subversion, but most of the people working on subversion don't use binary storage as much as game projects. Subversion should have an option to store the head as a full file, not a delta, and this problem would be solved. True, it would slow down the commit time, but commits happen a lot less than updates (at least for us). Also the re-delta-ing of the head-1 revision could happen on the server in the background, keeping commits fast.
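For what it's worth, here is a minimal sketch of the layout suggested above, assuming nothing about Subversion's real storage format: the newest revision is kept whole, and each older revision is stored as a reverse delta against the revision after it. The names HeadFullStore, make_delta and apply_delta are made up for illustration, and Python's difflib.SequenceMatcher only stands in for a real binary delta algorithm (it is far too slow for large assets).

```python
from difflib import SequenceMatcher


def make_delta(source: bytes, target: bytes) -> list:
    """Return ops that rebuild `target` from `source`.

    Each op is ("copy", start, end) into `source`, or ("data", literal_bytes).
    """
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" or "insert": carry the new bytes literally.
            ops.append(("data", target[j1:j2]))
        # "delete" contributes nothing to the target, so no op is needed.
    return ops


def apply_delta(source: bytes, ops: list) -> bytes:
    """Rebuild the target from `source` and the ops made by make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += source[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class HeadFullStore:
    """Hypothetical store: head kept as a full file, history as reverse deltas."""

    def __init__(self):
        self.head = None
        # reverse_deltas[i] rebuilds revision i from revision i + 1.
        self.reverse_deltas = []

    def commit(self, content: bytes) -> int:
        """Store a new head; the old head is demoted to a reverse delta."""
        if self.head is not None:
            self.reverse_deltas.append(make_delta(content, self.head))
        self.head = content
        return len(self.reverse_deltas)  # revision number of the new head

    def checkout(self, rev=None) -> bytes:
        """Return revision `rev` (default: head), walking back from the head."""
        head_rev = len(self.reverse_deltas)
        if rev is None:
            rev = head_rev
        data = self.head
        for r in range(head_rev - 1, rev - 1, -1):
            data = apply_delta(data, self.reverse_deltas[r])
        return data
```

In this sketch a checkout of the head does no delta work at all, and each step further back in history applies one more delta. The cost moves to commit time, which is the trade-off described above.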
Okay, I know this is completely off-topic but I'd really like to get some responses or some discussion going on what makes version control suck.
I mean, is it just me or is revision control software incredibly difficult to use? To put this into context, I've developed software that builds websites with integrated shopping cart, dozens of business features, email integration, domain name integration, over 100,000 sites built with it, (blah blah blah) but I find revision control HARD.
It feels to me like there is a fundamentally easier way to do revision control. But, I haven't found it yet or know if it exists.
I guess for people coming from CVS, Subversion is easier. But with subversion, I just found it disgusting (and hard to manage) how it left all these invisible files all over my system and if I copied a directory, for example, there would be two copies linked to the same place in the repository. Also, some actions that I do directly to the files are very difficult to reconcile with the repository.
Since then, I've switched our development team to Perforce (which I like much better), but we still spend too much time on version control issues. With the number, speed of rollouts and need for easy accessibility to certain types of rollbacks (but not others), we are unusual. In fact, we ended up using a layout that hasn't been documented before but works well for us. That said, I still find version control hard.
Am I alone? Are there better solutions (open source or paid?) that you've found? I'd like to hear.
Sunny
Be my Friend
Based on the headline, I was expecting some great method for tuning Subversion for increased performance. This article was about performance tuning your process, not Subversion.
The reasons given here are valid and pretty obvious reasons why you'd want to store binaries in version control. But what is the big advantage of storing deltas of binaries, instead of complete files like CVS? Is it just disk space savings?
Disk space is stupidly cheap.
If you put the toolchain into CM, do you also put the operating system in? Just as the sourcecode is no good if you don't have the right toolchain to build it, the toolchain is no good if you don't have the right OS to run it.
I suspect the answer (if you really need it) is to save a 'Virtual PC' image of the machine that does the build each time you make an important baseline (or each time the build machine configuration changes). Since the image is likely to be in the GB size range, you might want to store it on a DVD rather than in your CM system.
Doing so means you have to unzip them to use them. Not very handy. Most users want to use Subversion the way they should be able to use version control- a checkout should give you all of the files you need to work with on a given project, with minimal need to move/install pieces after checkout. Implementing the 'best' suggested workaround would mean needing a script or other way to get the binaries unpacked. Programmers are often annoyed enough by the extra step of *using* version control, now you have to zip any binaries you commit to the repository?
I'm unimpressed by their performance testing methodology... they give shared server and desktop performance numbers, but have no idea what 'else' those machines were doing? Pointless. I'd like more details regarding what they're doing in their testing. Their tests were done with a "directory tree of binary files", but don't say what size or how many files?
My tests on our server show a 28MB binary checkout ( LAN, SPARC server, Pentium M client ) takes ~20 seconds. Export takes ~2sec. That must be a big set of files to cause a 9 minute *export*... several gigs, am I wrong? It'd be nice for them to say. Most of us, even in a worst case, won't have more than a few hundred MB in a single project.
The only *real* solution will be a Subversion configuration option which lets you say "please, use all my disk space, speed is all I care about when it comes to binary files". CollabNet is focused enough on getting big-business support contracts that it shouldn't be long before we see this issue addressed in one manner or another. You -know- they're reading this article!
In its simplest form... just keeping a history of changes, it really isn't that bad.
Where it becomes complicated is when you start talking about branching, merging, or trying to deal with dependencies across projects, etc.
But if done well, version control helps more than hurts.
If you actually care about your code and making proper releases, use Vesta. Transparent version control that even tracks changes between proper check-ins (real "sub" versions). Built-in build system that beats the pants off of Make. It even has dependency tracking to the point that you not only keep your code under version control, but the entire build system. That's right. You can actually go back and build release 21 with the tools used to build release 21. It's sort of like ClearCase but without all the headache. Did I mention it's open source?
The first time I used Vesta, it was a life-changing experience. It's nice to see something that isn't a rehash of the 1960s
.. that the article is glaringly missing *actual check-in times.* Or, where *actual check-in times* are available, the details of whether it's the same file as in previous tests are glaringly absent. This leaves open the question as to whether the data set they were working on was identical or whether it was different between the various tests.
The amount of CPU required for binary diff calculation is pretty significant. For an artistic team that generates large volumes of binary data (much of it in the form of mpeg streams, large lossy-compressed jpeg files, and so forth) it would be interesting to find out what kind of gains a binary diff would provide, if any.
Document storage would also be an interesting and fairer test. Isn't .ODF typically stored in compressed form? If not, then small changes wouldn't necessarily affect the entirety of the file (as it would in a gzip file if the change were at the beginning) and SVN might be able to store the data very efficiently. Uncompressed PDF would certainly benefit.
Questions that remain:
1. Does the algorithm simply "plainly store" previously-compressed files, and is this the reason why that is the most time-efficient?
2. What exactly was the data for the *actual check-in* times? (What took 28m? What took 13m?)
3. Given that speedier/efficient check-in requires a large tarball format, how are artists supposed to incorporate this into their standard workflow? (Sure, there's a script for check-in, but the article is absent any details about actually using or checking-out the files thus stored except to say it's an unresolved problem regarding browsing files so stored.)
Btw, CVS does do binary difference storage. Ever done a diff between two versions of .doc files?
SVN will never beat CVS on space-efficiency in the long run.
SVN does not have granular history obliterate whereas CVS/Perforce does, so CVS might be bigger initially, but you can always delete very old versions. These old binary versions are the ones you can rebuild from source or you really don't care anymore.
It exists forever in SVN.
Although Subversion does a great job of being a better CVS than CVS, yes, it is hard to use. Let me clarify: It is easy to use for a small project with just a few developers. But for large projects with many developers scattered all over, it, or any centralized revision control system becomes a nightmare (to me, anyway). The biggest problem I have with Subversion/CVS-type systems is that eventually managing the branches becomes a nightmare, and it becomes really easy to screw stuff up.
My work became a lot easier when I started using distributed revision control systems. My favorite is Mercurial, a very fast and lightweight system written in Python. The main reason that I like it is that it is by far the easiest to use revision control system that I have worked with. In addition to being fast, intuitive, supporting completely disconnected operation, and other great features, branching and merging is a breeze. And, most importantly, it makes it very easy for the developers on the large projects I work on to keep from stepping on each other's toes because everything is a branch. Whenever I checkout ("clone") the repository that we consider the "central" (or "trunk") repository, all of my commits happen in my local mirror of the repository, and when I am finished I "hg push" those changes back, merging them back into the "trunk". (My explanation may seem a little confusing, but the Mercurial development model is explained pretty well here.) The great thing about this model is that branching is the most natural thing in the world (in fact, everyone essentially always works on their own "branch") so it actually gets used. I have experienced too many cases with CVS or Subversion where something should have happened on its own branch but didn't because it was too confusing, too slow (with the bottleneck of the central server), etc.
Although Mercurial is still pretty young, it is mature enough that some very large projects (e.g., Mozilla) have moved to it. I urge everyone who is looking for a powerful, but intuitive and easy-to-use revision control system to take a look at it. I have used several revision control systems and Mercurial is the first one that really makes me feel more productive.
You don't like sex?
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
s/lose/have to check another file for/
Yes, I'm working on a 100k-line project (in CVS) that's undergone significant directory restructuring, and no, I've never found this to be a problem. If anything was to push me to Subversion, it'd be the fact that the CVS logs are split up among files in the first place, so I can't get a concise log of changes to the project as a whole (without maintaining a separate ChangeLog, and then why do VC in the first place?).
My main beef with Subversion, from what I've read of it so far (correct me if I'm wrong), is that it insists on using some form of database to store the project data, rather than using ordinary files as CVS does. This may improve the efficiency of accesses, but it also makes it harder to recover the data when catastrophic failure occurs. With CVS, even if part of the repository gets nuked, I can still recover anything that's left, at worst by just comparing the ,v file in the repository with my working copy; I'd be pretty nervous about using a VC system in which that sort of last-ditch fallback wasn't available.
Haven't used subversion yet, but have used Perforce, clearcase & CVS.
Frankly CVS just doesn't cut it for me. It lacks too many features.
1) Atomic checkins/submits
I am trying to submit changes in 5 files as a single bugfix.
A submit/checkin should either succeed for all 5 or fail for all 5.
CVS doesn't do this. The end result is that I may end up submitting
a change in the header without submitting a corresponding change in the
implementation file.
2) Changelists
After checking in multiple files together, at any point in time, I should
be able to find out all the changes that were checked in at the same time.
CVS has no way of doing this - Submitting 5 files together is the same as
submitting 5 files separately as far as CVS is concerned.
3) More Changelist features for non-submitted changes
Let us say I am working on 3 different bugfixes. Perforce allows me to
group together my changes in different changelists even before I
submit the changes. That is, I can create changelists A, B & C.
In changelist A - I have files a.c & a1.c changed, in changelist
B, I have b.c & b1.c changed & so on. So I decide I am done with
all the changes required in the subset A, I can submit it very easily
or undo all changes in changelist B.
4) Merges
Merges between branches are a breeze with Perforce. With CVS it's
a pain. Perforce stores a lot of information about merges which have
already happened, which is invaluable. In CVS, merges between branches
are very little more than changes manually copied from one branch to
another.
I can do a lot of stuff which I can't do with CVS
- I can very trivially merge Bugfix 1111 (comprising 5 files
checked into changelist XXXX) from a branch to another branch or
the main trunk.
- Because Perforce stores information about merges, I can do periodic
single command merges very easily between a branch & the trunk - perforce
will not try to merge in changes which have already been merged the last
time I did a merge.
I could go on & on, but the point is that Perforce makes
a developer's life so much easier. I could work around all these
things in CVS (i.e. do it in multiple steps) but the ease is something
worth paying for I think.
I haven't used subversion, so I can't comment on it.
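The merge bookkeeping described in point 4 can be sketched roughly as follows. This is only an illustration of the idea, not how Perforce actually implements it: the class MergeTracker, its methods, and the change numbers are all hypothetical.

```python
class MergeTracker:
    """Remembers which changes from a source branch were already merged into a target."""

    def __init__(self):
        # (source, target) -> set of change numbers already merged
        self._merged = {}

    def pending(self, source, target, source_changes):
        """Return the changes on `source` that still need merging into `target`."""
        done = self._merged.get((source, target), set())
        return [change for change in source_changes if change not in done]

    def record(self, source, target, changes):
        """Note that `changes` have now been merged from `source` into `target`."""
        self._merged.setdefault((source, target), set()).update(changes)


tracker = MergeTracker()

# First periodic merge: everything on the branch is new to the trunk.
first = tracker.pending("branch", "trunk", [101, 102, 103])
tracker.record("branch", "trunk", first)

# Later merge: only the change made since the last merge is picked up.
second = tracker.pending("branch", "trunk", [101, 102, 103, 104])
```

Without a record like this (the CVS situation described above), the person doing the merge has to remember by hand which changes were already copied across.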
SVN would be great if it had merge tracking (and true renames.) As much as I like SVN, the merge issues are a deal breaker:
- No merge tracking. You have to manually record merge information in the checkin comment, which is inherently error prone. If merge tracking isn't done or is done incorrectly (e.g. merge -r 100:HEAD) there's no way to recover except to redo the merge with extra double checking.
- The svnmerge.py merge tracking script only considers the current directory. It doesn't do any recursive analysis so you want to do all your merges at the project's root dir to be accurate.
- Lack of true renames. When you rename or move a file, it does a delete + add, which leaves you open to missed merges. Ex: Branch. Rename branch/a.java to branch/b.java. Make an enhancement change to branch/b.java. Make a bug fix in trunk/a.java. Merge branch to trunk. SVN will delete a.java (which has the bug fix) and add b.java. Congrats, you just lost the bug fix change. SVN should have merged b.java with a.java.
- Bi-directional merges. When you merge between branches multiple times, any merge conflicts resolved in previous merges get re-flagged as conflicts, thus giving you an ever increasing number of spurious merge conflicts that hide the real, new merge conflicts. The workaround is to skip merge revisions, which has the drawbacks of requiring multiple merges, and any extraneous changes made during a merge (such as a quick and simple bug fix) are not merged.
- Serious training to understand merges. You basically need a merge-meister or two who understands the implications and pitfalls of SVN's merging and merge-tracking.
- There's also no way to 'lock down' merges via hooks/triggers. (Such as requiring svnmerge.py to be used for all merges.)
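The rename problem in the list above can be shown with a toy model. This is a sketch under heavy assumptions, not Subversion's merge code: trees are plain dicts of path to lines, the function apply_ops is made up, and a rename is modelled exactly as the comment describes it, as a delete followed by an add.

```python
def apply_ops(tree, ops):
    """Apply ("delete", path, None) and ("add", path, lines) ops to a copy of `tree`."""
    result = dict(tree)
    for op, path, lines in ops:
        if op == "delete":
            result.pop(path, None)
        elif op == "add":
            result[path] = lines
    return result


# Trunk after the bug fix was made in a.java.
trunk = {"a.java": ["line1", "line2 // bug fix"]}

# What the branch did, as seen by the merge: a.java was deleted and b.java
# was added, carrying the enhancement but not the trunk's bug fix.
branch_ops = [
    ("delete", "a.java", None),
    ("add", "b.java", ["line1", "line2", "enhancement"]),
]

merged = apply_ops(trunk, branch_ops)
lost_fix = all("bug fix" not in line for lines in merged.values() for line in lines)
```

After the merge, a.java is gone and b.java never saw the fix, so lost_fix is true. A merge that understood the rename would instead combine the trunk's edit to a.java with the branch's b.java.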
Once merge tracking is added to SVN (maybe in 1.5?) it would be great. Until then, I wouldn't use it except on small teams using Agile and few, short-lived branches in order to minimize the merge issues.
Neither the article nor the replies tell me anything useful. .tar.gz files are small, meaning they are fast to move through a network, but do they diff well? Good compression algorithms turn data into statistically random streams of bits, so I suspect that different generations of uncompressed .tar files would have smaller deltas than the compressed versions. Similar questions abound for GIF and JPEG files.
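One way to check the question about compressed files for yourself is a small experiment like the one below. It is only a sketch: the sample data from make_sample is synthetic, the helper names are made up, and counting differing byte positions is a crude stand-in for a real delta size. No particular figure is claimed for the compressed case, since it depends on the data and the zlib build.

```python
import random
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions at which a and b differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def make_sample(seed: int = 0, words: int = 5000) -> bytes:
    """Build repetitive, compressible sample data (a stand-in for a real file)."""
    rng = random.Random(seed)
    vocab = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
    return b" ".join(rng.choice(vocab) for _ in range(words))


old = make_sample()
new = b"X" + old[1:]  # same length, one byte changed at the very start

old_packed = zlib.compress(old)
new_packed = zlib.compress(new)

raw_diff = differing_bytes(old, new)
packed_diff = differing_bytes(old_packed, new_packed)

print(f"uncompressed: {raw_diff} of {len(old)} bytes differ")
print(f"compressed:   {packed_diff} of {len(new_packed)} bytes differ")
```

The uncompressed pair differs in exactly one byte by construction. If the suspicion above is right, the compressed pair will differ in a much larger share of its bytes, which is what would make the compressed versions delta poorly.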
Nothing for 6-digit uids?
Sorry, but this whole paper points at i/o-bound problems and they didn't even look for what with a 99% chance is the root issue. (see numbers for dedicated workstation)
just painful.
Its delta-compression algorithm doesn't treat line breaks any different from other bytes, so if the files contain large chunks of duplicate data, they can be delta-compressed.
Of course, data compression masks similarities, so pre-compressed objects defeat this, but it still manages to work on a lot of things.
Anyway, just FYI.
License Perforce.