Linus on GIT and SCM
An anonymous reader sends us to a blog posting (with the YouTube video embedded) about Linus Torvalds' talk at Google a few weeks back. Linus talked about developing GIT, the source control system used by the Linux kernel developers, and exhibited his characteristic strong opinions on subjects around SCM, by which he means "Source Code Management." SCM is a subject that coders are either passionate about or bored by. Linus appears to be in the former camp. Here is his take on Subversion: "Subversion has been the most pointless project ever started... Subversion used to say, 'CVS done right.' With that slogan there is nowhere you can go. There is no way to do CVS right."
Well Linus didn't have anything bad to say about MS Source Safe. . .
;-)
[ducking] Sorry, I couldn't resist the urge.
He is only human. Just because he is the head of a huge software project doesn't make him infallible.
Just look at the whole 'RMS vs Linus' thing.
His opinions should carry some weight, especially since he should know more than anyone what the limitations of SCM software is when it comes to larger projects like the linux kernel. But a lot of SCM comes down to the way a project is managed, the preferences of the people involved, and how they deal with their project. I doubt there is a blanket solution... a 'one SCM package to rule them all' so to speak.
Especially in the software industry you can always find someone just as good as yourself that strongly holds opinions that are the polar opposite of yours.
Linus isn't saying that CVS and Subversion have fixable bugs or missing features. It's not about the code.
He is saying that they solve the wrong problem. The Subversion team wants to solve Problem A, and Linus wants to solve Problem B. No amount of code will turn the solution to Problem A into a solution for Problem B. Bothering the Subversion team with code addressing Problem B will only irritate them, since they're working on Problem A.
The right way to handle differing goals is to start a different project. That's what he did.
Don't be confused by the labels. Source Code Management means different things to different people, and there isn't always much overlap in how each person defines it. Ships and airplanes are both 'vehicles', but that doesn't mean that a few changes will turn one to the other.
I took a look at git a while ago and was completely underwhelmed. The UI was so bad it was useless, and it didn't "seem" to do anything that Darcs didn't do. (I used to love Darcs because of the automatic patch dependency computations).
.git dir and shell scripts that combine very simple low-level functions. For instance, you can create a branch just by saving the SHA1 ID of the tip into a file in .git. You can branch off any point in the history this way, including branches you've deleted in the past (git keeps all the old commit objects by default, even ones that aren't pointed to by any branch or tag.. this is very simple and understandable model, like reference-counting in a way).
Now that all the "next generation" SCM tools have matured somewhat, I took a look at all of them again. I had to stop using Darcs because of the "patch of death" problem, which basically is this: after using Darcs on a project with long-lived parallel branches, the repository may eventually enter a wedged state you can't get out of, due to exponentially complex patch dependencies. Oops.
At this point I had an idea of what an SCM should do, how it should work, what the "mental model" should be. I want to create changesets, add them to branches, combine multiple branches (and keep track of renames and so forth between branches), re-order changesets, collapse multiple changesets into one, discard old branches, etc.
Of course, CVS and close cousin Subversion are SO UTTERLY USELESS I didn't even consider them. Seriously, Subversion is like gold-plated shit. Looks nice but it's still shit. Reading people say stuff like "Subversion is awesome" makes me wince. How can something that doesn't have "real" branches, and doesn't have tags OF ANY KIND, be useful for anything? How do you keep track of multiple merges between branches? Answer: you don't. Or you keep track of revision numbers using svnmerge and pray it all works. Even the Subversion docs sortof hand-wave this away. I.e., they hand-wave away one of the FUNDAMENTAL ASPECTS of source code management: branching and merging. It's like hearing people talk about OO databases. They mean well but they just don't comprehend the generality of the underlying problem.
That's why I was so excited about Darcs: the author "gets it". Unfortunately the implementation is flawed.
I checked out a few more (Mercurial, bzr) but finally settled on git because it let me do all the things I needed to do, and it did them FAST. Once I figured out the underlying model I was pretty impressed. Git can be viewed at many levels: very low-level plumbing, or UI-level, or in between. The UI and documentation is still pretty shitty, but thankfully they are working on improving it and are moving away from the idea of having interchangeable UIs. Just focus on improving "core git".
One great thing about git is that so much of it is just files in the
The other great thing about git is how easy it is to sling changes around and reorder them and combine them. For instance let's say you add a file to your project as commit "A". Then you add some code that uses this file as commit "B". Then you fix a bug in the file as commit "C". So you have A-B-C. Now you'd like to combine A and C into a single patch A', and put B on top of it, like this: A'-B. In git, this is super-easy. I can think of two ways to do it off the top of my head.
I was checking into a CVS project the other day (for a client) and wanted to do this. Then I realized, you can't move things around in CVS like this *twitch*. So nowdays I do everything in git and only after the changes are beautiful and self-contained and well-commented do I check them into CVS one at a time.
Okay so they point is, check out git (or honestly? Checkout out ANYTHING that isn't CVS or svn). Even if you think Linus is an asshole (which he is) or you don't like the git UI (it's not that bad now), check it out anyway.
And if you don't use SCM at all? You suck. Start learning. It's a best practice that you can't live without, once you start.
The ultimate reason why Linus dislikes SVN, CVS, etc. is that it is centralized. Everyone checks out source from a central server and commits their changes to the same centralized area. This has problems: your workspace is not versioned. By this I mean, you cannot track local changes to your workspace without committing them to the central server.
A common pattern in development is to try one approach, test it, tweak it, and possibly try another approach if the first did not work out, perhaps reverting to a prior approach. With decentralized version control, you can commit your changes to a local repository and work from there. All the locally changes you make are versioned, and be committed, checked out, examined all without contacting a central repository. This is ideal, because you often want to try various options to find the one that works best, before pushing your changes to the rest of the world. In centralized version control, you can use a branch for this purpose, but often branches in these systems are difficult to either create, merge, or maintain, so they are rarely used. The end result is that with centralized version control, developers version their workspace in their head. DVCS systems remove the mental burden.
Fortunately, FOSS developers are realizing the usefulness of DVCS and major projects are converting to some form of DVCS. Mozilla is switching to Mercurial. The Pidgin project, which just released 2.0.1, is using Monotone. (Linus favorably mentioned both of these distributed version control systems in his Git talk, as they are both are distributed).
Once you accept that DVCS is better than the centralized model (which may not be true for some situations), only a few (but growing number of) version control systems are viable. This is currently a hot area in open source development, with software such as GNU Arch, Monotone, Mercurial, Git, Darcs, Bazaar, and more paving the way. Many open source DVCS's are still in development and not ready for general usage. I can't speak for Mercurial, but Monotone doesn't have the greatest performance, instead preferring integrity over speed. This led Linus to write git, since speed is very crucial for a large project like the Linux kernel.
Whatever the actual program (git, Mercuial, or Monotone), more and more open source developers are realizing the advantages that distributed version control can offer. I encourage all developers that haven't used any DVCS to try it -- once you do, you won't go back.
Tired of free ipod spam sigs? Opt ou
When you're starting out, just remember "git commit -a" and you'll be fine. Also check out "git reflog" to see the linear history of your repo. The pulling/pushing stuff can get a lot more complex but it's damn powerful. If you can figure out Arch (yeesh) you can figure out git!
SLASHDOT SEZ: you have too few characters per line. Okay, slashdot, here's part of the man page for git-rebase:
If is specified, git-rebase will perform an automatic git checkout before doing anything else. Otherwise it remains on the current branch. All changes made by commits in the current branch but that are not in are saved to a temporary area. This is the same set of commits that would be shown by git log
Well, probably those two go together: being an amazing creator, and being an amazing ass with huge ego. Who knows.
I disagree entirely that those two traits must go together. I'm living proof that you're wrong, in fact. I don't have a creative bone in my body.
"Gratuitous complexity is akin to chaos" - True Vox
Most distributed version control systems exhibit this phenomena, because by "checking out" you are actually doing two operations: pulling the latest changes from someone else, and updating your workspace. For example, in Monotone you would type (I imagine git operates similarly):
The first command retrieves revisions from the server, and the second updates your workspace with those new changes. To "commit" a change, in a distributed version control system you first 1) commit the change to your local repository and then 2) push it to someone else:
It is often useful to keep these operations separate. For example, you can commit without pushing. Make a bunch of changes, commit each one separately, and only push once you're satisfied with the result. Other developers can still see each change you made individually, but only after you've pushed, so they won't be stuck with an incomplete in-progress version of the tree.
Similarly, by being able to update without pulling, you can revert to any revision you would like without contacting the network. Likewise, since commit does not require network access, it is no extra effort to work offline. Once an Internet connection is available, you can synchronize your repositories, but in the meantime you can make any change you want - even with no network connection.
The main disadvantage of a decentralized version control system is that it requires workflow changes to get the most out of it. If you are only familiar with centralized version control systems, it will take some time getting used to. But I'm glad to say, an increasing number of projects are making the change to distributed version control, among them, Mozilla and Pidgin. They are not using Git (but Mercurial and Monotone, respectively) but they're all distributed. Git is being used by the Beryl project, among others. Subversion has momentum in FOSS because it is familiar for those used to centralized version control (everyone knows CVS), and SourceForge provides free SVN hosting. Once a free open source hosting site provides hosting for a distributed version control system, I expect more low-resource open source projects to use it.
Tired of free ipod spam sigs? Opt ou
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
Why? It doesn't have to be. At least if you use something that isn't horribly broken.
Yes, they will. Because this is a monumentally stupid idea. Because the entire *purpose* of revision control systems (note: "CVS" stands for "Concurrent Versioning System") is to make it possible for developers to work on things at the same time. The idea is that you can get more benefit from the concurreny than you get difficulties from merging.
Rules like "merge early, merge often", perhaps? Fixes the problem, and *doesn't* cripple development horribly like your idea would.
Distributed version control the way git does it (conceptually, not necessarily the implementation) is the best idea in SCM since concurrent development and optimistic merge conflict resolution on check-in.
Notice how, even years after better ideas superceded the lock-modify-unlock paradigm, many tools and shops still use exclusive-lock SCM.
It could be quite a while before you see anything like the way git does SCM in use in the majority of programming shops.
The advantage is that MergePrivileges can be fine-grained: there can be many answers for "merge into what?" There's a -mm tree, a -stable tree, a -linus tree, a -rt tree, and a lot of vendor and distro trees. Each of these has a different maintainer, and can have a different idea of what is acceptable. And only the maintainer can merge things into their tree, and they can decide based on a variety of features of the things they're considering. For example, Linus only merges from a few people directly: maintainers of various subsystems. And he doesn't even trust them completely; if the SD/MMC maintainer has a change which changes x86 architecture code in the tree Linus is asked to merge, he'll notice and ask what's up with that. And if there are changes that look too intrusive for the current point in the development cycle, he'll put it off until the next cycle, and ask for a tree with just fixes. And -linus isn't special, except that almost everybody trusts him implicitly and merges his stuff into their trees (the main exception being -stable, which is why a new 2.6.20.x kernel isn't derived from 2.6.21; and vendor and distro kernels are generally based on -stable of some sort, and only get new stuff from Linus when they go to a new series). Also, maintainers of subsystems know the people who work in their areas, and can apply the same sorts of rules: the guy from Intel who works on their network drivers can get e100 changes into the the -netdev tree, because the maintainer knows they know what they're doing for e100 changes. And Linus sees that the e100 changes are coming in through -netdev, and the network maintainer knows what policy to apply to the drivers around there, so they're fine, even if Linus has no clue who should be allowed to do what in e100.
It's not that the politics go away. It's that the policy is no longer a binary "yes or no" decision, so the technical arrangement mirrors the social arrangement. This doesn't work with CommitAccess because people wouldn't commit the same change everywhere they should, and they couldn't be restricted to only making changes they're trusted to make (there are people who are trusted to correct spelling in comments in any file in the tree, and Linus can look through the total changes they send and verify that they only change spelling in comments).