Microsoft Introduces GVFS (Git Virtual File System) (microsoft.com)
Saeed Noursalehi, principal program manager at Microsoft, writes in a blog post: We've been working hard on a solution that allows the Git client to scale to repos of any size. Today, we're introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened. GVFS also actively manages how much of the repo Git has to consider in operations like checkout and status, since any file that has not been hydrated can be safely ignored. And because we do this all at the file system level, your IDEs and build tools don't need to change at all! In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files. With GVFS, this means that they now have a Git experience that is much more manageable: clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes. And we're working on making those numbers even better.
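To make the "download on first open" idea concrete, here is a toy Python model. It is not GVFS's implementation; the `LazyTree` class, its method names, and the `fetch` callback are all hypothetical names made up for illustration. It only shows the two behaviours the summary describes: every path is visible up front, and only files that have been opened (hydrated) need to be considered afterwards.

```python
from typing import Callable, Dict, List


class LazyTree:
    """Toy model (hypothetical): all paths are listed, content is fetched on first open."""

    def __init__(self, manifest: List[str], fetch: Callable[[str], bytes]):
        self._manifest = list(manifest)         # every path in the repo, known up front
        self._fetch = fetch                     # stand-in for a download from the server
        self._hydrated: Dict[str, bytes] = {}   # files that have actually been opened

    def listdir(self) -> List[str]:
        # The whole tree appears to be present, even though nothing is downloaded yet.
        return list(self._manifest)

    def open(self, path: str) -> bytes:
        if path not in self._hydrated:
            if path not in self._manifest:
                raise FileNotFoundError(path)
            # First access: download the content and remember it.
            self._hydrated[path] = self._fetch(path)
        return self._hydrated[path]

    def paths_to_scan(self) -> List[str]:
        # A status-like operation can skip anything that was never hydrated.
        return sorted(self._hydrated)
```

In this sketch, a tree listing three million paths costs nothing until a file is opened, and `paths_to_scan()` grows only with the files a developer actually touches.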
Microsoft's repos *are* that large. That's why they implemented this.
Microsoft Office's repository is over 1 TB in size. Yes, terabyte. For *office*. They absolutely cannot (could not, I suppose now) use Git on it.
> Why take away git's biggest advantage?
Because "clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes."
That problem is not unique to Git. Jörg Sonnenberger tried importing the NetBSD repository into Fossil, and "the rebuild step which (re)creates the internal meta data cache took 10h on a fast machine." There are ways to make Fossil skip the rebuild on clone, which results in a suboptimal DB, but it still takes hours to clone. NetBSD's project history goes back something like a quarter century; it's going to take time to pull and organize all that.
DVCSes are great when you can afford their associated costs (namely, the very advantages you refer to), but for very large repos, those costs can be very high.
Do you really need every single version going back a quarter century? And if you do, do you need it 5 minutes after the initial clone?
One idea that's come up on the Fossil mailing list is to do a shallow clone initially, then trickle the back history in over time. I'd like a DVCS that gave me the past 30 days of history at the tip of every open branch, then over the next day or so back-filled the rest.
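For comparison, Git (not Fossil) can approximate this by hand with its shallow-clone options. The sketch below only builds the command lines and does not run them. The helper names, the step counts, and the idea of running the back-fill steps from a scheduler are my assumptions; the flags themselves (`--shallow-since`, `--no-single-branch`, `--deepen`, `--unshallow`) are standard `git clone` / `git fetch` options in recent Git versions.

```python
from datetime import date, timedelta
from typing import List


def shallow_clone_cmd(url: str, dest: str, today: date, days: int = 30) -> List[str]:
    """Build a clone command that fetches only the last `days` days, on all branches."""
    since = (today - timedelta(days=days)).isoformat()
    return [
        "git", "clone",
        f"--shallow-since={since}",   # cut history off at this date
        "--no-single-branch",         # shallow clones default to one branch; ask for all
        url, dest,
    ]


def backfill_cmds(steps: int, commits_per_step: int) -> List[List[str]]:
    """Build fetch commands that deepen history in stages, then remove the shallow cutoff."""
    cmds = [["git", "fetch", f"--deepen={commits_per_step}"] for _ in range(steps)]
    cmds.append(["git", "fetch", "--unshallow"])  # final step: pull whatever history is left
    return cmds
```

Each list could be passed to `subprocess.run(cmd, cwd=dest, check=True)` from a cron job or background task. Git does not schedule the trickle for you, so this is a manual approximation of the idea rather than the built-in behaviour being wished for.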