Slashdot Mirror


Microsoft Introduces GVFS (Git Virtual File System) (microsoft.com)

Saeed Noursalehi, principal program manager at Microsoft, writes in a blog post: We've been working hard on a solution that allows the Git client to scale to repos of any size. Today, we're introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened. GVFS also actively manages how much of the repo Git has to consider in operations like checkout and status, since any file that has not been hydrated can be safely ignored. And because we do this all at the file system level, your IDEs and build tools don't need to change at all! In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files. With GVFS, this means that they now have a Git experience that is much more manageable: clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes. And we're working on making those numbers even better.
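
The "download on first open" idea in the summary can be illustrated with a toy model. This is only a sketch of lazy hydration in general, not GVFS's actual design: the class name `LazyWorkingTree`, the `fetch` callback, and the file names are all hypothetical.

```python
from typing import Callable, Dict, List


class LazyWorkingTree:
    """Toy model of a virtualized working tree (hypothetical, not GVFS's real design)."""

    def __init__(self, paths: List[str], fetch: Callable[[str], bytes]) -> None:
        self._paths = set(paths)               # every path is known up front
        self._fetch = fetch                    # assumed download callback
        self._hydrated: Dict[str, bytes] = {}  # contents fetched so far

    def listdir(self) -> List[str]:
        # All files appear to be present, whether hydrated or not.
        return sorted(self._paths)

    def open(self, path: str) -> bytes:
        if path not in self._paths:
            raise FileNotFoundError(path)
        if path not in self._hydrated:
            # First open: download the contents and remember them.
            self._hydrated[path] = self._fetch(path)
        return self._hydrated[path]

    def status_candidates(self) -> List[str]:
        # Files that were never hydrated cannot have local edits,
        # so a status-like scan only needs to look at these.
        return sorted(self._hydrated)


downloads: List[str] = []


def fake_fetch(path: str) -> bytes:
    """Stand-in for a network download; records which paths were requested."""
    downloads.append(path)
    return f"contents of {path}".encode()


tree = LazyWorkingTree(["a.c", "b.c", "c.c"], fake_fetch)
tree.open("b.c")
tree.open("b.c")  # second open is served from the local copy
print(tree.listdir())
print(downloads)
print(tree.status_candidates())
```

In this model the directory listing shows three files, only one download happens, and a status scan has one file to consider instead of three.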

32 of 213 comments

  1. Meh... by the_skywise · · Score: 3, Insightful

    There aren't THAT many repos with over 3 million files in them.

    The great majority of projects I've been on have been around the 100k-300k range and doing a build (to properly test the product) required ALL of them.

    And even then, once you've got all of them the first time, GIT does the diffing automatically so it "scales" already.

    Maybe MS could put some of their vast R&D efforts to something more useful... like having their free Visual Studio Code editor handle files bigger than 1 GB?

    1. Re:Meh... by Transcendent · · Score: 5, Interesting

      Microsoft's repos *are* that large. That's why they implemented this.

      Microsoft Office's repository is over 1 TB in size. Yes, terabyte. For *office*. They absolutely cannot (could not, I suppose now) use Git on it.

    2. Re:Meh... by serviscope_minor · · Score: 2, Informative

      And if you have a million?

      --
      SJW n. One who posts facts.
    3. Re:Meh... by caseih · · Score: 2

      The link is apparently slashdotted so I can't view it, but I think you misread it. The ACM link apparently says there is a billion *lines of code* not a billion files in one repo. Big difference! The OP would appear to be right.

    4. Re:Meh... by Anonymous Coward · · Score: 5, Funny

      all right, you've clearly nominated yourself to untangling a 1TB repository. get on it bud.

    5. Re:Meh... by cdrudge · · Score: 2

      The ACM article headline is correct. The post that mentions billions is correct. You just missed it in the article.

      Fourth paragraph (emphasis added):

      The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86TB of data, including approximately two billion lines of code in nine million unique source files.

    6. Re:Meh... by tepples · · Score: 2

      But if multiple applications in Office share a library, where do you put that library so that the build process for each Office application can see it? Are submodules or subtrees a good choice, and if "yes," which is more appropriate?

    7. Re:Meh... by Penguinisto · · Score: 2

      They likely store their comments as separate files - one per comment.

      (no, really... has no one in Redmond ever heard of making their shit modular?)

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
  2. Did they just turn git into svn? by lucasnate1 · · Score: 5, Insightful

    The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

    1. Re:Did they just turn git into svn? by Aaron+B+Lingwood · · Score: 2

      Microsoft are just getting efficient. They have simply skipped "Embrace".

      --
      [Rent This Space]
    2. Re:Did they just turn git into svn? by thegarbz · · Score: 4, Insightful

      The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

      Because its biggest advantage is also one of its greatest inefficiencies, and frankly on a large project chances are you may not need it all. The whole point is you have an identical copy on your machine of what you're working on.

    3. Re:Did they just turn git into svn? by Anonymous Coward · · Score: 3, Insightful

      The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

      "A Clone now takes only minutes instead of 12+ Hours!"
      Ja, that's because you're NOT making a copy.

    4. Re:Did they just turn git into svn? by Anonymous Coward · · Score: 3, Insightful

      No, the whole point of git is that every file version is immutable and referenced by a globally unique hash. This means that it doesn't matter where the actual data is located - until you need the actual data for some actual reason. This model has been copied by countless systems since git, because it is extremely robust and has multiple benefits, and none of those other systems expect the local user to download the entire database before he even begins work. Nonetheless, such systems can also support downloading the entire database, so I'm puzzled as to why you think this work on git object caching "takes away" a feature which it quite clearly still in fact supports.
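
      A minimal sketch of the "referenced by a globally unique hash" point, assuming Git's classic SHA-1 object format (repositories configured for SHA-256 differ). The function name `git_blob_id` is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a file's contents (SHA-1 object format).

    Git hashes a small header ("blob <size>" followed by a NUL byte) plus the raw
    bytes, so the ID depends only on the content, not on where it is stored.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# The empty file has the same ID in every repository on every machine.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

      Because the ID is derived purely from the bytes, a client can hold just the ID and fetch the bytes later from any server or cache, then verify them by rehashing.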

      The important thing here is the new use cases which the new caching strategy enables.

    5. Re: Did they just turn git into svn? by tangent · · Score: 5, Interesting

      > Why take away git's biggest advantage?

      Because "clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes."

      That problem is not unique to Git. Jörg Sonnenberger tried importing the NetBSD repository into Fossil, and "the rebuild step which (re)creates the internal meta data cache took 10h on a fast machine." There are ways to make Fossil skip the rebuild on clone, which results in a suboptimal DB, but it still takes hours to clone. NetBSD's project history goes back something like a quarter century; it's going to take time to pull and organize all that.

      DVCSes are great when you can afford their associated costs - namely, the very advantages you refer to - but for very large repos, those costs can be very high.

      Do you really need every single version going back a quarter century? And if you do, do you need it 5 minutes after the initial clone?

      One idea that's come up on the Fossil mailing list is to do a shallow clone initially, then trickle the back history in over time. I'd like a DVCS that gave me the past 30 days of history at the tip of every open branch, then over the next day or so back-filled the rest.
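
      Git itself can approximate the "recent history first, back-fill later" idea by hand with shallow clones. The sketch below only builds the command lines and prints them; the URL, directory name, cutoff date, and helper names are assumptions, and nothing here back-fills automatically in the background.

```python
import subprocess
from typing import List, Optional


def shallow_clone_cmd(url: str, dest: str, since: str) -> List[str]:
    # --shallow-since limits history to commits after the given date;
    # --no-single-branch asks for the tips of all branches, not just the default one.
    return ["git", "clone", f"--shallow-since={since}", "--no-single-branch", url, dest]


def deepen_cmd(extra_commits: int) -> List[str]:
    # Extend the current shallow boundary by a number of commits.
    return ["git", "fetch", f"--deepen={extra_commits}"]


def unshallow_cmd() -> List[str]:
    # Fetch all remaining history, turning the clone into a complete one.
    return ["git", "fetch", "--unshallow"]


def run(cmd: List[str], cwd: Optional[str] = None, dry_run: bool = True) -> None:
    """Print the command; only execute it when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, cwd=cwd, check=True)


# Hypothetical repository and cutoff date, shown as a dry run only.
run(shallow_clone_cmd("https://example.com/repo.git", "repo", "2017-01-04"))
run(deepen_cmd(1000), cwd="repo")
run(unshallow_cmd(), cwd="repo")
```

      A scheduler or idle-time job could call the deepen step repeatedly to trickle history in, which is the part a DVCS would have to do on its own to match the wish above.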

    6. Re:Did they just turn git into svn? by Daltorak · · Score: 2

      Microsoft are just getting efficient. They have simply skipped "Embrace".

      No they didn't. For one thing, Git has been supported in TFS for four years now. And then there's this:

      "Among them, we learned the Git server has to be smart. It has to pack the Git files in an optimal fashion so that it doesn’t have to send more to the client than absolutely necessary – think of it as optimizing locality of reference. So we made lots of enhancements to the Team Services/TFS Git server. We also discovered that Git has lots of scenarios where it touches stuff it really doesn’t need to. This never really mattered before because it was all local and used for modestly sized repos so it was fast – but when touching it means downloading it from the server or scanning 6,000,000 files, uh oh. So we’ve been investing heavily in is performance optimizations to Git. Many of them also benefit “normal” repos to some degree but they are critical for mega repos. We’ve been submitting many of these improvements to the Git OSS project and have enjoyed a good working relationship with them."

      https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-git-and-some-back-story/

  3. Ah nostalgia by DrXym · · Score: 2

    I had to use Clearcase as my source control system for one company I worked for. The idea was you set up a view spec (a bit like a branch), mapped a drive letter to it and you never had to pull again because it would always reflect that branch. Your local changes went over the top and when it was time to commit you could merge up and commit. In practice what it meant was the source code was constantly changing under your feet, and binaries were constantly stale or in a mystery state because you didn't know what they were compiled against. And because this was IBM software it was unusably slow across WANs, memory hungry and enjoyed triggering random blue screens.

    While a vfs sounds like a great idea, I think in theory it's only of use for very, very large repos. Even then I wonder if the exact same issues that made Clearcase suck would make it suck even with Git.

    1. Re:Ah nostalgia by Anonymous Coward · · Score: 5, Informative

      In practice what it meant was the source code was constantly changing under your feet, and binaries were constantly stale or in a mystery state because you didn't know what they were compiled against.

      Then you had a piss-poor release engineer who didn't understand how to construct config specs based on a stable baseline, label & promote stable builds regularly, and use clearmake properly, or manage dependencies and allow you to do a clean, fast local build.

      I love git, and I work with it daily, and the monorepo craze baffles the shit out of me, to be honest. But I used and supported ClearCase for 14 years at a large financial services company, and I can assure you that the problems you're complaining about are not limitations of the tool - they are limitations of your team's release engineers. ClearCase has many failings, but the issues you're describing simply reflect poor implementation and design choices.

      And because this was IBM software it was unusably slow across WANs, memory hungry and enjoyed triggering random blue screens.

      It stemmed from fundamental concepts cribbed from Apollo's DSEE environment. HP's acquisition of Apollo prompted what would then become the ClearCase team to leave Apollo/HP and form Pure, then they combined with Atria to form PureAtria, then Rational acquired PureAtria, and then IBM acquired Rational -- so ClearCase was a thing long before it was IBM software, and the features you're griping about were extant long before the IBM acquisition. The IBM era mostly saw them continue to focus on jamming ClearCase into their "Application Lifecycle Management" toolset, Rational Team Concert, wrapping everything in a ghastly blue Eclipse RCP client, and making it more of a pain in the ass to use.

      Dynamic views as you're talking about were not - and never were - intended for use across WANs, their Admin & Deploy guides specifically stated that it required a fast connection to a local server. If you wanted WAN connectivity, you either used RTC (Rational Team Client) to pull web views, or you used snapshot views, or you ponied up for MultiSite licenses and set up a sync scheme so that each site could have local copy on a VOB & View server they had a fast connection to.

      Again - poor implementation by your release team. It's like complaining that a hammer makes a giant hole in the drywall when you put screws in with it - it doesn't mean there's a problem with the hammer, it means there's a problem with the operator. If you use the tool in a way it's not intended to be used, then don't be surprised when it does a shitty job.

    2. Re:Ah nostalgia by Anonymous Coward · · Score: 3, Informative

      I had to use Clearcase as my source control system for one company I worked for. The idea was you set up a view spec (a bit like a branch), mapped a drive letter to it and you never had to pull again because it would always reflect that branch. Your local changes went over the top and when it was time to commit you could merge up and commit. In practice what it meant was the source code was constantly changing under your feet, and binaries were constantly stale or in a mystery state because you didn't know what they were compiled against. And because this was IBM software it was unusably slow across WANs, memory hungry and enjoyed triggering random blue screens.

      While a vfs sounds like a great idea, I think in theory it's only of use for very, very large repos. Even then I wonder if the exact same issues that made Clearcase suck would make it suck even with Git.

      To be fair to IBM, ClearCase had this behavior before the three mergers that made it part of IBM. (Pure + Atria -> PureAtria, PureAtria + Rational -> Rational, IBM + Rational -> IBM)

      I actually liked the concept of "wink-in" where derived objects that came from the same source objects and build environment could just be pulled from someone else's build instead of rebuilt. But the system as a whole required a zippy network.

      I don't hold out hope that a vfs on top of another scm solution would be even as fast as ClearCase, and certainly not faster.

    3. Re:Ah nostalgia by AuMatar · · Score: 4, Insightful

      The fact you needed a release team and release engineers to manage a ClearCase implementation is why it's considered one of the worst systems out there, remembered with hatred by almost everyone who used it. A version control system should be easily set up by one admin in an hour or two, and then usable without reams of documentation by any of the engineers. ClearCase failed that.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    4. Re: Ah nostalgia by Mortimer82 · · Score: 2

      You should check out Git LFS (large file storage). It's an extension which only stores references in the Git repo, and then fetches the actual large file off a web server on checkout. It was built for games with large amounts of assets. https://git-lfs.github.com/
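
      For reference, the "references" Git LFS stores are small text pointer files. A rough sketch of generating one, assuming the v1 pointer layout (version line, SHA-256 object ID, size in bytes); the helper name `lfs_pointer` is made up, and real pointers are written by the `git lfs` tooling rather than by hand.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS v1 pointer for the given file contents."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# A large asset would be replaced in the repo by three short lines like these.
pointer = lfs_pointer(b"abc")
print(pointer, end="")
```

      The repository only ever versions the pointer; the asset itself lives on the LFS server and is fetched by object ID when the file is checked out.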

  4. Ah, Microsoft by Kierthos · · Score: 2, Interesting

    "Hey, how can we do what GitHub does, only stupider?"

    --
    Mr. Hu is not a ninja.
  5. Split Your Repo by Luthair · · Score: 2

    If your developers aren't using all the files then you should probably split your repository.

  6. It's the hook to make your repositories break by Ungrounded+Lightning · · Score: 2, Insightful

    The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

    Because its biggest advantage is also one of its greatest inefficiencies, and frankly on a large project chances are you may not need it all. The whole point is you have an identical copy on your machine of what you're working on.

    So buy a bigger disk. They're cheap.

    Why did they do it? It's obvious: it's the bait on the hook to get you to break git and your open source projects (even CURRENT ones) that compete with them.

    By keeping you from having a full copy of the repository, they break git: If there are files that you didn't use in recent checkouts, they're not stored locally or not brought up to date when you pull. If something goes wrong externally - like loss or corruption at a cloud site (such as the recent lost-update debacle) you have no non-microsoft-git-internals-expert way to recover - maybe no way to recover at all.

    You lose the ability to work offline. You lose the ability to look at history, or parts of the repository you haven't been to yet, without being back on line to a working and trustworthy external server, and so on.

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
  7. Re:MS Linux ??? by stooo · · Score: 3, Funny

    Wait a second.
    MS just invented an efficient way to checkout the Linux kernel on windows, so you can get the kernel sources, compile it, and then run Linux and ditch Windows ?
    That's great !!

    --
    aaaaaaa
  8. Re:MS Linux by Billly+Gates · · Score: 2

    Would be a dream come true. Ditch the abomination (Windows) and do like Jobs did by putting a nice GUI on top of a "Unix".

    Yes, it's fashionable to bash MS here. However, like IBM, MS got nicer when losing their monopoly.

    Anyway, like any organization or company, they make great and shitty software. MS makes great office and development tools. Their operating systems and browsers are mediocre at best.

    With GNU they make great operating systems and development tools but shitty office software.

  9. Re:MS Linux by ckatko · · Score: 3, Interesting

    You must have never used their enterprise Dynamics CRM and Dynamics NAV software.

    If you can get it to run at all, half the shit is broken. Hell, the 2013 edition of CRM actually told you NOT to install the newest version of IE because it was "unsupported at this time." Yeah. IE (11?) didn't support CRM. Now I've got to explain to my clients why Windows Update completely broke their brand new system they paid thousands of dollars for.

    Another great "feature" of CRM 2013 was a completely broken IMPORT system. So if you're trying to import anything other than mind-numbingly simple data like "addresses." You have to add stuff with timestamps, dates, and so on. You surely don't want ALL USER MESSAGES to lose their order and timestamps, right? TOO BAD. Even though CRM supports setting the timestamp, for certain record types the importer is completely broken and they never cared to fix it. So the "simple" solution? All you have to do is create a C# plugin, based on non-compiling code from an obscure blog. Oh wait, you can't just write a C# plugin. You have to use their HUGE SDK, their tools to "attach" the plugin to CRM and even that requires hours of reading manuals to figure out the right triggers. And if something goes wrong? ENJOY ZERO USEFUL ERROR MESSAGES. And yes, I turned on tracing (Which requires CHANGING THE REGISTRY in various places.) and debug mode.

    Or how about SQL 2014/2015, which STILL doesn't properly support DPI scaling. The hallmark of Windows 10, and if you use a high resolution with a small laptop screen, random dialog boxes will not only be shrunk and force you to squint to read them... no... that'd be too easy. Some of them are so broken that you can't physically view all of the contents of the dialog AND YOU CAN'T SCROLL TO SEE IT. The dialog dimensions are shrunk and the data is to the right of a window you can't resize!

    THANKS MICROSOFT. I love fixing your shit at my job while having to explain to clients that Microsoft's "It Just Works (TM) if you stay within the MS ecosystem!" is all a bunch of bullshit and the "It works" trademark is actually paved with the blood of IT workers.

    Microsoft could make great products. Too bad they never bother to finish any of them.

  10. Re:MS Linux ??? by tobiasly · · Score: 2

    Wait a second.
    MS just invented an efficient way to checkout the Linux kernel on windows, so you can get the kernel sources, compile it, and then run Linux and ditch Windows ?
    That's great !!

    Seeing as how the only purpose of IE/Edge is to download Chrome/Firefox, I guess they figured that was the next logical step...

  11. Re:MS Linux by Billly+Gates · · Score: 2

    Exactly their OSes are mediocre.

    I bought a 4K screen and good GOD what a nightmare. I hate Apple a lot, but I give Apple kudos: no problems when Retina hit Mac OS X in 2011. It is freaking 2017, so who the hell uses 100 DPI anymore?! Really, a cheap-ass phone has a better screen than a $900 PC.

  12. Re:MS Linux ??? by arth1 · · Score: 4, Informative

    That's really bad naming practice.

    It's consistent naming for that project.
    Any kernel configuration for netfilter with match support gets lower case names, and with target support it gets upper case names. In some cases there is support for both.
    And the only real problem with this is ... Windows.

  13. Re:MS Linux by Blaskowicz · · Score: 2

    Eh, you bought a 4K monitor? Joke's on you. Sorry, but I think it's expected to be crappy, unless you're a dictatorship that obsoletes all hardware/software every few years (Apple) or have only legacy-free DPI independent GUIs and software (Android, web)

    Well, something to blame MS for, and which might be the source of some scaling crappiness.. They still don't allow fonts anti-aliasing without RGB Cleartype? I just hate that. Especially when I just want to use Windows 7 or something on a CRT monitor, which doesn't even have subpixels (yea I don't care, I'm free to use one, 19" size is good in particular). The choice is garbage aliased fonts from 1996 or garbage technicolor fonts. And if going HiDPI, you ought to be able to get rid of that subpixel technicolor garbage anyway. Is that still the case on Windows 10.1 or Windows 10.2? (whatever the pseudo hidden version is)

    Do they still render a low DPI application with Cleartype, then scale it up 200% displaying artifacted garbage fonts, then call it a day?

  14. Re:MS Linux ??? by Lost+Race · · Score: 3, Informative

    It may be consistent, but it is terrible.

    Better would be:

    xt_match_hl.c
    xt_target_HL.c

    Just because you can, doesn't mean you should.

  15. Re:MS Linux ??? by arth1 · · Score: 2

    It may be consistent, but it is terrible.
    [...]
    Just because you can, doesn't mean you should.

    If you grew up with and are used to case sensitive file systems, and aren't aware of limitations in other systems because they've never been part of your work and life, why is this terrible?

    The practice of Makefile + makefile is far from uncommon. With Makefile being the "production" one, and makefile having local modifications.

    If I remember correctly, one language used to have file names like Net::NIS and CRC::CCITT too, until porting was started, and someone discovered that this would break in some other OSes.

    And I'm sure that one or more projects have had a subdirectory named con - which works fine, except in Windows, where con is a reserved word.

    And back in the 80s/90s, it was not uncommon to have both pack and compress on a system, with both .z and .Z extensions. One system I used had its compressed man pages as both .z and .Z, in the same directories. Where .z was used for local pages due to the unpack speed, and .Z was used for NFS mounted pages, due to higher compression. Was that terrible?

    Unless you plan to port something to another OS or are familiar with it, I don't think it's terrible at all if you don't make adjustments for it.

    Or to put it another way, I don't see a lot of Windows users take care to always be consistent with case, so things won't break on case sensitive systems. And I'm fine with that as long as they intend to keep their work in Windows.
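
    For anyone who does intend to port, the two hazards discussed in this subthread (names differing only by case, and Windows-reserved names like con) can be checked mechanically. A minimal sketch, assuming simple Unicode case folding as a stand-in for a real file system's case rules; the sample paths and helper names are illustrative only.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List

# Device names Windows reserves regardless of extension.
WINDOWS_RESERVED = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)


def case_collisions(paths: List[str]) -> Dict[str, List[str]]:
    """Group paths that would map to the same name on a case-insensitive file system."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


def reserved_components(paths: List[str]) -> List[str]:
    """Return paths containing a directory or file name that Windows reserves."""
    flagged = []
    for path in paths:
        for part in PurePosixPath(path).parts:
            # "con.txt" is reserved just like "con", so compare the part before the first dot.
            if part.split(".")[0].casefold() in WINDOWS_RESERVED:
                flagged.append(path)
                break
    return flagged


sample = [
    "net/netfilter/xt_HL.c",
    "net/netfilter/xt_hl.c",
    "Makefile",
    "makefile",
    "src/con/main.c",
]
print(case_collisions(sample))
print(reserved_components(sample))
```

    Run over a repository's file list, a check like this flags the xt_HL.c / xt_hl.c and Makefile / makefile pairs and the con directory before anyone tries a checkout on Windows.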