Slashdot Mirror


Linux Kernel Archives Struggles With Git

NewsFiend writes "In May, Slashdot discussed KernelTrap's interesting feature about the Linux Kernel Archives, which had recently upgraded to multiple 4-way dual-core Opterons with 24 gigabytes of RAM and 10 terabytes of disk space. KernelTrap has now followed up with kernel.org to learn how the new hardware has been working. Evidently the new servers have been performing flawlessly, but the addition of Linus Torvalds' new source control system, git, is causing some heartache by having increased the number of files being archived sevenfold."

45 comments

  1. This is normal. by A+beautiful+mind · · Score: 4, Insightful

    GIT is focused on trading more filespace for less bandwidth. This is important for a lot of scattered developers who can afford 1-2 GB more on a hard drive, but for whom 200-300 MB more would suck on a DSL or dialup connection.

    --
    It takes a man to suffer ignorance and smile
    Be yourself no matter what they say
    1. Re:This is normal. by A+beautiful+mind · · Score: 1

      Also, another necessity was to have files which can be handled without a lot of binary hacking, for example in a case when doing merges, recovery, rollback, etc. This is one of the reasons why there are a lot of files, not one big binary blob.
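
      To make the file count concrete, here is a rough sketch of how a single blob ends up as its own small file in git's loose-object layout: the name is the SHA-1 of a short header plus the content, and the file body is zlib-compressed. The function names git_blob_id and write_loose_object are made up for illustration, and packing many objects into one file is deliberately left out.

```python
import hashlib
import zlib
from pathlib import Path


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 of the header 'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def write_loose_object(objects_dir: Path, content: bytes) -> Path:
    """Store one blob as its own zlib-compressed file under objects_dir.

    The first two hex digits of the id become a subdirectory and the
    remaining 38 become the file name, so every distinct file version
    adds one more small file to the tree.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    object_id = git_blob_id(content)
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(header + content))
    return path
```

      Two versions of the same source file hash to different ids, so they land in two separate files. That is easy to inspect and recover by hand, but it is also what a mirroring tool has to walk.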

      --
      It takes a man to suffer ignorance and smile
      Be yourself no matter what they say
    2. Re:This is normal. by jZnat · · Score: 1

      Which is why they have 10 TB of space. Is the server only for kernel development/source code, or is this also a mirror for downloading snapshots/compiled sources?

      --
      'Yes, firefox is indeed greater than women. Can women block pops up for you? No. Can Firefox show you naked women? Yes.'
    3. Re:This is normal. by A+beautiful+mind · · Score: 1

      Well, I was talking about GIT as a developer's tool, not about the git services offered by kernel.org, but the reasoning above is the cause of the increased file count.

      Answering your question, kernel.org holds a lot of stuff, not only kernel related things, but everything from distributions to various utilities, so yes.

      --
      It takes a man to suffer ignorance and smile
      Be yourself no matter what they say
  2. Linus needs to add 2 more programs to Git by mhesseltine · · Score: 1

    Then he would be able to Git-R-Done

    --
    Overrated / Underrated : Moderation :: Anonymous Coward : Posting
    1. Re:Linus needs to add 2 more programs to Git by mageofchrisz · · Score: 1

      Sounds a little gittish

    2. Re:Linus needs to add 2 more programs to Git by Enrico+Pulatzo · · Score: 1

      Those are obviously flags to the git command line program.

  3. So they're having problems by Anonymous Coward · · Score: 0

    Remind me again why switching from free bitkeeper was such a great idea??

    1. Re:So they're having problems by Anonymous Coward · · Score: 0

      Remind me again why switching from free bitkeeper was such a great idea??

      Because BitMover rescinded the free license.

  4. same reason I dislike Subversion by kwoff · · Score: 2, Interesting

    `grep -r`ing source code under Subversion takes much longer than with CVS, due to all the .svn files.

    1. Re:same reason I dislike Subversion by Anonymous Coward · · Score: 1, Informative
      Just exclude .svn from your grepped files. For further details, look at 'man grep' or 'info grep'.
      -d ACTION, --directories=ACTION If an input file is a directory, use ACTION to process it. By default, ACTION is read, which means that directories are read just as if they were ordinary files. If ACTION is skip, directories are silently skipped. If ACTION is recurse, grep reads all files under each directory, recursively; this is equivalent to the -r option.
    2. Re:same reason I dislike Subversion by kwoff · · Score: 1

      How does that work? I said `grep -r`. If I set the ACTION to not read directories, then it won't recursively grep. I do `| grep -v .svn` to remove output of grep for .svn directories, but it still greps the directories which takes a long time.
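
      One way around this is to prune the .svn directories during the walk itself, so they are never opened at all, rather than filtering grep's output afterwards. A minimal sketch, assuming a hypothetical helper named grep_tree and plain regular-expression matching:

```python
import os
import re


def grep_tree(root, pattern, skip_dirs=(".svn",)):
    """Yield (path, line_number, line) for each match under root.

    Directories named in skip_dirs are pruned from the walk, so their
    contents are never listed or read.
    """
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and keep going
```

      The in-place slice assignment is the important part: rebinding dirnames to a new list would not affect the traversal.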

  5. Re:why blame git? by allanc · · Score: 1

    I actually decided to go ahead and read the article.

    Their two problems are:
    (1) rsync takes a long-ass time to run when it has to compare a crapload of files. The solution they're working on is to build a better rsync that saves its state.
    (2) The i386 architecture sucks. FTFA: "master.kernel.org is still an i386 machine. It's constantly hurting for lowmem since the dentry and inode caches can only live in lowmem." The solution for that is to upgrade master.kernel.org to a 64bit machine.

  6. reiser4 + VCS? by OmniVector · · Score: 2, Interesting

    (slightly) offtopic. wasn't reiser4 supposed to have 'plugin' support, so things like version control could be built directly into the file system? the prospect of being able to, say, type:

    touch bar
    echo 'foo' > bar
    revisions bar
    output of revision history
    cp bar/revision/1 bar-version-1.0.backup

    granted yes, the storage requirements and cpu usage might be horrific, but i think something like this is inevitable in file systems, and certainly i welcome the day it becomes a reality.

    --
    - tristan
    1. Re:reiser4 + VCS? by A+beautiful+mind · · Score: 1

      Hm, i thought there is a thing called etcfs(?), a VMS styled revision system, for config files, but i don't really remember the details...

      --
      It takes a man to suffer ignorance and smile
      Be yourself no matter what they say
  7. Really? by TheAngryMob · · Score: 2, Funny

    I've been struggling with stupid gits for years now. (Da-dum-dum). Thank you! I'll be here all week.

    --

    Don't just game, Dungeoneer
    1. Re:Really? by Xionn · · Score: 2, Funny

      The hell you will! Everyone, get your torch and pitchfork!

  8. File System Scalabilty? by the+eric+conspiracy · · Score: 1

    Aren't file system scalability issues why people start using databases?

    Sounds like a software engineering issue.

    1. Re:File System Scalabilty? by A+beautiful+mind · · Score: 1

      I think someone suggested on the LKML in the early development talks, to use an SQL database. According to my foggy memory, Linus replied something along the lines of that solution being much worse in terms of productivity and speed than using simple files. Basically it would be adding another (unnecessary) layer.

      --
      It takes a man to suffer ignorance and smile
      Be yourself no matter what they say
    2. Re:File System Scalabilty? by jbolden · · Score: 1

      Linus's ideas on CM are really, really bad. While I think he's a great leader in terms of the kernel, I really hope he doesn't end up having much influence on the CM side. Having used database-driven CM systems (Rational, Borland's StarTeam), I find they are far and away better at just about everything than file-based systems. There is simply no comparing the level of complexity of what you can pull out and how you can configure merges and changes.

    3. Re:File System Scalabilty? by A+beautiful+mind · · Score: 2, Insightful

      Except that you're ignoring speed, the need to be decentralised (I cannot stress this enough, it is very much needed in an environment like the one the kernel is developed in) and low system requirements. Currently git needs only a few basic C libraries and bash.

      Actually I spent hours trying to grasp his ideas about GIT, and it clearly shows that he gave it a lot of thought. I also think another SCM has already started integrating GIT code.

      --
      It takes a man to suffer ignorance and smile
      Be yourself no matter what they say
    4. Re:File System Scalabilty? by jbolden · · Score: 2, Interesting

      I agree he's given this a lot of thought. Linus wouldn't have such non mainstream views if he didn't care. Bad ideas can be well thought out.

      Next, I'm not ignoring speed: you can scale a database system up infinitely large. Since database systems support ACID transactions (i.e. line/file source code locking during a transaction) you can have multiple merges going on at once, and thus effective speed is much, much better. For example, Amazon.com uses Oracle as their backend. Think about the number of users and how snappy Amazon feels. Do you really believe that worldwide kernel development is even a small fraction of what Amazon.com has to handle in terms of volume?

      I don't see any reason whatsoever for low system requirements for the database servers. However the clients can run on junk hardware under a database system quite easily. Again think Amazon.

      Finally decentralized, I don't really believe that is needed at all. What Linus seems to want is the ability for people to:

      1) Create forks without him knowing about them
      2) Merge parts of those forks back into his trees at will

      Again high end database based CM systems support that. The trees can be on different schemas within the same database or exported at the database level to different servers. Merges are specific to the schema.

      Seriously, I have yet to hear of anything that Rational doesn't do that Linus wants as a programmer or as a project lead. Decentralized is the best example of this. He certainly claims to need it, but I have yet to understand why.

  9. Re:why blame git? by Mad+Merlin · · Score: 1
    (wouldn't it be cool to store data from your SQL tables in easy-to-parse flat files for instance? That would make recovery and manipulation a lot simpler)

    ...I really hope you're not involved with databases in *any* way.

    The whole point of a database is to isolate you from the actual representation of the data on disk and to make querying for data easy, so you don't have to parse those files at all! For disaster recovery, I pity you if you prefer to try and extract the data manually from the files on disk themselves when you could use one of the (many) tools (that are part of the DBMS!) designed for exactly this purpose.

  10. Well, what filesystem are they using? by jd · · Score: 1
    If kernel.org is running over ext2 or ext3, then it would seem to be a format problem, not a Git problem. These are not designed to be high-performance filesystems.


    On the flip-side, if kernel.org is using XFS, JFS, Reiserfs (I doubt they'd risk Reiser4 yet) or any other very high-performance filesystem, then maybe the problem is one of organization.


    It is rare that you actually need large numbers of files holding very small amounts of data or metadata. What is probably wanted is a virtual layer that allows the software to see those many small files, but where the files are bundled together to be more efficient to access in this kind of a setting.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  11. See? I Told You So! by Larry+McVoy · · Score: 3, Funny

    Bow before my might, l1nux l00s3rs!

  12. Re:why blame git? by Ezdaloth · · Score: 1

    This remark 'i386 sucks because there is little lowmem' is stupid ... let them improve linux to need less lowmem. Why would you always want dentries and inodes in lowmem anyway? I think they could be swapped into highmem just as well. That will still be faster than not having them in mem...

    Using a 64bit arch is a workaround; a solution would be to fix their memory management. If you don't need huge user processes, just increase the kernel address range to 3 GB and shrink userspace to 1 GB. No more problems with lowmem ...

  13. Filesystem? by RealBorg · · Score: 5, Interesting

    Maybe kernel.org should finally consider moving to a more appropriate filesystem than ext3, preferably reiserfs, since it is optimized to handle a lot of small files. Tail packing not only saves disk space but, more importantly, a lot of memory in the block cache.

    1. Re:Filesystem? by Yenya · · Score: 4, Informative
      Disclaimer: I run one of the kernel.org mirrors.

      Ext3 vs. Reiser is not an issue here. FWIW, I use XFS on my mirror volume, and I have also noticed how the git repository increases load on my server. See the CPU usage graph of ftp.linux.cz - look especially at the yearly graph and see how the CPU system time has been increasing for the last two months.

      The problem is in rsync - when mirroring the remote repository it has to stat(2) every local and remote file. So the directory trees have to be read to RAM. Hashed or tree-based directories (reiserfs or xfs) can even be slower than plain linear ext3 directories, because you have to read the whole directory anyway, so linear read is faster.
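
      To illustrate why the cost follows the file count rather than the amount of changed data, here is a small sketch of a stat-based comparison with saved state. It is not how rsync is implemented; snapshot, diff, save_state and load_state are names invented for this example, and the state file is plain JSON.

```python
import json
import os


def snapshot(root):
    """Walk root and stat() every file, returning {relative_path: [size, mtime_ns]}.

    This is the expensive part: one stat call per file, changed or not.
    """
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            state[os.path.relpath(path, root)] = [info.st_size, info.st_mtime_ns]
    return state


def diff(old, new):
    """Compare two snapshots and return (paths_to_copy, paths_to_delete)."""
    to_copy = sorted(p for p, meta in new.items() if old.get(p) != meta)
    to_delete = sorted(p for p in old if p not in new)
    return to_copy, to_delete


def save_state(state, state_file):
    """Persist a snapshot so a later run can reuse it instead of re-walking."""
    with open(state_file, "w") as handle:
        json.dump(state, handle)


def load_state(state_file):
    """Load a snapshot written by save_state."""
    with open(state_file) as handle:
        return json.load(handle)
```

      With a saved snapshot, one side of the comparison becomes a single file read instead of a full tree walk, which is roughly the idea behind an rsync that saves its state.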

      --
      -Yenya
      --
      While Linux is larger than Emacs, at least Linux has the excuse that it has to be. --Linus
  14. 10 TB by jo42 · · Score: 3, Funny
    That's a pretty decent sized pr0n collection they gots there...

    Kernel sources take up, what, only a handful of gigabytes?

  15. seven fold = 2^7? by dtfinch · · Score: 1

    So the rate of files being archived was multiplied by 128?

    1. Re:seven fold = 2^7? by telecsan · · Score: 1

      No, I'm pretty sure that sevenfold just means multiplied by seven.

      http://dictionary.reference.com/search?q=sevenfold

      Same reason a trifold wallet has three sections, not four.

    2. Re:seven fold = 2^7? by Anonymous Coward · · Score: 0

      You probably mean "not eight"

  16. Re:why blame git? by rossifer · · Score: 4, Insightful

    (wouldn't it be cool to store data from your SQL tables in easy-to-parse flat files for instance? That would make recovery and manipulation a lot simpler)

    *snicker*

    *laugh*

    *great rolling peals of laughter*

    *sigh*

    *wipes tear from eye*

    You haven't done much work that actually required databases (or that would massively benefit from a relational programming model). The whole point of moving from flat files to a database is so that the data is stored already parsed, recovery is done by a tool provided by the db vendor, and manipulation is done within rules (constraints) that prevent "programming accidents" (bugs) or "pilot error" (users) from breaking relationships between parts of your data. That eliminates most of the need for recovery right there.
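
    As a small illustration of the constraint point, here is a sketch using SQLite from Python's standard library. The changeset and file_delta tables are hypothetical, made up for this example, and not taken from any real CM tool.

```python
import sqlite3


def open_store():
    """Create an in-memory store where every delta must belong to a changeset."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE changeset (
            id INTEGER PRIMARY KEY,
            author TEXT NOT NULL
        );
        CREATE TABLE file_delta (
            id INTEGER PRIMARY KEY,
            changeset_id INTEGER NOT NULL REFERENCES changeset(id),
            path TEXT NOT NULL
        );
        """
    )
    return conn


def try_insert_orphan(conn):
    """Try to record a delta for changeset 999; return True if the database accepts it."""
    try:
        conn.execute(
            "INSERT INTO file_delta (changeset_id, path) VALUES (999, 'Makefile')"
        )
    except sqlite3.IntegrityError:
        return False  # rejected: no such changeset exists
    return True
```

    The buggy write is refused by the database itself, so the broken relationship never reaches the disk and there is nothing to recover later.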

    CM systems get much more powerful and IMHO, simpler, when you start using a decent database as the backend. As for distributed work, there are plenty of good databases that inexpensively and easily fit onto any modern workstation (PostgreSQL is my personal favorite) that can act as a local backing store, giving you fully detached functionality and the benefits of a relationally organized system.

    Regards,
    Ross

  17. Re:why blame git? by Anonymous Coward · · Score: 0

    "*snicker*

    *laugh*

    *great rolling peals of laughter*

    *sigh*

    *wipes tear from eye*"


    *yawn...*

    "sqlfs"

    (that's developer-speech for a boring but feasible project that would make you shove your buzzword db-admin-speech up your arse)

    End of story

  18. perhaps this might help by Lord+REL · · Score: 1

    http://kerneltrap.org/node/5070

    This interview with the maintainers has a comment from somebody who claims he asked by email and got the reply that ext3 is used.

    If that's not good enough, perhaps one can guess from "At this time, the servers run Fedora Core and use the 2.6 kernel provided by RedHat." that they might be using ext3, which is the default.

    1. Re:perhaps this might help by jd · · Score: 2, Interesting
      Actually, I believe Fedora Core 3 has most of the other filesystems compiled in, you just won't get the main partition formatted with them.


      Since the "smart" way to run such a server is to have the main FS on one disk and the data on another (this avoids tracking the head back and forth), the data partition can be just about anything.


      Now, the fact that the maintainers have said they are using Ext3 is rather more convincing to me. Foolish beyond belief, but convincing. I would rather use a "less reliable" FS like XFS and a RAID array to deal with errors, as I would have the performance benefit with no significant risk.


      I also regard Red Hat's obsession with Ext3 (even though Linux is all about choice, and it is choice that makes Linux different) as unhealthy. SGI, for some time, produced XFS-aware installer replacements for Red Hat Linux, and it would have been very easy for Red Hat to roll the differences in.


      (In fact, it would likely help a lot, as they could likely have worked out some kind of sponsorship deal with SGI, where SGI helped fund some of Red Hat's work, in exchange for Red Hat promoting SGI's software for Linux.)


      Lastly, how are the Linux developers going to encourage development and innovation, if they use an entirely "safe" off-the-shelf distribution? I don't particularly want kernel.org to crash, but nor do I want people to turn away on the grounds that even the Linux kernel developers don't trust their own work.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  19. Re:why blame git? by rossifer · · Score: 1

    "sqlfs"

    (that's developer-speech for a boring but feasible project that would make you shove your buzzword db-admin-speech up your arse)


    Point #1:

    A quick google search yielded a few links for "sqlfs".

    Now, are you really talking about a filesystem implemented in a relational database? You're pretty confused if you think you contradicted what I wrote. That's exactly what I'm advocating, except that I'm advocating that this database-backed filesystem also be CM-aware.

    For this particular file-centric application, I *love* filesystems implemented as front-ends to databases.

    Point #2:

    You're really confused if you think I'm a db-admin or that I'm in favor of a CM tool that requires a DB-admin to install or use. I'm in favor of an ACID database as a core part of an effective CM system. Whether end users know it's on their machine is completely irrelevant (in fact, I think the user shouldn't have to know it's there).

    Point #3:

    Why would that project be boring? Sounds like it would be pretty neat if you ask me. I'm actually a little disappointed that none of the links yielded a project with recent activity.

    Regards,
    Ross

  20. Re:why blame git? by mikolas · · Score: 1

    Why do you assume that a database backend must be a relational one? A lot of systems designed to (a) store vast amounts of data and (b) retrieve it quickly do not use an RDBMS as the backend, as it tends to suck for tasks common in archival, digital asset management and maybe even configuration management. They use file/directory based indices. It really kills performance when tree-like data (read: directory structure and branched source) is forced into the relational model. Add some metadata tagging to that and you'll see some horrible performance when the server is spending its time matching data via crossref tables.

  21. Re:why blame git? by rossifer · · Score: 1

    There certainly are problem sets for which the relational model is inappropriate. However, there is no large data management problem for which a flat file is more appropriate than a relational database (assertions by the Prevayler group notwithstanding).

    As for your assertion that tree-like data is a poor fit for relational programming, it's an issue of having a deeper understanding of relational programming (a "kind" of programming parallel to procedural or object-oriented programming). Trees fit perfectly well into relational models, you just need to understand how a relational system should manage them. If you're ever interested in learning more, a great book to read would be: SQL For Smarties.

    Even more germane to the specifics of trees in databases, the same author has a newer book (that I haven't read) Trees and Hierarchies in SQL for Smarties. Only one person has reviewed it and they weren't overwhelmed, so who knows.

    Tags are only a problem if they weren't designed into the schema from the start (i.e. CVS-style tags strapped onto RCS files). If a tag is actually an association of deltas as in Subversion (much more amenable to a relational model), asking for "the state of the system at 'Release 1.1' (changeset 1134)" becomes fairly simple to implement and much simpler to optimize.
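
    For a concrete picture of a tree living in a relational table, here is a minimal sketch using SQLite from Python's standard library. It uses a plain adjacency list (each row points at its parent) plus a recursive query, which is just one simple option and not necessarily the approach the books above recommend. The node table and the function names are assumptions made for this example.

```python
import sqlite3


def build_tree(conn, rows):
    """Create the node table and load (id, parent_id, name) rows into it."""
    conn.execute(
        """
        CREATE TABLE node (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES node(id),
            name TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)", rows
    )


def full_paths(conn):
    """Return every node's path from the root, built by a recursive query."""
    query = """
        WITH RECURSIVE walk(id, path) AS (
            SELECT id, name FROM node WHERE parent_id IS NULL
            UNION ALL
            SELECT node.id, walk.path || '/' || node.name
            FROM node JOIN walk ON node.parent_id = walk.id
        )
        SELECT path FROM walk ORDER BY path
    """
    return [row[0] for row in conn.execute(query)]
```

    The database does the descent in one statement, so the client never has to issue one query per directory level.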

    Regards,
    Ross

  22. I blame Linus by Fred+Nerk · · Score: 1

    Are you reading this man?

    You're responsible for all the world's problems! The linux kernel, bitrot on my cds, war in Iraq, Guantanamo Bay, and now git!

    Come on Linus, clean up your act!

    (Sorry if this offends *anyone*)

    --
    Anything is possible, except skiing through revolving doors.
  23. Re:Well, what filesystem are they using? ext3 OK by anon+mouse-cow-aard · · Score: 2, Interesting
    Gripes about ext3 performance are probably outdated.

    We did some tests comparing reiser3, xfs, and ext3 with the dir_index option on 2.6 kernels. We were writing thousands (ok tens of thousands) of small files into a couple of directories (specialized app, you don't want to know.)

    When directories got large, ext3 with the hash lookups (between 800 and 1500 creations per second on newish hardware) ran much faster than xfs, oh and several orders of magnitude faster than ext3 without the directory hashing. reiser3 was slower than xfs.

    We were thinking of going with xfs anyways, because it was so attractive that the directories would shrink when files were deleted (whereas ext3 directories stay big, with holes in them), but xfs would crash on us after a couple of days. So in March we chose ext3. We have approximately 9 million files in a single file system at the moment, and it seems to work ok, but the system crashes every three weeks or so. We think we might have tortured it too much, and can reasonably keep only about 2 million files on-line, so we'll see if that helps.

    of course, ymmv.

  24. Re:Well, what filesystem are they using? ext3 OK by Anonymous Coward · · Score: 0

    you don't want to know

    But I _do_ want to know, waaahh!
    I wonder if JFS would work, maybe it is too slow - I found that some xfs filesystems that I had got corrupted, but even after power outages, etc., my jfs partitions just keep plugging away.