Slashdot Mirror


XFS merged in Linux 2.5

joib writes "According to this notice, the XFS journaling file system has been merged into Linus bitkeeper tree, to show up in 2.5.36." Ya just know someone out there wants to have every journaling file system on one drive just 'cuz.

34 of 271 comments

  1. XFS FAQ by semaj · · Score: 5, Informative

    There's an XFS FAQ and a load more information about it on SGI's site - which points out that several large distributions have had XFS support for a while by default.

    Still, it's noteworthy that Linus has finally accepted it into his tree...

    --
    Meep meep
  2. Re:Comparison? by Wee · · Score: 3, Informative
    Does anyone have a link to any comparisons of all these journaling filesystems, showing their strengths and weaknesses?

    Google is always your friend.

    -B

    --

    Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.

  3. Re:Wow by GigsVT · · Score: 2, Informative

    Well because it used to make extensive VM changes to the kernel, which was keeping it out for so long. The way I understand it, it was a direct port of the IRIX XFS mostly, and thus also had some IRIXisms that could cause problems on Linux. I read a recent post to lkml that indicated they had cleaned it of VM changes, but I really didn't expect a merge so soon.

    --
    I've had enough abrasive sigs. Kittens are cute and fuzzy.
  4. Re:Comparison? by rindeee · · Score: 5, Informative

    http://aurora.zemris.fer.hr/filesystems/

  5. You missed out the flamewar on the mailing list! by Anonymous Coward · · Score: 2, Informative

    For those of you who don't subscribe to the Linux kernel development mailing list, it was absolutely not a case of XFS just being accepted, there was a HUGE flamewar about it, which only ended a few days ago.

    Mailing list archive

    Just search in the page for XFS and you'll find the thread.

  6. Re:Comparison? by rindeee · · Score: 2, Informative

    And this: http://oss.software.ibm.com/developer/opensource/jfs/project/pub/jfs040802.pdf

  7. My understanding by 0x0d0a · · Score: 3, Informative

    ...is that the breakdown goes something like this:

    ext3:
    * can be told to journal everything, including data (not just metadata) -- most theoretical reliability.
    * is backwards compatible with ext2

    xfs:
    * tweaked for streaming large files to/from disk -- probably best at sequential reads/writes.

    reiserfs:
    * best performance with many, many files in a single directory.
    * Can save space on very small files with -tail option

    jfs:
    * really don't know. :-)

    1. Re:My understanding by dmelomed · · Score: 2, Informative

      Just to note: ReiserFS is also inodeless. This means you can't run out of them, as far as I can imagine.

    2. Re:My understanding by jgarzik · · Score: 3, Informative
      If reiserfs was inode-less, it would not work with Linux.

      Even NTFS has inodes, they simply call them "MFT records."

  8. Yes! by zentec · · Score: 3, Informative


    Despite being a little more resource intensive than ext3, XFS has to be one of the better file systems available. I've used it (obviously) on SGI's and it's been outstanding, and opted to use it before ext3, JFS and Reiserfs (although I believe Reiserfs is just as nifty).

    Having it accepted into the kernel makes upgrades a world easier, and hopefully I'll be able to move away from SGI's modified Red Hat installation. Although, I doubt Red Hat will support it out of the box.

    The other issue that needs fixing with XFS is the lack of an emergency boot disk. XFS enabled kernels are huge, and that creates a slight problem when booting from floppy.

  9. Re:Silly question by MasterD · · Score: 5, Informative

    XFS supports ACLs (access control lists), which are much better than standard UNIX permissions.

    XFS is an extent based filesystem which means that you don't end up wasting tons of space having to allocate a 4K block for every small file. And you don't need to jump through tons of indirect blocks to get large files.
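
    To put a rough number on the indirect-block point, here is a small Python sketch. It assumes an ext2-style layout (12 direct pointers, then single, double and triple indirect blocks) with 4K blocks and 4-byte block pointers. The function name and its defaults are illustrative only, not taken from any filesystem's code.

```python
def indirect_blocks_needed(file_blocks, ptrs_per_block=1024, direct=12):
    """Count the indirect blocks a block-mapped file needs (ext2-style layout assumed)."""

    def ceil_div(a, b):
        return -(-a // b)

    remaining = max(0, file_blocks - direct)
    total = 0

    # Single indirect: one block holding up to ptrs_per_block data-block pointers.
    if remaining > 0:
        total += 1
        remaining -= min(remaining, ptrs_per_block)

    # Double indirect: one top block plus one leaf block per ptrs_per_block data blocks.
    if remaining > 0:
        covered = min(remaining, ptrs_per_block ** 2)
        total += 1 + ceil_div(covered, ptrs_per_block)
        remaining -= covered

    # Triple indirect: one top block, a middle level, and the leaf level.
    if remaining > 0:
        covered = min(remaining, ptrs_per_block ** 3)
        leaves = ceil_div(covered, ptrs_per_block)
        middles = ceil_div(leaves, ptrs_per_block)
        total += 1 + middles + leaves
        remaining -= covered

    if remaining > 0:
        raise ValueError("file too large for this pointer layout")
    return total


one_gib_in_blocks = (1 << 30) // 4096              # 262144 blocks of 4K
print(indirect_blocks_needed(one_gib_in_blocks))   # 257
```

    Under those assumptions a 1 GB file drags along 257 extra blocks of pointers. If the same file happens to be laid out contiguously, an extent-based layout can describe it with a (start block, length) record, or a handful of them, since real extent records have a maximum length.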

    XFS allocates inodes on the fly so it grows with what data you put on there. Once again, not wasting space up front. And it sticks the inode near the file itself so the head does not have to move far on the hard drive.

    XFS supports extended attributes which can be used for all kinds of extensions later on.

    XFS has been around since 1994 and is the most mature of the journalling filesystems.

    And there are many other reasons that I cannot think of right now.

  10. Re:Silly question by fruey · · Score: 3, Informative
    Performance. Different systems are going to take more or less overhead depending on the task. Some daemons might write a lot of data to logs, you want this to be done asynchronously, you may not need the data so badly, you don't need journalling perhaps. (so use ext2??)

    Or you have a proxy, you don't care if suddenly your cached data is lost, it will soon be refilled, it's not important data, you want performance without too much security (reiserfs)?

    In fact each filesystem has inherent limits on inodes, filenames, permissions, etc... so you go with any that has a minimum for each thing you need. Journalling you don't really need unless you want to be able to step backwards or repair your filesystem in more interesting ways...

    --
    Conversion Rate Optimisation French / English consultant
  11. Re:Silly question by felicity · · Score: 2, Informative

    XFS also allows you to grow the filesystem live (ie: mounted). This is great for those of us who use it in conjunction with a volume manager (I use LVM). lvextend to enlarge the volume, growfs to enlarge the filesystem. No downtime required. :)

    It's also a 64-bit filesystem, so you could have extremely large files and filesystems, although my understanding is that the Linux VFS system can't handle the large sizes right now (1Tb max filesystem for instance). XFS is the standard filesystem for SGI's IRIX which doesn't have the restrictions. :)

  12. here's an interesting read by someonehasmyname · · Score: 4, Informative

    this pdf compares journaling file systems with non-journaling systems like ffs or freebsd's soft updates.

    --
    Common sense is not so common.
  13. Re:Silly question by blakestah · · Score: 3, Informative

    1) Backup strategies. Versions of dump are available for ext2/ext3 and xfs, but not for ReiserFS (I don't know about JFS). (I don't mean to start a page cache/buffer cache debate).

    2) Journalled file systems mean fast re-boots on power outages

    3) Speed. This depends on your usage. A huge mail spool machine may use ReiserFS on the mail spool. For most people it is a wash.

    4) Ext3 can be remounted as ext2, and really good file system checking tools exist for ext2/3.

    Mostly, though, you CAN just stick with whatever the default suggests.

  14. Re:Comparison? by auferstehung · · Score: 5, Informative

    You could check out Daniel Robbins' "Advanced filesystem implementor's guide" over on IBM's developerworks. He covers reiserfs, ext3, and XFS and I believe there is a link to articles on JFS in the Resources section at the bottom of the page.

    --
    Logic is not Divine.
  15. Red Hat DOES NOT have XFS... by Booker · · Score: 4, Informative

    This isn't correct... if it were correct, I would not have spent so much time working on a
    custom Red Hat installer for XFS. :)

    There is some XFS-aware code in the Red Hat Linux installer, but there is no kernel support or userspace tools available, so what you propose simply can't work.

    However, SuSE, Mandrake, Gentoo, Slackware, and Debian (to some extent) do have XFS support.

  16. Re:2.6 kernel goodies by paulbd · · Score: 3, Informative

    the skipping in your mp3 player has nothing to do with disk i/o. it has to do with scheduling latency. that is, unless your mp3 player has been poorly designed, which many of them have been.

    also, 2.5/2.6 is still missing the better patches for low latency (from andrew morton), and so its performance is still not as good as it could be.

    2.6 doesn't beat windows at audio latency when using WDM drivers for windows. it (along with 2.2 and 2.4) beat windows with MME drivers. the WDM audio driver model is very fast, and windows has always done a better job of handling scheduling latency than linux (other than with andrew's patches). in 2.4 there are still places in a mainstream kernel that will stall the entire box for up to 1/10 second.

  17. Re:Cool by ShawnX · · Score: 3, Informative

    Try my patches at http://xfs.sh0n.net/2.4. They merge in XFS with 2.4.20-pre7 (current) and rmap =)

    Shawn.

    --
    Everyone wants a Tux in their life.
  18. Re:My personal experience by kubrick · · Score: 3, Informative

    # man tune2fs

    (you can turn fscks off, change the number of mounts or make it time-dependent, etc.)

    --
    deus does not exist but if he does
  19. Re:My personal experience by psamuels · · Score: 3, Informative
    Every month or so, I had to sit through the following:
    "Warning: drive has been mounted more than 30 times, check forced" on the ext3 partition

    This is a safety feature. Filesystem corruption can be caused by hardware funnies as well as software bugs. Your memory could be flaky, your hard drive could be on its way out, your IDE cable could be too long, your SCSI chain could be improperly terminated, your motherboard might be iffy, your CPU could be running too hot. There might be software bugs in the generic kernel, the block / scsi drivers, the ext3 code, or even some random driver that has nothing to do with filesystems or memory management.

    Because of this, ext2 and ext3 have tunable parameters for how often to force an fsck, overriding the fact that the fs is supposed to be in a known clean state. Apparently reiserfs does not have this safety feature - or does it? (I don't know.)

    If this annoys you, turn it off. 'man tune2fs', or specifically,

    tune2fs -c0 -i0 /dev/your/filesystem
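
    For the curious, the decision those two knobs control can be sketched in a few lines of Python. This is a simplified model, not e2fsck's actual code: the function and parameter names are made up for illustration, and it assumes a value of 0 switches the corresponding trigger off, as the -c0 and -i0 settings above do.

```python
def check_is_forced(mount_count, max_mount_count, seconds_since_check, check_interval):
    """Return True if a cleanly unmounted filesystem would still get a forced check.

    A value of 0 for max_mount_count or check_interval disables that trigger,
    mirroring 'tune2fs -c0 -i0' (simplified model, names are illustrative).
    """
    by_count = max_mount_count > 0 and mount_count >= max_mount_count
    by_time = check_interval > 0 and seconds_since_check >= check_interval
    return by_count or by_time


DAY = 24 * 60 * 60
print(check_is_forced(30, 30, 10 * DAY, 180 * DAY))   # True: mount count reached
print(check_is_forced(30, 0, 10 * DAY, 0))            # False: both triggers off
```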

    HTH..

    --
    "How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
  20. Re:Excellent by psamuels · · Score: 2, Informative
    Now if we could just get some stable NTFS read/write support I would be set.

    It's on the way. Read-only NTFS (rather poor in 2.4) has been rewritten and is much improved in 2.5, and a certain subset of read-write (writing new contents to an existing file) is reported to be stable. I haven't tried it. Full read-write may or may not make 2.6.0 but you can be sure it is in active development.

    --
    "How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
  21. An interesting thing about XFS... by Scooby+Snacks · · Score: 3, Informative
    I hear that it's the only Linux filesystem that is endian-safe. IOW, you can move it from a system of one endian type to a system of the other type and it will still work. No other filesystem for Linux currently is able to make that claim.

    I find that very cool, for some reason. I guess one practical application is if you have a box that is the only one of that type (either big-endian or little-endian) that dies and you need to recover the data.

    --
    Runnin' around, robbin' banks all whacked on the Scooby Snacks...
    1. Re:An interesting thing about XFS... by flight666 · · Score: 2, Informative
      Bzzt. Wrong answer, thanks for playing.

      Ext2 has been endian safe since kernel v1.{lownum}

      Ext3 has _always_ been endian safe.

      reiserfs became endian-safe about 6 months ago.

      Don't know, but I would suspect the same for JFS, etc.
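
      What "endian safe" comes down to is that every on-disk field is written in one fixed byte order and always read back in that same order, whatever the host CPU. A minimal Python sketch, using the ext2 superblock magic (0xEF53, stored little-endian) as the example field; the helper names are hypothetical:

```python
import struct

EXT2_MAGIC = 0xEF53  # ext2/ext3 superblock magic number


def write_magic(value):
    """Encode a 16-bit field in a fixed (little-endian) on-disk byte order."""
    return struct.pack("<H", value)


def read_magic(raw):
    """Decode the field using the same fixed order, whatever the host CPU is."""
    return struct.unpack("<H", raw)[0]


on_disk = write_magic(EXT2_MAGIC)
print(on_disk.hex())                          # 53ef
print(hex(read_magic(on_disk)))               # 0xef53
# Reading with the wrong byte order scrambles the value:
print(hex(struct.unpack(">H", on_disk)[0]))   # 0x53ef
```

      A filesystem that instead dumps fields in the host's native order only reads back correctly on machines of the same endianness.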

  22. Re:Questions... by psamuels · · Score: 3, Informative
    The stable kernel is usually released a couple of months after the feature freeze (bugs permitting).

    +1, Funny. I think you mean after the code freeze, which usually happens a month later, well, two, three, ok, six months later. You also forgot to mention that Linus usually has multiple freezes, and the one on 31 Oct is only the first. With each successive freeze he puts on a more threatening tone, crying woe unto them who would dare tempt him to thaw the kernel again. Eventually the first code freeze happens, then maybe one or two more of those....

    Even odds we get a 2.6.0 by June.

    --
    "How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
  23. Re:Not just journaling by jbolden · · Score: 3, Informative

    Apple is designing versions of the tools that support complex attributes for use with the HFS+ filesystem. The specific issues are slightly different, but since their code is open source there's no reason it couldn't move over to Linux.

  24. Re:My experience with XFS by red_dragon · · Score: 3, Informative

    Just wondering, are you using the custom kernel from Gentoo? If so, have you compiled your kernel with either/both of the low latency patch and/or the preemptible kernel patch? What are your experiences with either of those two options when running XFS? I'd expect the use of either of those two to improve a system's responsiveness to user interaction when doing a lot of disk I/O, but if those don't help when using XFS, I wonder what kind of black magic is going on inside that code.

    --
    In Soviet Russia, Jesus asks: "What Would You Do?"
  25. Re:2.6 kernel goodies by Lucky+Kevin · · Score: 2, Informative
    * ALSA support. ALSA is a pain to keep patching your kernel with every redownload. ALSA is a Good Thing, if a pain in the butt to configure. My guess is that there will be decent front ends on top of the thing when distros start shipping 2.6.

    From the ALSA site:

    "2002-02-13 ALSA has been integrated to the official Linux 2.5 tree! The initial merge is in patch-2.5.5-pre1."

    Yippiee! Great sound, here we come!

    --
    Kevin
    "It's not the cough that carries you off, it's the coffin they carry you off in" O. Nash
  26. Re:My experience with XFS by josh+crawley · · Score: 5, Informative

    ---"- Recoveries after a crash are really fast. Almost immedate, better than ext3 and reiserfs."

    Hmmm.. I'd assume that ext3 wouldn't be as good.. A fix on a fix usually sucks. And then I've heard about Reiser's file truncation problems. I use Reiser and have had no big problems.

    ---"- _BUT_ there's something strange. Basically during disk I/O, the whole system is unresponsive. While I'm compiling something, KDE becomes slow, playing videos is not smooth at all, etc. Just as if it didn't scale at all for concurrent disk access. So I finally switched back to ReiserFS just because of this. Maybe the 2.5.x series of kernel behaves differently."

    I've had the same problems on 2.2.X when I didn't tweak my HD's to dma66 32 bit. Try doing a:

    hdparm /dev/(drive linux is on)
    hdparm -tT /dev/(drive linux is on)

    If you don't like those settings, drop into single-user mode with / mounted read-only and run this command:

    hdparm -X66 -d1 -u1 -m16 -c3 /dev/hda

    Now manually do a fsck on that partition. If you have errors, it's a bad mode. But if it works, then redo the -tT option (it's a benchmark).

    Be aware that 2.4 does most of this for you, but sometimes it picks too low a setting (so your performance sucks). Then again, you could have an unsupported IDE device.

    All the best..

  27. Re:Wow by dcstimm · · Score: 2, Informative

    the reason they pulled it is because of gentoo ppc. Very unstable on ppc.

  28. Re:Journalling filesytems... by psamuels · · Score: 5, Informative
    What exactly is 'journalling'?

    Here's the basic theory. Think about what happens when you make a change on a filesystem - say you add a file to a directory. The system has to:

    • add a filename entry to the directory itself
    • allocate the initial blocks for the file, from the pool of free space in your filesystem
    • create the inode, which is a block of information about the file. The inode includes file modification times, owner, permissions, file type (regular file? directory? etc), and the location of its actual data blocks
    • if there are too many data blocks, allocate one or more "indirect blocks", which are extensions to the inode so it can hold more data blocks - inodes usually have a fixed size. Initialise these with the correct block numbers as well.
    • actually write the file contents to the data blocks you have allocated

    If you don't do these things in the correct order, there will be times when the on-disk structure is not consistent. For example, you may have modified the directory to include an entry for the new file, but the entry points at an inode which hasn't been filled in yet. Or the inode may be filled in, but the free space pool hasn't been updated to correspond with the data block allocations in the inode. Throw in other modifications like deleting files or making them larger or smaller, and it gets pretty complicated. If the machine happens to crash at such a time - or the power goes out and you don't have a UPS - the disk will be in an inconsistent state. This has two major consequences:

    1. the filesystem checker, or fsck (the equivalent Windows utility is scandisk) will have to run next time you boot, and go over the whole structure of your filesystem, which can take minutes or even hours on a large enough disk (80 GB takes a long time unless your disks are very fast). Nobody wants to sit around for 15 minutes waiting for the server to finish rebooting.
    2. depending on exactly what was written to disk in what order, the fsck utility may not even be able to restore your filesystem to a consistent state at all, or it may lose important files or directories in the process of doing so.

    Journalling prevents both problems (barring bugs in your OS or hardware, of course) by writing transactions to your filesystem. Instead of making changes directly to your directories, inodes, free block maps, etc, the filesystem batches up such changes by spooling them to a separate area on disk, the journal. Then, when it has written enough such changes to account for an entire, self-consistent transaction, it puts a marker in the journal indicating "transaction complete" and starts copying these changes to their usual locations on disk. Meanwhile, the next transaction can be spooled onto the end of the journal area, and it will get its own "transaction complete" marker when it is done. A journal can hold a lot of transactions - only limited by the journal size, which is usually configurable. When a transaction has been fully copied out of the journal to its final locations, it is re-labeled "journal free space" in the journal.

    How does this help? Imagine that the machine goes down while a transaction is still incomplete in the journal. Next time you boot, the OS "replays" the journal: it looks for all the completed transactions and commits each part of a transaction to its correct permanent location. It ignores journal free space, and any incomplete transactions - essentially rewinding the filesystem state to the end of the last completed transaction. There is never any danger of "partially updated" filesystem state, since each transaction starts and ends with a known-consistent state.

    (Ah, but what happens if the OS goes down again while replaying a journal? No big deal: next time it boots, it just replays the same journal again, which produces the same result as it would have done the first time.)

    Some simplifications, obviously, but that's the basic idea. Did it help?

    The different levels of journalling have to do with whether all filesystem data is journalled or only some of it. You usually only journal metadata, which is the filesystem structure: directories, inodes, free block maps, etc. That's because copying all your file contents twice (first into the journal, then into its permanent location in the filesystem) is quite slow. The main purpose of a journal is not to guarantee pristine file contents in the event of partially written files, but to ensure a consistent view of the filesystem as a whole - so you can avoid that long fsck and avoid ever ending up with a partially or fully scrambled filesystem (modulo hardware failure, of course).
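
    If it helps to see the replay rule as code, here is a toy model in Python. It is only a sketch of the idea described above: the "disk" is a dictionary, the class and method names are invented for illustration, and it ignores everything a real journal has to worry about (ordering of physical writes, checksums, wrapping, reclaiming journal space).

```python
class ToyJournal:
    """A toy metadata journal: changes are spooled, then applied on replay."""

    def __init__(self):
        self.records = []

    def log_write(self, key, value):
        # Spool one change into the journal instead of touching the disk.
        self.records.append(("write", key, value))

    def commit(self):
        # The "transaction complete" marker.
        self.records.append(("commit",))

    def replay(self, disk):
        # Apply only transactions that reached their commit marker.
        pending = []
        for record in self.records:
            if record[0] == "write":
                pending.append((record[1], record[2]))
            else:
                for key, value in pending:
                    disk[key] = value
                pending = []
        # Whatever is still pending never committed, so it is ignored.
        return disk


journal = ToyJournal()
journal.log_write("dir:/home", ["a.txt"])
journal.log_write("inode:7", {"size": 0})
journal.commit()
journal.log_write("dir:/home", ["a.txt", "b.txt"])  # crash before commit

disk = journal.replay({})
print(disk)
```

    The half-written second transaction never reaches the disk, and calling replay again on the same disk leaves it unchanged, which is why a crash during replay is harmless.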

    HTH..

    --
    "How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
  29. Re:Related question by foobar104 · · Score: 3, Informative
    Just FYI, XFS on IRIX can support files up to 9 million terabytes (9 EB) and filesystems up to 18 million terabytes (18 EB).

    It's more complex under Linux. Here's the Linux-specific answer to this question from the FAQ:
    Q: Does XFS support large files (bigger than 2GB)?

    Yes, XFS supports files larger than 2GB. The large file support (LFS) is largely dependent on the C library of your computer. Glibc 2.2 and higher has full LFS support. If your C lib does not support it you will get errors that the value is too large for the defined data type.

    Userland software needs to be compiled against the LFS compliant C lib in order to work. You will be able to create 2GB+ files on non LFS systems but the tools will not be able to stat them.

    Distributions based on Glibc 2.2.x and higher will function normally. Note that some userspace programs like tcsh do not correctly behave even if they are compiled against glibc 2.2.x

    You may need to contact your vendor/developer if this is the case.

    Here is a snippet of email conversation with Steve Lord on the topic of the maximum filesize of XFS under linux.

    I would challenge any filesystem running on Linux on an ia32, and using the page cache to get past the practical limit of 16 Tbytes using buffered I/O. At this point you run out of space to address pages in the cache since the core kernel code uses a 32 bit number as the index number of a page in the cache.

    As for XFS itself, this is a constant definition from the code:

    #define XFS_MAX_FILE_OFFSET ((long long)((1ULL<<63)-1ULL))

    So 2^63 bytes is theoretically possible.

    All of this is ignoring the current limitation of 2 Tbytes of address space for block devices (including logical volumes). The only way to get a file bigger than this of course is to have large holes in it. And to get past 16 Tbytes you have to use direct I/O.

    Which would mean a theoretical 8388608TB file size. Large enough?
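
    The figures in this comment can be re-derived with a little arithmetic. The Python below is a rough check, not anything from the kernel source. It assumes 4K pages on ia32 and, for the block device limit, 512-byte sectors counted with a 32-bit number; that second assumption is mine, chosen because it matches the 2 Tbyte figure quoted above.

```python
PAGE_SIZE = 4096          # ia32 page size in bytes
SECTOR_SIZE = 512         # assumed sector size for the block-device limit
TIB = 1 << 40             # one "Tbyte" as a power of two

lfs_limit = (1 << 31) - 1                   # signed 32-bit file offset, no LFS
page_cache_limit = (1 << 32) * PAGE_SIZE    # 32-bit page index in the cache
block_dev_limit = (1 << 32) * SECTOR_SIZE   # assumed 32-bit sector number
xfs_max_offset = (1 << 63) - 1              # XFS_MAX_FILE_OFFSET

print(lfs_limit)                     # 2147483647, just under 2GB
print(page_cache_limit // TIB)       # 16
print(block_dev_limit // TIB)        # 2
print((xfs_max_offset + 1) // TIB)   # 8388608
```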
  30. Re:Why is kernel-image so big? by Bob[Bob] · · Score: 2, Informative

    You're basically correct about how SGI did the port... they created an IRIX to Linux VFS mapping layer, as described in the papers on this page: XFS Talks and Papers.

  31. More on inodes (was Re:My understanding) by jgarzik · · Score: 3, Informative
    AFAIK in ReiserFS inodes are not used the way they're in traditional FS'. You certainly need to present the inode layer to the OS, but. They use Balanced trees for block allocation. AFAIK you do not end up with a fixed number of "inodes" after ReiserFS is created.

    You're mixing filesystem features up. To clear things up a bit,

    • Individual inode records need not be of a fixed size.
    • The inode table (total number of inodes) need not be a fixed size, and it can even be moved around, and spread across, various physical locations on the disk.
    • The inode table can either have a special-cased storage method (ext2/3), or simply be stored using the filesystem's own block allocation methods -- in effect treating the inode table as a "normal file" (jfs, ntfs, several others) This second method has the property of being very flexible: just as it is trivial to extend the length of a normal file [i.e. append], it is trivial to add new inodes to an inode table that the filesystem treats internally as a "normal file."
    There are wild and varied ways to store inodes. But ReiserFS definitely has them. :)
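
    As a toy illustration of the difference between a table sized up front and one stored like a normal file, here is a short Python sketch. Both classes are hypothetical models with invented names; neither reflects the actual on-disk layout of ext2/3, jfs, ntfs or ReiserFS.

```python
class FixedInodeTable:
    """Inode count chosen up front; allocation fails once the table is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inodes = []

    def allocate(self, metadata):
        if len(self.inodes) >= self.capacity:
            raise OSError("no free inodes")
        self.inodes.append(metadata)
        return len(self.inodes) - 1


class GrowableInodeTable:
    """Inode table treated like an ordinary file: new inodes are appended."""

    def __init__(self):
        self.inodes = []

    def allocate(self, metadata):
        self.inodes.append(metadata)
        return len(self.inodes) - 1


fixed = FixedInodeTable(capacity=2)
fixed.allocate({"name": "a"})
fixed.allocate({"name": "b"})
try:
    fixed.allocate({"name": "c"})
except OSError as err:
    print("fixed table:", err)

growable = GrowableInodeTable()
for name in ("a", "b", "c"):
    growable.allocate({"name": name})
print("growable table holds", len(growable.inodes), "inodes")
```

    Either way there are still inodes; what changes is only whether their number is capped when the filesystem is created.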

    Regards,

    Jeff