Slashdot Mirror


Ext4 Advances As Interim Step To Btrfs

Heise.de's Kernel Log has a look at the ext4 filesystem as Linus Torvalds has integrated a large collection of patches for it into the kernel main branch. "This signals that with the next kernel version 2.6.28, the successor to ext3 will finally leave behind its 'hot' development phase." The article notes that ext4 developer Theodore Ts'o (tytso) is in favor of ultimately moving Linux to a modern, "next-generation" file system. His preferred choice is btrfs, and Heise notes an email Ts'o sent to the Linux Kernel Mailing List a week back positioning ext4 as a bridge to btrfs.

66 of 510 comments

  1. BTRFS? REALLY? by erroneus · · Score: 4, Interesting

    Couldn't they come up with a better name than "BuTteR FaSe?" I know I can't be the only one who read it like that. Call it anything but that.

  2. BTRFS? by Anonymous Coward · · Score: 5, Funny

    So it incorporates compression by vowel omission? Brllnt!

  3. Why not ZFS? by mlts · · Score: 5, Interesting

    Unless ZFS has patent issues, why not just work on having ZFS as Linux's standard FS, after ext3?

    ZFS offers a lot of capabilities, from no need to worry about a LVM layer, to snapshotting, to excellent error detection, even encryption and compression hooks.

    1. Re:Why not ZFS? by PhrostyMcByte · · Score: 5, Informative
      I am not aware of the differences, but from Theodore Ts'o:

      people who really like reiser4 might want to take a look at btrfs; it has a number of the same design ideas that reiser3/4 had --- except (a) the filesystem format has support for some advanced features that are designed to leapfrog ZFS, (b) the maintainer is not a crazy man and works well with other LKML developers (free hint: if your code needs to be reviewed to get in, and reviewers are scarce; don't insult and abuse the volunteer reviewers as Hans did --- Not a good plan!).

    2. Re:Why not ZFS? by Anonymous Coward · · Score: 5, Informative

      The ZFS developers specifically wanted the open sourced code to be under a GPL incompatible license, hence it has been released under CDDL (there was an interview with the Sun open source rep, can someone provide info/links about this?). So ZFS cannot be part of the kernel, but there is a FUSE port of ZFS and according to http://en.wikipedia.org/wiki/ZFS#Linux Sun is investigating a Linux port, so there may be something good coming.

    3. Re:Why not ZFS? by Mad+Merlin · · Score: 4, Insightful

      ZFS offers a lot of capabilities, from no need to worry about a LVM layer, to snapshotting, to excellent error detection, even encryption and compression hooks.

      ...and that's its biggest problem. ZFS duplicates a lot of functionality that belongs outside of a filesystem. All of the above can already be done using any Linux filesystem, so why keep around a second copy of all that code that implements those features for just a single filesystem?

      ReiserFS was (is) in a similar situation, where it also duplicates a lot of functionality that doesn't belong in the filesystem. Not only does this make it harder to maintain, but it makes a lot of features filesystem specific that shouldn't be.

    4. Re:Why not ZFS? by volsung · · Score: 4, Informative

      I don't know about the patents, but the current major obstacle is the license. ZFS, as part of the OpenSolaris kernel, is available under the CDDL. The CDDL is incompatible with the GPL, ruling out ZFS inclusion directly in the Linux kernel. Sun has hinted that they could dual license the Solaris kernel under CDDL and GPL, but that hasn't happened yet. Small parts of the ZFS filesystem code have been GPLed so they could be added to grub to support booting ZFS root filesystems.

      There is a userspace port of the ZFS code and utilities which avoids the license problem by using FUSE to separate the filesystem code into a separate process: ZFS-FUSE.

      If Sun were to ever dual-license ZFS, the ZFS-FUSE codebase would be a good place to start for porting the code to direct kernel inclusion. (Note: Sun, via their subsidiary, Cluster File Systems, now employs the author of ZFS-FUSE to use his port as an optional backend for the Lustre file system.)

    5. Re:Why not ZFS? by Wonko · · Score: 5, Informative

      ZFS duplicates a lot of functionality that belongs outside of a filesystem. All of the above can already be done using any Linux filesystem, so why keep around a second copy of all that code that implements those features for just a single filesystem?

      It wouldn't be possible to duplicate RAID-Z with LVM. Other features of ZFS are very handy, but RAID-Z is by far my favorite. Same storage density as RAID 5 but without the horrible write performance. RAID-Z uses copy-on-write to avoid RAID 5's required read for every non-cached write.

      Being able to create filesystems just as easily as creating directories is quite handy as well, though. IIRC, the filesystem sizes in ZFS are controlled by a quota style system. That is much simpler than shrinking an LV (if your filesystem supports shrinking), then adding a new LV, and then creating a filesystem. I don't know about you, but I am always a bit nervous when I have to resize an LV.

    6. Re:Why not ZFS? by 42forty-two42 · · Score: 3, Informative

      Sun has some patents on ZFS; the CDDL grants a license to these patents if you're deriving from the original ZFS source, but then you can't link it to linux.

      FWIW, I doubt ZFS-FUSE would be a good place to start - FUSE is totally different from Linux's actual vfs layer, after all.

    7. Re:Why not ZFS? by mritunjai · · Score: 4, Informative

      The ZFS developers specifically wanted the open sourced code to be under a GPL incompatible license, hence it has been released under CDDL (there was an interview with the Sun open source rep, can someone provide info/links about this?). So ZFS cannot be part of the kernel, but there is a FUSE port of ZFS and according to http://en.wikipedia.org/wiki/ZFS#Linux Sun is investigating a Linux port, so there may be something good coming.

      Rather, GPL is incompatible with anything else that can't be re-licensed as GPL, and that includes GPL v2 and v3, which can't even be mixed among themselves. Maybe first we clear that mess, right?

      ZFS is present in both Mac OSX and FreeBSD, thank you! They have no license issues whatsoever.

      --
      - mritunjai
    8. Re:Why not ZFS? by clarkkent09 · · Score: 5, Insightful

      (b) the maintainer is not a crazy man and works well with other LKML developers

      Also important, he might be more focused due to not being in prison for first degree murder

      --
      Negative moral value of force outweighs the positive value of good intentions.
    9. Re:Why not ZFS? by GrievousMistake · · Score: 5, Interesting

      Huh. One of the interesting things about Reiser4 from an end-user perspective was Hans Reiser's plans for file metadata. From what I can find about btrfs, it currently doesn't even support normal extended attributes. There was also talk about making it easy for developers to extend the filesystem with plugins that could add e.g. compression schemes.
      I can't really recognize anything from Hans Reiser's ramblings in the btrfs documentation that isn't standard file system improvements already seen in e.g. ZFS. Does anyone have any specific examples of the ZFS-leapfrogging features referred to?

      --
      In a fair world, refrigerators would make electricity.
    10. Re:Why not ZFS? by Ivlis · · Score: 3, Informative

      Parts of ZFS are patented, but the license allows running it in userspace using FUSE.

      I'm confused: if we ask people why not run ZFS using FUSE, they reply because it's slow (I'm assuming it's possible to load ZFS at boot time using an initrd). And if we ask people which is better monolithic or microkernel, they reply microkernel. But ZFS using FUSE would be like a microkernel driver, so which is it?

    11. Re:Why not ZFS? by Anonymous Coward · · Score: 5, Funny

      Huh. One of the interesting things about Reiser4 from an end-user perspective was Hans Reiser's plans for file metadata.

      No, the most interesting feature of ReiserFS is this one (look to the far right).

      --
      ReiserFS: It puts the "stab" in "/etc/fstab".

    12. Re:Why not ZFS? by Xaria · · Score: 3, Informative

      No, it wouldn't. A microkernel loads modules into the kernel space. You're talking about running in user space. So when an application makes a system call, the kernel has to translate it to the FUSE layer into user space. So there's an extra layer consuming time. On top of that, kernel space isn't generally swapped out, but user space can be. Obviously it should never happen, but wouldn't it suck if your disk driver was swapped out?

      See the diagram at the bottom of this page: http://fuse.sourceforge.net/

      Also, ZFS (like ReiserFS) handles its metadata differently from ext3, so you have to translate the differences between the virtual file system and ZFS. This is why writes are significantly slower. Reads are not so bad. The NFS penalty would be huge. See http://www.linux.com/feature/138452

    13. Re:Why not ZFS? by deniable · · Score: 5, Funny

      Yep, BeaTeR FS is a kinder, gentler alternative to Reiser FS.

    14. Re:Why not ZFS? by mml · · Score: 5, Informative

      > Rather, GPL is incompatible with anything else that can't be re-licensed as GPL, and
      > that includes GPL v2 and v3, which can't even be mixed among themselves.

      Saying that GPLv2 and GPLv3 "can't even be mixed among themselves" is wrong and
      misleading.

      Section 14 of GPLv2 specifically deals with the problem of later versions of the
      licence and sets out the options. A copyright holder can choose to allow work to be used
      with later versions, such as GPLv3, or can choose not to. There are also more
      complex options. The licence itself doesn't force the choice one way or the other.

      Matt

    15. Re:Why not ZFS? by mvdwege · · Score: 3, Interesting

      Come back when ZFS has decent filesystem maintenance tools.

      And don't give me that 'ZFS doesn't need a fsck' crap. SGI tried to pull that with XFS, and it didn't work. Filesystem (at least metadata) corruption will happen, and once it does, ZFS doesn't have the tools to fix it.

      Mart

      --
      "I know I will be modded down for this": where's the option '-1, Asking for it'?
    16. Re:Why not ZFS? by adrianwn · · Score: 5, Interesting

      A microkernel loads modules into the kernel space.

      No, that's the opposite of a microkernel. A microkernel loads its modules (then often called "servers") into user space. If the kernel and its drivers etc. run in the same address space (as is the case with, e.g., Linux), then we're talking about a monolithic kernel, even if it can dynamically load modules.

    17. Re:Why not ZFS? by BrokenHalo · · Score: 4, Interesting

      not to belittle ext3 and ext2 for that matter, but their time is beginning to pass, and something new needs to replace it.

      I'm not sure that I see why, unless you're simply bored with the older filesystems. Something as critical as this should not be driven by what is trendy at any given moment. If one has no need for particular advanced bells or whistles, there is no need to use them.

      For instance, since for historical and security reasons I keep /boot on its own separate partition which is mounted readonly, it makes sense here to not have anything trying to write to a journal, so ext2 is still a very good choice here. As the partition is tiny (only 20MB) it takes a fraction of a second to run e2fsck over it when or as required, so there is nothing to be gained by journalling it anyway.

      I still use ReiserFS3 on most of my other partitions, since I don't have any intention of changing the filesystem until I change the drives. ReiserFS is still a good choice for my purposes anyway.

    18. Re:Why not ZFS? by BrentH · · Score: 4, Insightful

      The things you think belong outside of a filesystem only 'belong' there because that's what years of narrow-minded development have taught you. Look at it this way: /everything/ related to file storage is managed by ZFS. What could be more convenient than that? Because of this, ZFS can do things much faster and much more reliably than any combo of LVM with a filesystem. Why chain together tools yourself, and manually think about things you really shouldn't be thinking about, when you can have a good filesystem take care of it for you.

      ZFS is easier to maintain, from a user's perspective (and that's the job of development, to make usage easier, not ever the other way round).

    19. Re:Why not ZFS? by Kjella · · Score: 4, Informative

      Rather, GPL is incompatible with anything else that can't be re-licensed as GPL, and that includes GPL v2 and v3, which can't even be mixed among themselves. Maybe first we clear that mess, right?

      With a copyleft license, you intend to secure certain rights to the end user to the work as a whole. It is at the very essence of what the GPL tries to do compared to non-copyleft open source licenses or the LGPL that only covers the parts consisting of LGPL code, not any sort of "flaw" or "mess". Licenses work so that you must simultaneously fulfill all of them, so the GPL denies using GPL code with code that denies end users the four freedoms the FSF professes. That is the intention by design, but then there is some collateral damage as well-intended licenses are rendered GPL-incompatible due to details, since the GPL (or any copyleft license) couldn't allow open-ended arbitrary restrictions without losing all meaning. The GPLv2 was particularly flawed in this area since it was made fairly long ago with this not much in mind, and in the GPLv3 they did a lot of work to improve compatibility, leading to section 7, which among other things says:

      Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:
      a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or
      b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or
      c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or
      d) Limiting the use for publicity purposes of names of licensors or authors of the material; or
      e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or
      f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors.

      That vastly improves compatibility with the licenses the GPL wants to be compatible with, so collateral damage is reduced to a minimum. It's still very easy to write a license, even a free software license, that isn't GPL compatible though. If you look at the reason the CDDL and GPL are incompatible, it's that the CDDL's copyleft conditions and the GPL's copyleft conditions clash because they both try to do the same thing. It's almost impossible to write two copyleft licenses where one (or both) doesn't see the other as adding "additional restrictions" on the end user. Even the GPL can't escape that as it tries to improve the GPL unless you have the "and later" clause. Then again, there's no reason such a license should have to be revised often - it took 16 years before releasing version three and it'll probably be longer until next time it's needed.

      --
      Live today, because you never know what tomorrow brings
    20. Re:Why not ZFS? by tkinnun0 · · Score: 3, Insightful

      It's not wrong or misleading. If you have GPLv3 and GPLv2 code, you can mix them if the GPLv2 code's copyright holder gives you the permission. Likewise, if you have BSD and GPLv2 code and wish to retain the BSD licence. The mechanisms may be different but the end result is the same: GPLv2 in itself doesn't give you the permission, you need permission from the copyright holder.

    21. Re:Why not ZFS? by gbjbaanb · · Score: 3, Insightful

      Why chain together tools yourself, and manually think about things you really shouldn't be thinking about, when you can have a good filesystem take care of it for you.

      Because that's the Unix way - build small components (applications) and chain them together to create something out of the parts. I mean, why have ls and grep when you can have lsgrepsortfind? Really, the point is to have small, easily maintained apps that do 1 thing well rather than 1 app that does everything possibly well, but more usually poorly as it's difficult to maintain and ensure it works properly. Not to mention the bloat when it replicates functionality already provided.

      This may not be the best model for a critical component like a filesystem, but on the other hand, reliability of a filesystem is paramount, so keeping it as small as possible is probably a good idea.

    22. Re:Why not ZFS? by Wonko · · Score: 3, Interesting

      I often hear that claim but never see any support of that claim.

      The closest thing to RAID-Z in the Linux kernel is the RAID 5 DM. If you want to write a 4k block to some random location that isn't currently fully cached the DM has to read 1 stripe from each disk in the array, make the 4k change, recompute the checksums, and then flush that stripe back to each disk. The default stripe size is 64k. That means if you have 4 drives you would be performing a 256k read and a 256k write just to change a single 4k block. Of course, that is worst case. Best case is you have to overwrite the entire stripe with a fresh 256k block of data.
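The worst-case numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in this comment, assuming the model it describes (one 64k chunk per drive is read and written back for any small change); the function name and parameters are made up for illustration, and a real RAID 5 implementation may well do less I/O than this.

```python
def full_stripe_rmw(num_drives: int, chunk_kib: int, write_kib: int) -> dict:
    """Worst-case I/O for one small write under the model in the comment:
    read one chunk from every drive, modify, then write every chunk back."""
    stripe_kib = num_drives * chunk_kib          # data touched per direction
    return {
        "read_kib": stripe_kib,
        "written_kib": stripe_kib,
        # total bytes moved per byte the application actually changed
        "amplification": (2 * stripe_kib) / write_kib,
    }


if __name__ == "__main__":
    # 4 drives, 64k chunks, one 4k block changed
    print(full_stripe_rmw(num_drives=4, chunk_kib=64, write_kib=4))
```

With those inputs the model gives 256k read plus 256k written for a 4k change, which is the overhead that copy-on-write is meant to avoid.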

      ZFS and RAID-Z get around that problem by just writing the changed blocks to an unused part of the disk. Once the write is complete it just moves the pointer to the new block location. This is copy-on-write, and this is where the performance boost comes in over RAID 5. With RAID-Z you should never be required to read the whole stripe to do a write.

      RAID-Z also allows for dynamic stripe sizing. That helps get more optimal efficiency on small files and large files.

      The dynamic stripes aren't terribly important, but if you could figure out a way to do the copy-on-write without the filesystem having very fine-grained control and knowledge of the underlying array we would all love to hear about it :).

    23. Re:Why not ZFS? by Wonko · · Score: 3, Interesting

      Of course, the very same copy-on-write will also result in massive file fragmentation, carefully smearing your dbf files over the entire platters, making your SAN caches useless. Over time resulting in horrible read performance.

      If you want good database performance you probably want as little file system overhead as possible between your database and the disk. I wouldn't have expected ZFS to be the most efficient place to store a database.

      I would have to imagine your SAN is just doing uninformed readaheads. That would be a very good way to fill up a cache with useless data if you are reading from a fragmented file system. :)

      This issue with copy-on-write will be entirely sidestepped in a few years by flash storage's lightning fast seek times and smarter caching. IIRC, isn't the reason that zfs-fuse uses so damn much ram because ZFS has its own caching logic built in? If the file system knows where all the blocks in a file are it can do readaheads on its own.

      ZFS is certainly a huge improvement for anyone used to ufs and disksuite, but I have to say that using it in the real world it's not all it's cracked up to be.

      I don't have enough of my own real world experience with ZFS to argue with your experience. In fact, what I know of how ZFS works makes me believe that it can cause exactly the problems of which you speak.

      However, I don't think that means that there aren't a ton of workloads that wouldn't be impacted by these problems. I also believe that a large percentage of those workloads could benefit greatly from some of the features ZFS brings to the table.

      RAID-Z is nice when you need write performance but can't afford the drives for RAID 10. I can think of plenty of times when it would have been nice to have a writable snapshot to chroot into.

      Hell, I would even love to have ZFS on my laptop for snapshotting and cloning. It also seems like ZFS send/recv would make for much more efficient backups of my laptop than rsync buys me.

      Mixing together the features of various layers is, imo, no matter how tempting, simply the wrong approach. Proceed further along that road and you get to record based filesystems or even more special-purpose variants. I mean, there are even more optimizations that you can do if you know the _contents_ of the files.

      I think we are getting some pretty neat new features out of our file systems by blurring the lines between the layers. I wouldn't be surprised if we stumble upon a few more neat ideas before we're through.

      There is still quite a bit of improvement to make even before we have to make the file system aware of what is inside our files. :)

    24. Re:Why not ZFS? by QuoteMstr · · Score: 3, Informative

      I'm definitely in the layered-design-is-good, ZFS-is-an-abomination camp. But I do have to point out that mlockall would keep a userspace filesystem server from being swapped out, and with realtime priority, the process could even have some guaranteed CPU time. Userspace isn't that bad.
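For the curious, here is roughly what the mlockall part looks like from a userspace process. It is a minimal sketch in Python via ctypes, not anything taken from ZFS-FUSE; the flag values are the ones from Linux's <sys/mman.h> (an assumption if you are on another platform), and the helper name is invented. The call usually needs root or a raised RLIMIT_MEMLOCK, so the function reports failure instead of raising.

```python
import ctypes
import os

# Flag values as defined in <sys/mman.h> on Linux (assumed; check your platform).
MCL_CURRENT = 1  # lock every page that is mapped right now
MCL_FUTURE = 2   # also lock pages that get mapped later


def lock_process_memory() -> bool:
    """Ask the kernel to keep all of this process's pages resident in RAM.

    Returns True on success and False if mlockall() failed, which is the
    usual outcome without privileges or with a low RLIMIT_MEMLOCK.
    """
    libc = ctypes.CDLL(None, use_errno=True)   # symbols of the running process, incl. libc
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        print(f"mlockall failed: {os.strerror(err)}")
        return False
    return True
```

A filesystem daemon would call lock_process_memory() once at startup, before it begins serving requests; the realtime priority mentioned above is a separate step and is not shown here.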

    25. Re:Why not ZFS? by diegocgteleline.es · · Score: 4, Informative

      One of the differences I can find between btrfs and ZFS is that ZFS explicitly avoided a fsck utility, and btrfs is explicitly designed with features meant to make fsck even more powerful than it is on usual filesystems like ext3. In btrfs, data structures have "back references", and the fsck can be used while the filesystem is mounted.

      IMO, this is a btrfs advantage. ZFS has checksums and will find errors, but will only be able to self-heal the errors in a redundant configuration. On a single disk, ZFS will find the error thanks to checksums but will not be able to recover your data. Since ZFS was mainly designed for systems that will use redundant configurations, it may make sense there, but desktops are never going to do such things. IMO the ZFS people were a bit elitist here - "let's build a filesystem so good that we won't need a fsck". But in the real world you _are_ going to need a fsck util. Only in exceptional and very rare cases, but you're going to need it.
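The detect-versus-heal distinction can be shown with a toy example. This is a sketch only: SHA-256 stands in for whatever checksum the filesystem really uses, the function names are hypothetical, and "copies" is just a list of byte strings playing the role of mirrored disks.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in checksum; not a claim about any real on-disk format.
    return hashlib.sha256(data).hexdigest()


def read_verified(copies: list, expected: str) -> bytes:
    """Return the first copy whose checksum matches the stored one.

    With a single copy, a mismatch can only be reported. With a second
    copy there is something to fall back on, which is the self-heal case.
    """
    for data in copies:
        if checksum(data) == expected:
            return data
    raise IOError("checksum mismatch on every copy: detected, not repairable")


if __name__ == "__main__":
    good = b"payload"
    stored = checksum(good)
    bad = b"pay1oad"                           # one corrupted byte

    print(read_verified([bad, good], stored))  # mirror: the good copy is returned
    try:
        read_verified([bad], stored)           # single disk: error is only detected
    except IOError as exc:
        print(exc)
```

On the single-copy path the checksum still does its job of refusing to return bad data; what is missing is a way to get the good data back.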

      Of course that doesn't make ZFS a bad filesystem, but it's an advantage for btrfs and linux.

    26. Re:Why not ZFS? by makomk · · Score: 3, Informative

      Yeah, and if you get any sort of metadata corruption, you're apparently fscked. See, for example, this thread in alt.sysadmin.recovery. Several of the posters say they basically had to manually fix the filesystem after it got screwed up - how very 1970s.

    27. Re:Why not ZFS? by jhol13 · · Score: 3, Insightful

      If a filesystem detects errors it is helping me (at least) there. No matter what creates them.

      I do not think SSDs will solve storage problems: there will be flaky adapters and other IF chips/firmware, etc.

  4. What I'd like by grasshoppa · · Score: 4, Interesting

    I would like transparent, administrator controlled, versioning. Modified a word document and saved it in place? root can go back and get the old version ( and, alternatively, the user can. root could disable this functionality ).

    The pieces are in place, it's doable, just someone needs to program it.

    --
    Mod me down with all of your hatred and your journey towards the dark side will be complete!
    1. Re:What I'd like by corsec67 · · Score: 4, Interesting

      So, you want a Versioning file system? Just make sure you never let that run on /var.

      OSS is like capitalism: If you see a need, then make it and distribute it.

      --
      If I have nothing to hide, don't search me
    2. Re:What I'd like by bendodge · · Score: 4, Interesting

      That leads to space-bloat.

      What I'd like are files with expiration dates. When I make up some twiddly chart or download some funny video, I keep it because I'll probably want it tomorrow or next week, but then I tend to forget to delete it later. It would be really cool if creating a user data file prompted you with a simple dialog specifying how long you want it. Common options like 1 Week, 1 Month, 6 Months, 2 Years, Forever would do most of the time, and an option to choose a custom date would cover the rest. When a file expired, it would be placed in some kind of pseudo-Trash Bin that could be reviewed and emptied when you want more space.

      I'd also love something tag-based instead of hierarchy-based. For example, I store photos by Year > Month > Event, but sometimes I want to make another category for photos of a specific person. This means I either make duplicates or have to dig around to find things. If I could tag them with dates (that should actually be auto-generated from the EXIF), event, place, and people I could then just browse for files with a particular tag.

      Come to think of it, these ideas are both somewhat akin to how a human brain stores stuff.

      --
      The government can't save you.
    3. Re:What I'd like by fuzzyfuzzyfungus · · Score: 4, Funny

      Wouldn't the world be so, so, so much nicer if users understood that the actual importance of a document is reflected in how carefully they stored it, not how angry they get when you can't get it back?

    4. Re:What I'd like by Tubal-Cain · · Score: 3, Insightful

      It sounds useful, but I think it would turn out to be about as annoying as UAC. Better to keep your files organized and prune occasionally.

  5. I can't believe... by arrenlex · · Score: 5, Funny

    Butter FS? Are you kidding me?

    Here is your first official list of jokes. Please contribute.

    1. You're still running ext4? I can't believe it's not ButterFS!
    2. But will it run on toast?
    3. Will fsck be renamed to butterknife?
    4. If your system overheats will your filesystem melt?
    5. If you use ButterFS too much, will it turn into FAT?
    6. If you leave ButterFS on your volume too long, will your hard drive start to reek?
    7. Will the next version of ButterFS be called GoatButterFS, just like the next version of Leopard is Snow Leopard?
    8. "Tough" notebooks will never have their hard drives formatted with ButterFS, because if you dropped them, they would always land hard drive down.
    9. When you submit your dead ButterFS hard drive to a data recovery centre, will they have an intern lick it to get the data off instead of putting it under a read head?

    These are getting kind of desperate -- your turn now.

    Honestly, what is it with FOSS and crappy names? (looking at you, gimp)

    1. Re:I can't believe... by penguinchris · · Score: 3, Funny

      When your hard drive fails and you hear those awful noises, you can say it's churning butter.

    2. Re:I can't believe... by Anonymous Coward · · Score: 5, Funny

      These are getting kind of desperate -- your turn now.

      Yeah, you're spreading yourself a bit thin.

      • I hear some of the features in btrfs have been refined from ext3cow.
      • I touch'd a file on a btrfs disk, and now it's sticky!
      • I hear the standard block size of btrfs is 8 oz.
      • How can I make a business case for btrfs? I'm all for introducing new tech, but my boss only cares about how it will affect our margarins.
      • Will btrfs keep my servers from grinding? I'm a bit worried that if they churn too much, my files will separate!
      • And most importantly, In an emergency, can I use btrfs for a smoother fsck?
  6. Butters' FS! by russlar · · Score: 3, Funny

    Great for playing "Hello Kitty! Adventures"

    --
    Anybody want my mod points?
  7. Re:BTRFS? REALLY? by Anonymous Coward · · Score: 4, Funny

    Butter Fase probably intended as Butter Face.

    Sounds like "But Her Face" as in: She has a great body, but her face...

  8. Whoa! by aevans · · Score: 5, Funny

    A Linux article on Slashdot!?

    1. Re:Whoa! by icydog · · Score: 3, Funny

      You must be... old here.

  9. Re:BTRFS? REALLY? by initialE · · Score: 5, Insightful

    Why not? It's a good analogy for FOSS after all. Great software, robust and all, but her face...

    --
    Starbucks, Harbuckle of Breath.
  10. Re:BTRFS? REALLY? by hampton · · Score: 5, Funny

    You're right. BTRFS is really silly. I recommend that the shortened form be ButtFS.

  11. Re:BTRFS? REALLY? by blahplusplus · · Score: 5, Insightful

    "Couldn't they come up with a better name than "BuTteR FaSe?" I know I can't be the only one who read it like that. Call it anything but that."

    I read it as:

    BeTteR FileSystem

    I guess we'll have to part ways :P

  12. Re:BTRFS? REALLY? by spazdor · · Score: 5, Funny

    Good, strong file-bearing hips!

    --
    DRM: Terminator crops for your mind!
  13. You're both right. by SanityInAnarchy · · Score: 5, Interesting

    ZFS duplicates a lot of functionality that belongs outside of a filesystem.

    Very true.

    It wouldn't be possible to duplicate RAID-Z with LVM.

    Also true.

    And the features which could be duplicated, couldn't be done nearly as well without a little more knowledge of the filesystem.

    The real problem here is that we're finding out that generic block devices aren't enough to do everything we want to do outside the filesystem itself. Or, if they are, it's incredibly clumsy. Trivial example: If I want a copy-on-write snapshot, I have to set aside (ahead of time) some fixed amount of space that it can expand into. If I guess high, I waste space. If I guess low, I have to either expand it (somehow, if that's even possible) or lose my snapshot.

    A filesystem which natively implemented COW could also trivially implement snapshots which take up exactly as much space as there are differences between the increments. But because of the way the Linux VFS is structured, this kind of functionality would have to be in a single filesystem, and would be duplicated across all filesystems. Best case, it'd be like ext3's JBD, as a kind of shared library.

    A humble proposal: We need another layer, between the block layer and the filesystem layer -- call it an extent layer -- which is simply concerned with allocating some amount of space, and (perhaps) assigning it a unique ID. Filesystems could sit above this layer and implement whatever crazy optimizations or semantics they want -- linear vs btree vs whatever for directories, POSIX vs SQL, whatever.

    The extent layer itself would only be concerned with allocating extents of some requested size, and actually storing the data. But this would be enough information to effectively handle mirroring, striping, snapshotting, copy-on-write, etc.
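To make the proposal a little more concrete, here is a toy model of such a layer in Python. Everything in it is hypothetical (the class, its method names, the in-memory dicts standing in for disks); it only illustrates that if writes never overwrite an extent in place, a snapshot is just a copy of the name-to-extent map, and it costs extra space only for extents that change afterwards.

```python
class ExtentLayer:
    """Hypothetical extent layer: hands out extents with unique IDs and
    supports copy-on-write snapshots. No on-disk format, no freeing of
    unreferenced extents; this is a sketch, not a design."""

    def __init__(self):
        self._data = {}        # extent id -> bytes ("physical" storage)
        self._next_id = 0
        self.live = {}         # name -> extent id (current view)
        self.snapshots = []    # frozen copies of the name -> extent id map

    def _new_extent(self, payload: bytes) -> int:
        eid = self._next_id
        self._next_id += 1
        self._data[eid] = bytes(payload)
        return eid

    def allocate(self, name: str, size: int) -> None:
        # A fresh zero-filled extent of the requested size.
        self.live[name] = self._new_extent(bytes(size))

    def write(self, name: str, payload: bytes) -> None:
        # Copy-on-write: the old extent is left alone, the name is repointed.
        self.live[name] = self._new_extent(payload)

    def snapshot(self) -> int:
        # Copies only the map, never the data.
        self.snapshots.append(dict(self.live))
        return len(self.snapshots) - 1

    def read(self, name: str, snapshot=None) -> bytes:
        view = self.live if snapshot is None else self.snapshots[snapshot]
        return self._data[view[name]]

    def physical_extents(self) -> int:
        return len(self._data)


if __name__ == "__main__":
    layer = ExtentLayer()
    layer.allocate("a", 4)
    layer.allocate("b", 4)
    snap = layer.snapshot()            # no new extents are created here
    layer.write("a", b"new!")          # exactly one new extent
    print(layer.physical_extents())    # 3
    print(layer.read("a", snap), layer.read("a"))
```

Note that nothing had to be reserved ahead of time for the snapshot, which is the contrast with the block-device approach described earlier in this comment.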

    It wouldn't be universal -- I've said nothing about the on-disk format, and, indeed, some filesystems exist on Linux solely for that purpose -- vfat, ntfs, udf, etc. Those filesystems could be done pretty much exactly the way they're done now. After all, the existence of a block layer in no way implies that every filesystem must be tied to a block device (see proc, sys, fuse, etc.)

    But I think it would work very well for filesystems which did choose to implement it. I think it would provide the best of ZFS and LVM.

    I haven't actually been seriously following filesystem development for years, so maybe this is already done. Or maybe it's a bad idea. If not, hopefully some kernel developers are reading this.

    --
    Don't thank God, thank a doctor!
  14. Re:B-tree based Filesystem by AmberBlackCat · · Score: 3, Funny

    That's exactly what they're doing. The plan is to limit every directory to exactly two files or subdirectories that will be kept in alphabetical order. That way, you can find any file on your drive in log(n) time. Future updates are planned for people who have more than two songs by the same artist.

  15. Re:BTRFS? REALLY? by deniable · · Score: 4, Funny

    I read it as BeaterFS and wondered if it was too soon for ReiserFS jokes.

  16. when ext4 is feature complete it will be the #3 fs by ZeekWatson · · Score: 4, Interesting

    I'd like to know why Ted Ts'o and others are working on ext4. Even when ext4 is feature complete, it will be the #3 filesystem in Linux in terms of features and scalability, behind xfs and jfs. I'd like to know what grudge Ted Ts'o and others have against xfs and jfs, because they basically won't even acknowledge those filesystems.

    btrfs does have some nice-looking features; it's basically a GPL rewrite of ZFS.

    The weakness with Linux is in the LVM or EVMS layer. They both suck in that they are not enterprise ready (i.e. multi-TB filesystems, 100+ MB/s sustained read/write): they cause unexplained IO hiccups, lockups and kernel panics. LVM/EVMS certainly work fine for Joe Blow's HTPC, or a paltry 100GB database, but they fall down under serious load.

    This is the problem with open source. Certain areas, like filesystem development, attract all the developers, and other areas like LVM/EVMS are seen as busting rocks and nobody wants to work on them. The result is we get a plethora of second-rate filesystems (i.e. ext4) and a buggy LVM/EVMS layer that nobody wants to work on.

  17. Re:If you want a blazingly fast file system.... by moosesocks · · Score: 4, Interesting

    Max Volume Size: 8 TiB.

    That's not enough. Given that 1TB storage devices are on the market now, that could become outdated quite quickly. You'd be foolish to adopt that sort of filesystem, unless you were absolutely positive that you'd never upgrade (unlikely).

    Honestly, ZFS seems like it's the holy grail of filesystems. There are a few small issues that might need to be worked out, though it seems as close to "ideal" as you'd ever be able to get.

    --
    -- If you try to fail and succeed, which have you done? - Uli's moose
  18. Re:If you want a blazingly fast file system.... by Kent Recal · · Score: 3, Interesting

    Well, it looks interesting feature-wise but they seem to be explicitly targeting SuSE - which is a no-go for most people.
    From a glance at the docs (hey, at least they have docs, that's a plus) it also seems like it's tied to specific versions of EVMS and other parts of the kernel, thus if you don't run a "blessed, certified" SuSE kernel with all the nasty patches then you're on your own.

    Just google for "debian|gentoo|redhat|... novell nss filesystem". Apparently nobody even tried to run NSS on another distro, or at least didn't write about it.

    I, for one, would only touch this on a blackbox, vendor-supported appliance but never consider it for a production server of my own (none of which run SuSE).
    If they worked towards integrating it into the mainline kernel, now that would be nice.

  19. Re:Ring 1 and 2? by Anonymous Coward · · Score: 3, Interesting

    Yes, IIRC Windows NT uses only rings 0 and 3. However, the problem would not be made better by having more rings; the performance cost is in the transition between rings, and there is nothing special about the rings themselves. E.g. going from ring 10 to ring 9 would be as expensive as going from ring 0 to ring 1, or from ring 0 to ring 100.

  20. Re:BTRFS? REALLY? by Ragzouken · · Score: 5, Funny

    This is the internet, it's never too soon.

  21. Reiser has time and no need to work by r00t · · Score: 3, Funny

    They feed him. They put a roof over his head.
    They even bathe him.

    He might as well devote himself to filesystems.

  22. Re:when ext4 is feature complete it will be the #3 by Jah-Wren Ryel · · Score: 5, Interesting

    > The weakness with Linux is in the LVM or EVMS layer. They both suck in that they are not enterprise ready (i.e. multi-TB filesystems, 100+ MB/s sustained read/write): they cause unexplained IO hiccups, lockups and kernel panics. LVM/EVMS certainly work fine for Joe Blow's HTPC, or a paltry 100GB database, but they fall down under serious load.

    LVM has been rock-solid for me with one ~7TB and two 2TB ext3 filesystems (24 500GB disks) over the course of a year and a half. No problems migrating extents all over the place when I needed to swap disks in and out. Almost identical to HP-UX in functionality, but without the sizing constraints.

    But when I tried xfs for kicks, I found out that a 7TB filesystem means you need 7GB of RAM to fsck it, which is impossible on a 32-bit system. I also had a week where it all went in the shitter because I ran free space to zero and started getting OS panics and data corruption.
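
    The arithmetic behind "impossible on a 32-bit system", as a small Python sketch. The 1GB-of-RAM-per-TB ratio is just the figure reported in this comment, taken as an assumption rather than a documented xfs number; the only hard fact used is that a 32-bit pointer can address at most 2^32 bytes.

```python
GIB = 2**30
ADDRESS_SPACE_32BIT = 2**32      # hard ceiling for one 32-bit process


def fsck_ram_needed(fs_tb, gb_per_tb=1.0):
    """Bytes of RAM for a check, using the rule of thumb quoted above (an assumption)."""
    return fs_tb * gb_per_tb * 10**9


def fits_in_32bit(nbytes):
    return nbytes <= ADDRESS_SPACE_32BIT


need = fsck_ram_needed(7)
print(f"{need / GIB:.2f} GiB needed, "
      f"32-bit ceiling is {ADDRESS_SPACE_32BIT // GIB} GiB, "
      f"fits: {fits_in_32bit(need)}")
```

    In practice the usable share of a 32-bit address space is smaller than the full 4 GiB, since the kernel keeps part of it, so the gap is wider than this shows.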

    I'm definitely considering jfs for the next generation. My main complaints with ext3 have been ridiculously slow deletes and fscks, problems which I have read don't exist with jfs.

    --
    When information is power, privacy is freedom.
  23. Re:Back when there was only fat16, ntfs, ext2 used by vadim_t · · Score: 5, Informative

    I hope you're joking.

    ext2 is nice and simple, but it's neither fast nor reliable. It uses a linear search to find directory entries, which means it's very slow on large directories, like Maildir mailboxes. It doesn't do tail packing, which means it wastes space and is slower with small files. It's not reliable because without a journal it needs a fsck after a bad shutdown, which takes ages on a modern disk and recovers worse than a journal would.
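
    A small Python sketch of why the linear search hurts on big directories. The directory is modelled as a plain list of (name, inode) pairs and the indexed variant as a dict; both are stand-ins chosen for illustration, not the real on-disk structures, and the file names and inode numbers are made up.

```python
def linear_lookup(entries, name):
    """Scan (name, inode) pairs in order; returns (inode, comparisons made)."""
    for steps, (entry_name, inode) in enumerate(entries, start=1):
        if entry_name == name:
            return inode, steps
    return None, len(entries)


def build_index(entries):
    """Hash index over the same entries, standing in for an indexed directory."""
    return {name: inode for name, inode in entries}


# A Maildir-sized directory with hypothetical names and inode numbers.
entries = [(f"msg{i:06d}", 1000 + i) for i in range(100_000)]
index = build_index(entries)

inode, steps = linear_lookup(entries, "msg099999")
print(f"linear scan: inode {inode} after {steps} comparisons")
print(f"indexed:     inode {index['msg099999']} after one hash probe")
```

    The worst case for the scan grows with the number of entries, while the indexed lookup stays roughly constant, which is the gap a mail spool with hundreds of thousands of files runs into.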

    Just search for benchmarks, something like reiserfs beats ext2 by huge margins when it comes to important workloads such as a mail server.

    There are very good reasons why distributions generally go with ext3, or one of the other filesystems. I haven't seen ext2 as the default option for the root FS in a very long time.

  24. buttfsck!! by Zaiff Urgulbunger · · Score: 5, Funny

    You think that's bad? The file system check command is buttfsck!

  25. Re:Back when there was only fat16, ntfs, ext2 used by jez9999 · · Score: 4, Funny

    > Just search for benchmarks, something like reiserfs beats ext2 by huge margins when it comes to important workloads such as a mail server.

    Hell, it probably beats it to death.

  26. Re:Back when there was only fat16, ntfs, ext2 used by IceCreamGuy · · Score: 4, Insightful

    Yeah, I remember they used to talk about this in the Gentoo handbook; use ext2 for /boot, but ext3 for everything that you actually care about.

  27. Re:Back when there was only fat16, ntfs, ext2 used by Chemisor · · Score: 4, Interesting

    > Just search for benchmarks, something like reiserfs beats ext2 by huge margins

    You mean like these ones where ext2 beats reiserfs in most cases and is at least as fast in the others?

    > I hope you're joking. ext2 is nice and simple, but it's neither fast nor reliable.
    > It uses a linear search to find directory entries, which means it's very slow on
    > large directories, like Maildir mailboxes.

    Believe it or not, the world does not revolve around huge mail servers. Some of us actually run Linux on a desktop, and so don't really care about how well an fs handles a million maildir mailboxes. Latency is the most important criterion, and reiserfs is just too complicated to deliver it, as well as being a largely fringe fs. Especially now with Hans gone, it would become even more fringe.

    > It doesn't do tail packing which means it wastes space and is slower with small files.

    Yup, I'd like to have efficient small file handling. But really, it is better to avoid having many small files in the first place. Use compressed archives to store such things; it's quite a bit more efficient, and does not require exotic file systems which most normal people (i.e. your customers) will not use.

    > It's not reliable because without a journal it needs a fsck after a bad shutdown

    I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in some near future.

  28. Re:Back when there was only fat16, ntfs, ext2 used by MBGMorden · · Score: 4, Insightful

    > so I think that journalling will become obsolete in some near future.

    I bet in 1992 you were still thinking color TVs wouldn't last either . . .

    Look, a UPS is a great thing. I run one myself. Heck, with more and more people switching to laptops, a lot of people are running a "UPS" without even realizing it. The simple fact though is that modern processors and disks are so fast that the minimal speed impact of journaling is barely noticeable. It's certainly not worth giving up over some marginal speed gains.

    I mean we're talking about a world where people will give up tons of speed in their computer just to make the WINDOWS WOBBLE when you move them, or to make teddy bears wave at them from the system tray. Do you honestly believe that they're going to risk having their files corrupt on an unexpected power outage for a fraction of a percent increase in meaningful speed?

    --
    "People who think they know everything are very annoying to those of us who do."-Mark Twain
  29. Re:Back when there was only fat16, ntfs, ext2 used by vadim_t · · Score: 4, Insightful

    > You mean like these ones where ext2 beats reiserfs in most cases and is at least as fast in the others?

    Look at the bottom of the page. That's from 2003. Of kernel 2.6.0. A lot of code changed since then.

    > Believe it or not, the world does not revolve around huge mail servers. Some of us actually run Linux on a desktop, and so don't really care about how well an fs handles a million maildir mailboxes. Latency is the most important criterion, and reiserfs is just too complicated to deliver it, as well as being a largely fringe fs. Especially now with Hans gone, it would become even more fringe.

    I'm not sure what exactly you mean by this. Latency is mostly influenced by the hard disk. And on a desktop the disk shouldn't be a bottleneck anyway.

    > Yup, I'd like to have efficient small file handling. But really, it is better to avoid having many small files in the first place. Use compressed archives to store such things; it's quite a bit more efficient, and does not require exotic file systems which most normal people (i.e. your customers) will not use.

    Except there's lots and lots of those files in a modern Linux system. Config files, icon files, and small libraries for instance. Additionally many files are searched in different paths, making a fast directory search important.
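
    To put a number on the small-file point: without tail packing every file occupies whole blocks, so the unused end of each file's last block is lost. A minimal Python sketch, assuming a 4096-byte block size and a made-up list of file sizes (neither comes from this thread):

```python
def allocated_bytes(size, block_size=4096):
    """Space a file takes when it can only occupy whole blocks."""
    blocks = -(-size // block_size)     # ceiling division with integers
    return blocks * block_size


def slack(sizes, block_size=4096):
    """Bytes lost in partially filled last blocks across all files."""
    return sum(allocated_bytes(s, block_size) - s for s in sizes)


# Hypothetical sizes in bytes: a few config files, an icon, a small library.
sizes = [120, 900, 2300, 4100, 10_000]
wasted = slack(sizes)
print(f"{sum(sizes)} bytes of data, {wasted} bytes of slack")
```

    With sizes like these the slack is nearly as large as the data itself, and the ratio gets worse the smaller the typical file is relative to the block size.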

    > I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in some near future.

    Just as a RAID is not a backup, a UPS isn't a disk journal. One of these days you'll get a long outage, or the power cable will turn out to fit badly into the power supply, or you'll have a kernel panic, or the UPS won't switch to battery fast enough, etc. And then after several minutes of fsck something important might end up broken.
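
    For anyone wondering what the journal actually buys you, here is a toy write-ahead journal in Python. It is only a sketch of the idea (log the intent, apply it, then retire the log entry); `JournaledStore` is a made-up name and real journals such as ext3's JBD are far more involved.

```python
class JournaledStore:
    """Toy write-ahead journal: log the intent, apply it, then mark it done."""

    def __init__(self):
        self.journal = []   # stands in for the on-disk journal (survives a crash)
        self.data = {}      # stands in for the main filesystem area (survives a crash)

    def update(self, changes, crash_after=None):
        # 1. Record the whole transaction before touching the real data.
        txn = {"changes": dict(changes), "done": False}
        self.journal.append(txn)
        # 2. Apply the changes in place; a crash can interrupt this step.
        for n, (key, value) in enumerate(changes.items()):
            if crash_after is not None and n >= crash_after:
                raise RuntimeError("simulated power loss")
            self.data[key] = value
        # 3. Only now is the journal entry retired.
        txn["done"] = True

    def recover(self):
        # Replay unfinished transactions; the work is bounded by the journal
        # size, not by the size of the whole disk as with a full fsck.
        for txn in self.journal:
            if not txn["done"]:
                self.data.update(txn["changes"])
                txn["done"] = True


store = JournaledStore()
try:
    store.update({"inode_table": "v2", "bitmap": "v2"}, crash_after=1)
except RuntimeError:
    pass
print("after crash:  ", store.data)    # half-applied update
store.recover()
print("after recover:", store.data)    # consistent again
```

    The crash can come from a kernel panic or a lockup just as easily as from a power cut, which is why a UPS does not replace this.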

    If the journal causes you a noticeable slowdown you probably aren't a typical user. In typical usage the disk should be mostly idle after boot.

    I don't see a point in going forward insanely fast without brakes. I'll take the safety. I have a UPS on every computer, and still have a journalled FS, because there were times when the UPS was of no help. Like yesterday, when I upgraded my laptop's RAM, booted it, and found that with more than 2GB RAM, the BIOS maps the video RAM above 4GB. The video card showed its displeasure with that state of affairs by corrupting the display and locking up. Had no choice but to powercycle the box.

  30. Re:Back when there was only fat16, ntfs, ext2 used by illumin8 · · Score: 5, Insightful

    > I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in some near future.

    Yeah, because systems never kernel panic, or crash for any other reason than power outages... Wake me up after you've been waiting for fsck to finish on your 1TB drive and it's been running for the last 72 hours.

    Whether or not you've had a system shutdown uncleanly in the past, you certainly will at some time in the future, so why not just use ext3 and save yourself the headache of a 3 day long fsck?

    It's also painfully obvious that you've never worked as a sysadmin before. You try explaining to your manager that the reason why your company's server will take 3 days to come back online is that you wanted to save a few microseconds of latency when users were accessing files...

    --
    "When the president does it, that means it's not illegal." - Richard M. Nixon
  31. All hardware can fail, including UPSes. by Medievalist · · Score: 5, Insightful

    > I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in some near future.

    Our industrial UPS (which is orders of magnitude more reliable than any APC product ever made) recently exploded, burnt, and shorted out the entire building's power. It spiked thousands of volts through the protected equipment and destroyed a half-dozen servers. The fire was fierce enough to cause our FM-200 system (halon equivalent) to dump, which put out the fire before the main battery bank was breached.

    This was the first time I've ever seen a UPS bigger than a Chrysler fail, but I've seen dozens of failures from those crappy little APC units. At one time I had a stack of burnt-out ones in my basement (I used to salvage the batteries for cash).

    If your disaster survivability plan depends on any single piece of hardware never failing, it's no good. Offsite backup is your friend.

  32. Re:Ring 1 and 2? by DamnStupidElf · · Score: 3, Interesting

    Not exactly. To effectively change the actual permissions that the permission rings allow, stacks, segment registers, I/O permission bitmaps, and page tables (among other things) have to be changed. Generally this means reading values from memory into caches, which is slow. Probably the slowest of them all is the TLB, the cache of page-table entries. Invalidating the entire TLB is godawful slow, and is necessary if each separate user-space has a truly private address space and not simply a chunk out of the entire virtual address space. Even for operating systems that partition the virtual address space into regions for each user process, the local descriptor (or equivalent) table for segment access needs to be reloaded. This has to happen for every cross-privilege-level call. It is *much* faster to simply call another kernel mode function (push some stuff on the stack, change the instruction pointer, and you're done) without messing with caches.
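
    A crude way to see the gap from user space is to time a call that enters the kernel against a call that stays in the process. The Python sketch below assumes `os.getppid` ends up as a real system call on your platform, and interpreter overhead inflates both numbers, so treat the result as an indication only. It measures a user-to-kernel round trip, not a full address-space switch.

```python
import os
import timeit


def plain_call():
    """A call that never leaves the process."""
    return 0


def compare(n=200_000):
    """Return average seconds per call: (in-process, via the kernel)."""
    in_process = timeit.timeit(plain_call, number=n)
    via_kernel = timeit.timeit(os.getppid, number=n)
    return in_process / n, via_kernel / n


local, kernel = compare()
print(f"plain function call: {local * 1e9:8.1f} ns")
print(f"getppid() via kernel: {kernel * 1e9:8.1f} ns")
```

    The exact figures vary a lot with CPU, kernel version and mitigations, so no expected output is given here.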

    In fact, it would be even faster to not separate the kernel and user space processes at all, and instead use formal verification or a virtual machine (which really just means a smaller instruction set that's easier to verify) to prove that no user process could ever mess with the kernel or other processes. Virtual machines for languages are essentially at this stage today; they implement what would constitute a kernel as the run-time portions of the virtual machine, running the virtualized software in the same address space. There have been some attacks based on virtual machine weaknesses or memory corruption that break the protection model by changing data structures so that they violate the security model. This can happen in OSes that use hardware protection as well; there are just fewer places in memory where random changes can cause problems (just the page tables and other security paraphernalia), making it less likely.
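
    A minimal sketch of the "small instruction set that's easier to verify" idea, in Python. The four-instruction stack machine is invented for this example and has no jumps, which is what keeps the check a single pass. The verifier proves up front that a program cannot touch memory outside its sandbox or underflow its stack, so the interpreter can then run it with no per-instruction checks and no hardware protection.

```python
# (values popped, values pushed) for each instruction of the toy machine
EFFECTS = {"push": (0, 1), "load": (0, 1), "store": (1, 0), "add": (2, 1)}


def verify(program, mem_size):
    """True only if the program stays inside [0, mem_size) and never underflows."""
    depth = 0
    for op, *args in program:
        if op not in EFFECTS:
            return False
        if op == "add":
            if args:
                return False
        elif len(args) != 1 or not isinstance(args[0], int):
            return False
        if op in ("load", "store") and not 0 <= args[0] < mem_size:
            return False
        pops, pushes = EFFECTS[op]
        if depth < pops:
            return False
        depth += pushes - pops
    return True


def run(program, memory):
    """Execute an already verified program without any runtime checks."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "load":
            stack.append(memory[args[0]])
        elif op == "store":
            memory[args[0]] = stack.pop()
        elif op == "add":
            stack.append(stack.pop() + stack.pop())
    return memory


good = [("push", 2), ("push", 40), ("add",), ("store", 0)]
bad = [("push", 1), ("store", 7)]       # address 7 is outside a 4-cell sandbox

for prog in (good, bad):
    if verify(prog, mem_size=4):
        print("accepted:", run(prog, [0] * 4))
    else:
        print("rejected before running")
```

    The attacks mentioned above correspond to bugs in `verify` or to something corrupting `memory` behind the interpreter's back; the scheme is only as strong as the verifier.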