Slashdot Mirror


Kernel Hackers On Ext3/4 After 2.6.29 Release

microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"

16 of 316 comments

  1. OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · · Score: 5, Insightful

    Quote from Linus:

    "...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."

    In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.

    How about ASKING them rather than calling them morons?

    (note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)

    TDz.

    1. Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · · Score: 5, Insightful

      Torvalds knows exactly who it is, and most people following the discussion will probably know it, too.
      Also, there has been a fairly public discussion including a statement by the responsible person in question.

      Not saying the name is Torvalds's attempt at saving grace. It is similar to a parent of two children saying, "I don't know who made the mess, but when I come back, it better be cleaned up."

      Yes, Mr. Torvalds is fairly outspoken.

    2. Re:OK, then... *WHO* is the official ext3 "moron"? by houghi · · Score: 5, Insightful

      Knowing the humor that Linus has, it could be himself.

      --
      Don't fight for your country, if your country does not fight for you.
    3. Re:OK, then... *WHO* is the official ext3 "moron"? by SpinyNorman · · Score: 4, Insightful

      fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.

      I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.

      IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
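
To make the fflush()/fsync() distinction in the comment above concrete, here is a minimal Python sketch. The function names write_buffered and write_durable are hypothetical, chosen only for illustration. Python's file.flush() plays the role of C's fflush(), and os.fsync() wraps the fsync() system call.

```python
import os
import tempfile


def write_buffered(path, data):
    """Hand the data to the kernel, with no guarantee it has reached the disk."""
    with open(path, "wb") as f:
        f.write(data)
        # Library buffer -> kernel page cache (the fflush() step).
        # Other processes can now read the data, but a crash may still lose it.
        f.flush()


def write_durable(path, data):
    """Hand the data to the kernel, then ask the kernel to push it to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        # Kernel page cache -> storage device (the fsync() step).
        # This call can block for a long time on a busy filesystem.
        os.fsync(f.fileno())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_buffered(os.path.join(d, "scratch.log"), b"nice to have\n")
        write_durable(os.path.join(d, "ledger.dat"), b"must survive a crash\n")
```

The first function is enough when the goal is "do not let this sit in the library buffer"; the second is for the cases where the program needs to know the data actually reached storage before it continues.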

    4. Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · · Score: 3, Insightful

      fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.

      The two issues are very closely related, not "an entirely different issue". What the apps want is not "put this data on the disk, NOW", but "put this data on the disk sometime, but do NOT kill the old data until that is done".

      Applications don't want to be sure that the new version is on disk. They want to be sure that SOME version is on disk after a crash. This is exactly what some people can't seem to understand.

      fsync() ensures the first at a huge performance cost. rename() + ext3 ordered gives you the latter. The problem is that ext4 breaks this BECAUSE of the journal ordering. The "consistent state" is broken for application data.

      I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.

      Yes. But they are assuming this exists and the API is called rename() :)
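
The "some version is always on disk" idiom discussed in this sub-thread is usually written as: write a temporary file, then rename it over the old one. A minimal Python sketch follows. The helper name atomic_write and its durable flag are hypothetical. Whether the fsync() step is required for crash safety on a given filesystem and mount mode is exactly what the thread is arguing about, so treat the flag as an assumption rather than a recommendation.

```python
import os
import tempfile


def atomic_write(path, data, durable=True):
    """Replace the contents of path so readers see the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    # Note: mkstemp creates the file with owner-only permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                # Force the new data to storage before the rename publishes it.
                os.fsync(tmp.fileno())
        # os.replace() maps to rename() on POSIX: the name switches from the
        # old file to the new one in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "settings.conf")
        atomic_write(target, b"version 1\n")
        atomic_write(target, b"version 2\n")
```

The old file is never truncated, so a reader of the running system sees either the complete old contents or the complete new contents.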

  2. I would go further than Linus on this one... by pla · · Score: 4, Insightful

    FTA: "if you write your data _first_, you're never going to see corruption at all"

    Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.

    Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!

    Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.

    1. Re:I would go further than Linus on this one... by Anonymous Coward · · Score: 4, Insightful

      Yes! This is the whole point. I am not a filesystem guy either. I don't even know that much about filesystems. But imagine you write a program with some common data storage. Imagine part of that common data is a pointer to some kind of matrix or whatever. Does anybody think it is a good idea to set that pointer first, and then initialize the data later?

      Sure, a really robust program should be able to somehow recover from corrupt data. But that doesn't mean you can just switch your brain off when writing the data.

    2. Re:I would go further than Linus on this one... by morgan_greywolf · · Score: 3, Insightful

      Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.

      It's common sense! Duh. Write data first, pointers to data second. If the system goes down, you're far less likely to lose anything. That's obvious. Those who think this is somehow not obvious don't have the right mentality to be writing kernel code.

      I think the problem is Ted Ts'o has had a slight 'works for me' attitude about it:

      All I can tell you is that *I* don't run into them, even when I was
      using ext3 and before I got an SSD in my laptop. I don't understand
      why; maybe because I don't get really nice toys like systems with
      32G's of memory. Or maybe it's because I don't use icecream (whatever
      that is).

    3. Re:I would go further than Linus on this one... by Hatta · · Score: 3, Insightful

      In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.

      i.e. You truncated a file to 0 bytes, and wrote the data.

      Why on earth would you do that? Write the new data, update the metadata, THEN remove the old file.

      --
      Give me Classic Slashdot or give me death!
    4. Re:I would go further than Linus on this one... by Cassini2 · · Score: 4, Insightful

      When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.

      Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
      - Every UNIX machine has a UPS.
      - Production servers run UNIX. What's this Linux you are talking about?
      - Disk space is expensive. No one will pay for unused disk space.
      - RAM is expensive. As such, it can be quickly flushed to disk.
      - No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
      Times have changed: Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the market place. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.

      Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers, that don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.

      Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.
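
A rough back-of-the-envelope sketch of the two scale arguments in the comment above. It assumes the "1 part in 1 quadrillion" figure is a per-bit error probability with independent errors, and it assumes a sustained copy rate of 100 MB/s. Neither assumption comes from the comment, and the function names are hypothetical.

```python
import math


def expected_bit_errors(num_bytes, bit_error_rate=1e-15):
    """Expected number of flipped bits when num_bytes are read once."""
    return num_bytes * 8 * bit_error_rate


def prob_at_least_one_error(num_bytes, bit_error_rate=1e-15):
    """Probability of one or more bit errors, assuming independent bits."""
    bits = num_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


def hours_to_copy(num_bytes, bytes_per_second=100e6):
    """Time to stream num_bytes at a sustained (assumed) throughput."""
    return num_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    one_tb = 10**12
    print("expected bit errors per full 1 TB read:", expected_bit_errors(one_tb))
    print("chance of at least one error:", prob_at_least_one_error(one_tb))
    print("hours to copy 1 TB:", hours_to_copy(one_tb))
```

Under these assumptions a single full read of a 1 TB disk expects about 0.008 bit errors, which stops being negligible once many disks are read many times, and copying that disk takes a few hours even when nothing else is using it.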

  3. Re:lkml.org server is slashdotted. by Anonymous Coward · · Score: 5, Insightful

    Well this is just my meta comment. I'll be writing my real comment later...

    You forgot to include a link to the comment you'll be writing later.

  4. Re:Safest mkfs/mount options? by Blackknight · · Score: 3, Insightful

    Solaris 10 with ZFS, if you actually care about your data.

  5. Um. This doesn't make sense. by Colin+Smith · · Score: 4, Insightful

    Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.

    from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html

    "mount -o data=ordered"
                    Only journals metadata changes, but data updates are flushed to
                    disk before any transactions commit. Data writes are not atomic
                    but this mode still guarantees that after a crash, files will
                    never contain stale data blocks from old files.

    "mount -o data=writeback"
                    Only journals metadata changes, and data updates are entirely
                    left to the normal "sync" process. After a crash, files
                    may contain stale data blocks from old files: this mode is
                    exactly equivalent to running ext2 with a very fast fsck on reboot.

    So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
     

    --
    Deleted
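
To check which of the two data= modes quoted above a mounted filesystem reports, a small Python sketch can parse text in /proc/mounts format (device, mount point, type, options). The function names and the sample text are hypothetical. Whether a default mode is listed at all depends on the kernel version, so a missing data= option does not by itself identify the mode in use.

```python
def journal_data_mode(mount_options):
    """Return the value of the data= option in a comma-separated option string."""
    for option in mount_options.split(","):
        key, sep, value = option.strip().partition("=")
        if key == "data" and sep:
            return value
    return None


def ext_data_modes(mounts_text):
    """Map mount point -> reported data mode for ext3/ext4 entries."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("ext3", "ext4"):
            modes[mount_point] = journal_data_mode(options) or "not listed"
    return modes


if __name__ == "__main__":
    # Made-up sample; on a real system read the text from /proc/mounts.
    sample = (
        "/dev/sda1 / ext3 rw,errors=remount-ro,data=ordered 0 0\n"
        "/dev/sda2 /scratch ext3 rw,data=writeback 0 0\n"
        "proc /proc proc rw 0 0\n"
    )
    for mount_point, mode in ext_data_modes(sample).items():
        print(mount_point, mode)
```
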
  6. Re:lkml.org server is slashdotted. by linuxrocks123 · · Score: 5, Insightful

    Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.

    --
    vi ~/.emacs # I'm probably going to Hell for this.
  7. Re:lkml.org server is slashdotted. by mmontour · · Score: 3, Insightful

    Some of us have discovered the 'shutdown' command. [...]Anyhow, I suggest you use it occasionally. Then perhaps you can only fsck when something bad has happened.

    Don't be too smug - a "shutdown" doesn't always guarantee a clean startup. I remember a bug (hopefully fixed now) where "shutdown" was completing so quickly that it powered off the computer while data was still sitting in the hard drive's volatile write cache. Even though the OS had unmounted the filesystem, the on-disk blocks were still dirty.

    p.s. If any OS/kernel developers are listening - how about implementing a standard API through which drive write-caches can be flushed+disabled whenever a system starts a shutdown procedure, gets a signal that the UPS is running on battery power, or otherwise concludes that it is in a state where a temporarily-increased risk of data loss justifies slowing down I/O?

  8. Re:ZFS by Mr.Ned · · Score: 3, Insightful

    FreeBSD has ZFS. My understanding is that while ZFS is a good filesystem, it isn't without issues. It doesn't work well on 32-bit architectures because of the memory requirements, isn't reliable enough to host a swap partition, and can't be used as a boot partition when part of a pool. Here's FreeBSD's rundown of known problems: http://wiki.freebsd.org/ZFSKnownProblems.

    On the other hand, the new filesystems in the Linux kernel - ext4 and btrfs - are taking the lessons learned from ZFS. I'm excited about next-generation filesystems, and I don't think ZFS is the only way to go.