Slashdot Mirror


Kernel Hackers On Ext3/4 After 2.6.29 Release

microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"

10 of 316 comments

  1. OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · · Score: 5, Insightful

    Quote from Linus:

    "...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."

    In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.

      How about ASKING them rather than calling them morons?

    (note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)

    TDz.

    1. Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · · Score: 5, Insightful

      Torvalds knows exactly who it is and most people following the discussion will probably know it, too.
      Also, there has been a fairly public discussion including a statement by the responsible person in question.

      Not saying the name is Torvalds' attempt at a saving grace. It is similar to a parent of two children saying, "I don't know who made the mess, but when I come back, it had better be cleaned up."

      Yes, Mr. Torvalds is fairly outspoken.

    2. Re:OK, then... *WHO* is the official ext3 "moron"? by houghi · · Score: 5, Insightful

      Knowing the humor that Linus has, it could be himself.

      --
      Don't fight for your country, if your country does not fight for you.
    3. Re:OK, then... *WHO* is the official ext3 "moron"? by SpinyNorman · · Score: 4, Insightful

      fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.

      I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.
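      To make the distinction concrete, here is a minimal sketch in Python rather than C. It assumes that Python's file.flush() plays the role of fflush() and that os.fsync() wraps the fsync() system call; the helper names and file names are made up for illustration.

```python
import os
import tempfile


def write_buffered(path, data):
    """Write and flush only: the data leaves the library buffer for the kernel."""
    with open(path, "wb") as f:
        f.write(data)
        # Analogous to C's fflush(): user-space buffer -> kernel page cache.
        # Other processes can now read the data, but it may not be on disk yet.
        f.flush()


def write_durable(path, data):
    """Write, flush, then fsync: ask the kernel to push the data to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        # Blocks until the kernel reports the file's data as written out.
        os.fsync(f.fileno())


# Hypothetical demo files in a throwaway directory.
demo_dir = tempfile.mkdtemp()
buffered_path = os.path.join(demo_dir, "buffered.log")
durable_path = os.path.join(demo_dir, "durable.log")
write_buffered(buffered_path, b"visible to other processes\n")
write_durable(durable_path, b"requested to be on disk\n")
```

      Both files read back the same while the machine stays up; the difference only matters if it goes down before the kernel writes back the first one on its own schedule.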

      IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.

  2. I would go further than Linus on this one... by pla · · Score: 4, Insightful

    FTA: "if you write your data _first_, you're never going to see corruption at all"

    Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.

    Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!

    Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.

    1. Re:I would go further than Linus on this one... by Anonymous Coward · · Score: 4, Insightful

      Yes! This is the whole point. I am not a filesystem guy either. I don't even know that much about filesystems. But imagine you write a program with some common data storage. Imagine part of that common data is a pointer to some kind of matrix or whatever. Does anybody think it is a good idea to set that pointer first, and then initialize the data later?

      Sure, a really robust program should be able to somehow recover from corrupt data. But that doesn't mean you can just switch your brain off when writing the data.
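      The same "data first, pointer second" idea can be applied from user space when replacing a file. The following is only a sketch of the common write-to-temp-then-rename pattern, not anything taken from the kernel discussion; replace_atomically and the file names are hypothetical. The new contents are written and fsync'ed under a temporary name, and only then is the real name (the "pointer") switched over with a rename. It does not fsync the containing directory, which a stricter version might also do.

```python
import os
import tempfile


def replace_atomically(path, data):
    """Write the data first, then publish it under its final name."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data is requested to be on disk first
        # Only now does the name point at the new contents.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Hypothetical usage: the second call replaces the first version.
demo_dir = tempfile.mkdtemp()
config_path = os.path.join(demo_dir, "settings.conf")
replace_atomically(config_path, b"version=1\n")
replace_atomically(config_path, b"version=2\n")
```

      A reader of settings.conf should see either the old contents or the new ones, never a half-written file, because the name is only switched after the data has been written.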

    2. Re:I would go further than Linus on this one... by Cassini2 · · Score: 4, Insightful

      When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.

      Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
      - Every UNIX machine has a UPS.
      - Production servers run UNIX. What's this Linux you are talking about?
      - Disk space is expensive. No one will pay for unused disk space.
      - RAM is expensive. As such, it can be quickly flushed to disk.
      - No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
      Times have changed. Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the marketplace. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.

      Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers, which don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.

      Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.

  3. Re:lkml.org server is slashdotted. by Anonymous Coward · · Score: 5, Insightful

    Well this is just my meta comment. I'll be writing my real comment later...

    You forgot to include a link to the comment you'll be writing later.

  4. Um. This doesn't make sense. by Colin+Smith · · Score: 4, Insightful

    Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.

    from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html

    "mount -o data=ordered"
                    Only journals metadata changes, but data updates are flushed to
                    disk before any transactions commit. Data writes are not atomic
                    but this mode still guarantees that after a crash, files will
                    never contain stale data blocks from old files.

    "mount -o data=writeback"
                    Only journals metadata changes, and data updates are entirely
                    left to the normal "sync" process. After a crash, files
                    may contain stale data blocks from old files: this mode is
                    exactly equivalent to running ext2 with a very fast fsck on reboot.

    So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
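    If you want to check which of these modes a mount was given, one rough approach is to look for a data= entry in the options column of a /proc/mounts-style line. The snippet below is only a sketch: journal_data_mode is a made-up helper, the sample lines are invented, and whether the kernel lists data= at all when the default is in effect is an assumption you should verify on your own system.

```python
def journal_data_mode(mounts_line):
    """Return the value of the data= mount option, or None if it is absent."""
    # /proc/mounts fields: device, mount point, fs type, options, dump, pass.
    fields = mounts_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least four whitespace-separated fields")
    for option in fields[3].split(","):
        if option.startswith("data="):
            return option[len("data="):]
    return None


# Invented sample lines, not output from a real machine.
sample_lines = [
    "/dev/sda1 / ext3 rw,relatime,errors=remount-ro,data=ordered 0 0",
    "/dev/sdb1 /scratch ext3 rw,noatime,data=writeback 0 0",
    "/dev/sdc1 /boot ext2 rw 0 0",
]
modes = [journal_data_mode(line) for line in sample_lines]
```

    A result of None only means no explicit data= option was listed; it says nothing about which mode the filesystem is actually running in.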
     

    --
    Deleted
  5. Re:lkml.org server is slashdotted. by linuxrocks123 · · Score: 5, Insightful

    Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem state somewhere between the last synchronization call that returned and the state at the time of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.

    --
    vi ~/.emacs # I'm probably going to Hell for this.