Slashdot Mirror


Ext4 Data Losses Explained, Worked Around

ddfall writes "H-Online has a follow-up on the Ext4 file system — Last week's news about data loss with the Linux Ext4 file system is explained and new solutions have been provided by Ted Ts'o to allow Ext4 to behave more like Ext3."

15 of 421 comments (clear)

  1. LOL: Bug Report by Em+Emalb · · Score: 5, Funny

    User: My data, it's gone!
    EXT4:"Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations."

    Solution: WORKS AS DESIGNED

    --
    Sent from your iPad.
  2. Those who fail to learn the lessons of history.... by morgan_greywolf · · Score: 5, Insightful

    FTFA, this is the problem:

    Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

    And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? Before you get to write any filesystem code, you should have to study how other people have done it, including all the change history. Seriously.

    Those who fail to learn the lessons of [change] history are doomed to repeat it.

  3. rename completes before the write by Spazmania · · Score: 5, Insightful

    Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

    I couldn't disagree more:

    When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename(). [...] Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until [up to 60 seconds later].

    Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write. It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
  4. Show some respect! by LotsOfPhil · · Score: 5, Funny

    ...new solutions have been provided by Ted Ts'o to...

    That's General Ts'o to you!

    --
    This post climbed Mt. Washington.
  5. Re:the workaround is bad design by jd · · Score: 5, Funny

    But... those of us who learned the Ancient And Most Wise ways always triple-sync. We also sacrifice Peeps and use red food colouring in voodoo ceremonies (hey, it really is blood, so it should work) to keep the hardware from failing.

    On next week's Slashdot, there will be a brief tutorial on the right way to burn a Windows CD at the stake, and how to align the standing stones known as RAM Chips to points of astronomical significance.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  6. Workaround is disaster for laptops by victim · · Score: 5, Insightful

    The workaround (flushing everything to disk before the rename) is a disaster for laptops or anything else which might wish to spin down a disk drive.

    The write-replace idiom is used when a program is updating a file and can tolerate the update being lost in a crash, but wants either the old or the new to be intact and uncorrupted. The proposed sync solution accomplishes this, but at the cost of spinning up the drive and writing the blocks at each write-replace. How often does your browser update a file while you surf? Every cache entry? Every history entry? What about your music player? Desktop manager? All of these will be spin up your disk drive.

    Hiding behind POSIX is not the solution. There needs to be a solution that supports write-replace without spinning up the disk drive.

    The ext4 people have kindly illuminated the problem. Now it is time to define a solution. Maybe it will be some sort of barrier logic, maybe a new kind of sync syscall. But it needs to be done.

    1. Re:Workaround is disaster for laptops by Kjella · · Score: 5, Informative

      Fixed code:
      fwrite()
      fsync() - sync this file before close
      fclose()
      rename()

      Either you're a troll or an idiot, since you're AC'ing I guess I got trolled. This will sync immidiately and kill performance and battery life, since every block must be confirmed written before the process can continue. What you need to fix this is a delayed rename that happens after the delayed write.

      Problem:
      fwrite()
      fclose()
      rename()
      *ACTUAL RENAME*
      *TIME PASSES* <-- crash happens here = lose old file
      *ACTUAL WRITE*

      Real solution:
      fwrite()
      fclose()
      rename()
      *TIME PASSES* <-- crash happens here = keep old file
      *ACTUAL WRITE*
      *ACTUAL RENAME*

      --
      Live today, because you never know what tomorrow brings
  7. Re:LOL: Bug Report by Anonymous Coward · · Score: 5, Insightful

    Rubbish. Sorry, if the syncs were implicit, app developers would just be demanding a way to to turn them off most of the time because they were killing performance.

  8. Quick workaround - no patches required by canadiangoose · · Score: 5, Informative

    If you mount your ext4 partitions with nodelalloc you should be fine. You will of course no longer benefit from the performance enhancements that delayed allocation bring, but at least you'll have all of your freaking data. I'm running Debian on Linux 2.6.29-rc8-git4, and so far my limited testing has shown this to be very effective.

    --
    Never eat more than you can lift -- Miss Piggy
  9. No kidding by Sycraft-fu · · Score: 5, Insightful

    All the stuff with Ext4 strikes me as amazingly arrogant, and ignorant of the past. The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing. In the case of a file system, that means that it reliably stores data on the drive. So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec" what matters is that it isn't usable.

    I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

    I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not. Linux/Ext3 doesn't, Windows/NTFS doesn't, OS-X/HFS+ doesn't, Solaris/ZFS doesn't, etc. Well that tells me something. That says that the way they are doing things isn't a good idea. If it is causing problems AND it is something else nobody else does, then probably you ought not do it.

    This is just bad design, in my opinion.

  10. Re:LOL: Bug Report by try_anything · · Score: 5, Insightful

    This is the problem with new features - the users have problems using them until they fully understands and appreciates the advantages and disadvantages.

    Advantages: Filesystem benchmarks improve. Real performance... I guess that improves, too. Does anybody know?

    Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

    Ext4 might be great for servers (where crucial data is stored in databases, which are presumably written by storage experts who read the Posix spec), but what is the rationale for using it on the desktop? Ext4 has been coming for years, and everyone assumed it was the natural successor to ext3 for *all* contexts where ext3 is used, including desktops. I hope distros don't start using or recommending ext4 by default until they figure out how to configure it for safe usage on the desktop. (That will happen long before the apps are rewritten.) Filesystem benchmarks be damned.

  11. Re:LOL: Bug Report by swillden · · Score: 5, Interesting

    The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

    They don't. Applications just need to concern themselves with the details of of the APIs they use, and the guarantees those APIs do or don't provide.

    The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  12. A bad design that it is used everywhere by diegocgteleline.es · · Score: 5, Informative

    "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

    It turns out that all the modern operative systems work exactly like that. In ALL of them you need to use explicit syncronization (fsync and friends) to get a notification that your data has really been written to disk (and that's all what you get, a notification, because the system could oops before fsync finishes). You also can mount your filesystem as "sync", which sucks.

    Journaling, COW/transaction-based filesystems like ZFS only guarantee the integrity, not that your data is safe. It turns out that Ext3 has the same problem, it's just that the window is smaller (5 seconds). And I wouldn't bet that HFS and ZFS have not the same problem (btrfs is COW and transaction based, like ZFS, and has the same problem).

    Welcome to the real world...

    1. Re:A bad design that it is used everywhere by Tacvek · · Score: 5, Informative

      The Ext3 5 seconds thing is true, but that is not the important difference.

      On Ext3, with the default mount options, if one writes a file to disk, and then renames the file the write is guarantee to come before the rename. This can be used to ensure atomic updates to files, by writing a temporary copy of the file with the desired changes, and then renaming the file.

      On Ext4, if one writes a file to the disk, and then renames the file, the rename can happen first. The result of this is that it is not possible to ensure atomic updates to files unless one uses fsync between the writing and the renaming. However, that would hurt performance, since fsync will force the file to be committed to disk right now, when all that is really important is that it is committed to disk before the rename is.

      Thankfully the Ext4 module will be gaining a new mount option that will ensure that a file is written to disk before the renaming occurs. This mount option should have no real impact on performance, but will ensure the atomic update idiom that works on Ext3 will also work on Ext4.

      --
      Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
  13. Re:LOL: Bug Report by blazerw · · Score: 5, Insightful

    1) Modern filesystems are expected behave better than POSIX demands.

    2) POSIX does not cover what should happen in a system crash at all.

    3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

    4) fsync is not a solution, because ir forces the operation to complete *now*, which is counterproductive to write performance, cache coherence, laptop battery life, excessive SSD wear and a bunch of other reasons.

    We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

    1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.
    2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.
    3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.
    4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.

    We don't need a filesystem that sledgehammers each and every byte of data to the hard drive just in case there is a crash. What we DO need is a filesystem that can flexibly handle important data when told it is important, and less important data very efficiently.

    What you are asking is that the filesystem be some kind of sentient all knowing being that can tell when data is important or not and then can write important data immediately and non-important data efficiently. I think that it is a little better to have the application be the one that knows when it's dealing with important data or not.