Ext4 Data Losses Explained, Worked Around

Quick workaround - no patches required by canadiangoose · 2009-03-19 06:32 · Score: 5, Informative

If you mount your ext4 partitions with nodelalloc you should be fine. You will of course no longer benefit from the performance enhancements that delayed allocation brings, but at least you'll have all of your freaking data. I'm running Debian on Linux 2.6.29-rc8-git4, and so far my limited testing has shown this to be very effective.

--
Never eat more than you can lift -- Miss Piggy
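
A minimal sketch for checking whether a filesystem is already mounted with the option, assuming the kernel lists nodelalloc among the mount options in /proc/mounts-style text (device, mount point, type, options, then two numbers per line). The function name has_mount_option and the sample lines are hypothetical, not taken from the comment above.

```python
def has_mount_option(mounts_text, mount_point, option):
    """Return True if `option` is in the option list of `mount_point`.

    `mounts_text` is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[1] == mount_point:
            # options are a comma-separated list in the fourth field
            return option in fields[3].split(",")
    return False


# Hypothetical sample; on a real system the text would come from /proc/mounts.
SAMPLE = (
    "/dev/sda1 / ext4 rw,relatime,nodelalloc 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime 0 0\n"
)
```

To make the option stick across reboots, it would go in the options field of the filesystem's /etc/fstab entry.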

Re:Workaround is disaster for laptops by Kjella · 2009-03-19 06:48 · Score: 5, Informative

Fixed code:
fwrite()
fsync() - sync this file before close
fclose()
rename()

Either you're a troll or an idiot; since you're AC'ing I guess I got trolled. This will sync immediately and kill performance and battery life, since every block must be confirmed written before the process can continue. What you need to fix this is a delayed rename that happens after the delayed write.

Problem:
fwrite()
fclose()
rename()
*ACTUAL RENAME*
*TIME PASSES* <-- crash happens here = lose old file
*ACTUAL WRITE*

Real solution:
fwrite()
fclose()
rename()
*TIME PASSES* <-- crash happens here = keep old file
*ACTUAL WRITE*
*ACTUAL RENAME*

--
Live today, because you never know what tomorrow brings
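
For reference, the "fixed code" sequence that Kjella's comment quotes (write, fsync, close, rename) might look like this in Python. This is an illustrative sketch, not code from any of the applications under discussion; the name replace_file_durably and the ".tmp" suffix are assumptions.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes) via write, fsync, close, rename."""
    tmp_path = path + ".tmp"  # hypothetical temp name, same directory as path
    with open(tmp_path, "wb") as f:
        f.write(data)         # fwrite(): data sits in a user-space buffer
        f.flush()             # hand the buffer to the kernel page cache
        os.fsync(f.fileno())  # fsync(): block until the kernel reports it written
    # fclose() happened when the with-block ended
    os.replace(tmp_path, path)  # rename(): swap the new file into place
```

The os.fsync() call is exactly the step the comment objects to: the process waits for the disk. The "real solution" ordering cannot be expressed from application code with these calls alone; it needs the filesystem to hold the rename back until the data has been written.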

Re:Workaround is disaster for laptops by dshadowwolf · 2009-03-19 07:11 · Score: 3, Informative

And you don't get it... The truth is that Ext4 was writing the journal out before any changes took place. This means that when the crash happens between the metadata write and the actual write a replay of the journal will cause data loss.

Other filesystems with delayed allocation solve this by not writing the journal before the actual data commits happen. The fix that TFA is talking about introduces this to Ext4.

Re:Workaround is disaster for laptops by david_thornley · 2009-03-19 07:15 · Score: 3, Informative

In which case the standard sucks, big time, and finding a loophole that trashes normal expected behavior should not be cause for rejoicing.

There needs to be a way to write a file such that either the old or the new is preserved. Agreed on this?

Now, in a file system that's going to run real well, there needs to be a way to delay writes in order to batch them. Agreed on this?

We have two reasonable demands here. Pick one, because that's all you're going to get.

Currently, in order to keep either the old or new file, it's necessary to write the new file right now. This is the standard behavior, and it trashes performance. Alternatively, the writes can be batched up for later, for good performance, and we run the risk of losing both old and new versions of a file.

In other words, in order to optimize the heck out of the file system, it's necessary to trash the performance.

What we need is a way to do the rewrite-rename thing in a way so it can be safely delayed, so the file system can batch up a lot of writes to do in a really fancy optimized way, but writing the new file fully before renaming it. There's no obvious reason to me why the file system can't keep track of this and guarantee the order. It may not be required by the standard, but that's no excuse for not implementing it.

--
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes

A bad design that it is used everywhere by diegocgteleline.es · 2009-03-19 07:46 · Score: 5, Informative

"No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

It turns out that all the modern operating systems work exactly like that. In ALL of them you need to use explicit synchronization (fsync and friends) to get a notification that your data has really been written to disk (and that's all you get, a notification, because the system could oops before fsync finishes). You can also mount your filesystem as "sync", which sucks.

Journaling and COW/transaction-based filesystems like ZFS only guarantee the filesystem's integrity, not that your data is safe. It turns out that Ext3 has the same problem, it's just that the window is smaller (5 seconds). And I wouldn't bet that HFS and ZFS don't have the same problem (btrfs is COW and transaction based, like ZFS, and has the same problem).

Welcome to the real world...

Re:A bad design that it is used everywhere by Tacvek · 2009-03-19 08:52 · Score: 5, Informative

The Ext3 5 seconds thing is true, but that is not the important difference.
On Ext3, with the default mount options, if one writes a file to disk, and then renames the file, the write is guaranteed to come before the rename. This can be used to ensure atomic updates to files, by writing a temporary copy of the file with the desired changes, and then renaming the file.
On Ext4, if one writes a file to the disk, and then renames the file, the rename can happen first. The result of this is that it is not possible to ensure atomic updates to files unless one uses fsync between the writing and the renaming. However, that would hurt performance, since fsync will force the file to be committed to disk right now, when all that is really important is that it is committed to disk before the rename is.
Thankfully the Ext4 module will be gaining a new mount option that will ensure that a file is written to disk before the renaming occurs. This mount option should have no real impact on performance, but will ensure the atomic update idiom that works on Ext3 will also work on Ext4.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524

Re:No kidding by Tacvek · 2009-03-19 08:40 · Score: 4, Informative

I don't think you have it right.

On Ext3 with "data=ordered" (a default mount option), if one writes the file to disk, and then renames the file, ext3 will not allow the rename to take place until after the file has been written to disk.

Therefore if an application that wants to change a file uses the common pattern of writing to a temporary file and then renaming (the renaming is atomic on journaling file systems), if the system crashes at any point, when it reboots the file is guaranteed to be either the old version or the new version.

With Ext4, if you write a file and then rename it, the rename can happen before the write. Thus if the computer crashes between the rename and the write, on reboot the result will be a zero byte file.

The fact that the new version of the file may be lost is not the issue. The issue is that both versions of the file may be lost.

The end result is the write and rename method of ensuring atomic updates to files does not work under Ext4.

A new mount option that forces the rename to come after the data is written to disk is being added. Once that is available, the problem will be gone if you use that mount option. Hopefully it will be made a default mount option.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
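
The ordering difference Tacvek describes can be made concrete with a toy model. This is a deliberately simplified sketch, not how ext3 or ext4 are implemented: it assumes the empty temporary file already exists on disk, and that exactly two operations are still pending, the data write and the rename. All names here (state_after_crash, ORDERED, REORDERED) are hypothetical.

```python
ORDERED = ["write_data", "rename"]    # data reaches disk before the rename
REORDERED = ["rename", "write_data"]  # rename is committed first, data later


def state_after_crash(schedule, crash_after):
    """Return what "config" contains if the machine crashes after
    `crash_after` operations from `schedule` have reached the disk."""
    inodes = {1: "old", 2: ""}  # inode 2: new temp file, no data on disk yet
    names = {"config": 1, "config.tmp": 2}
    for op in schedule[:crash_after]:
        if op == "write_data":
            inodes[2] = "new"  # delayed data finally hits the disk
        elif op == "rename":
            names["config"] = names.pop("config.tmp")  # config -> inode 2
    return inodes[names["config"]]


for label, schedule in (("ordered", ORDERED), ("reordered", REORDERED)):
    outcomes = [state_after_crash(schedule, n) for n in range(3)]
    print(label, outcomes)
```

In the ordered schedule every crash point leaves either the old or the new contents. In the reordered schedule, a crash after the rename but before the data write leaves "config" pointing at the empty inode, which is the zero-byte file described above.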

Re:LOL: Bug Report by zenyu · 2009-03-19 08:47 · Score: 4, Informative

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

Yup, and the problem has existed with KDE startup for years. I remember the startup files getting trashed when Mandrake first came out and I tried KDE for long enough to get hooked, and it's happened to me a few times a year ever since with every filesystem I've used. I just make my own backups of the .kde directory and fix this manually when it happens. I'm pretty good at this restore by now. Hopefully this bug in KDE will get fixed now that it is causing the KDE project such great embarrassment. I had a silent wish Tso would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.

PS I use a lot of experimental graphics drivers for work, hence lockups during startup are common enough that I probably see this KDE bug more than most KDE users. But they really violate every rule of using config files: 1st. open with minimum permission needed, in this case read only, unless a write is absolutely necessary. 2nd. only update a file when it needs updating. 3rd. when updating a config file make a copy, commit it to disk, and then replace the original, making sure file permissions and ownership are unchanged, then commit the rename if necessary.

PS2 Those computer users saying an fsync will kill performance need to get a cluebat applied to them by the nearest programmer. 1st. There will be no fsyncs of config files at startup once the KDE startup is fixed. 2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be unnoticeable when you apply a settings change. 3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.
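
zenyu's three config-file rules could look roughly like this in Python. It is a sketch under assumptions, not KDE's code: the function name update_config and the ".new" suffix are hypothetical, the file is assumed to exist already, and the directory fsync used to "commit the rename" assumes a POSIX system.

```python
import os


def update_config(path, new_text):
    """Rewrite `path` only if its contents change; return True if rewritten."""
    # Rule 1: open read-only to inspect the current contents.
    with open(path, "r", encoding="utf-8") as f:
        if f.read() == new_text:
            return False  # Rule 2: nothing changed, leave the file alone.

    st = os.stat(path)
    tmp_path = path + ".new"  # hypothetical temp name in the same directory
    # Rule 3: write a copy, commit it to disk, then replace the original.
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(new_text)
        f.flush()
        os.fchmod(f.fileno(), st.st_mode & 0o7777)  # keep permission bits
        try:
            os.fchown(f.fileno(), st.st_uid, st.st_gid)  # keep ownership
        except PermissionError:
            pass  # an unprivileged process may not be allowed to do this
        os.fsync(f.fileno())  # commit the copy before the rename
    os.replace(tmp_path, path)

    # Commit the rename by syncing the containing directory.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return True
```

With the unchanged-contents check in place, a startup that only reads its settings never reaches the fsync at all, which is the first point of PS2.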

Re:LOL: Bug Report by DragonWriter · 2009-03-19 08:47 · Score: 3, Informative

Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

It's a fairly typical way of trying to achieve something loosely approximating transactional behavior with respect to updates to the file in question without relying on transactional file system semantics.

Sounds like they need to talk to Kirk McKusick by argent · 2009-03-19 08:58 · Score: 4, Informative

Kirk McKusick spent a lot of time working out the right order to write metadata and file data in FFS and the resulting file system, FFS with Soft Updates, gets high performance and high reliability... even after a crash.

Re:LOL: Bug Report by grumbel · 2009-03-19 10:08 · Score: 3, Informative

3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.

You completely missed the point. The new data isn't important, it could be lost and nobody would care. The troublesome part is that you lose the old data too. If you lost the last 5 minutes of changes in your KDE config, that would be a non-issue. What happens, however, is that you don't just lose the last few changes but your complete config: it ends up as 0 byte files, which is a state that the filesystem never had.

Workaround patches already in Fedora and Ubuntu by tytso · 2009-03-19 15:04 · Score: 4, Informative

It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.

Patches to work around buggy applications which don't call fsync() were around long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.

And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc ---- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.

If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.

Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.

Re:No kidding by AvitarX · 2009-03-19 15:48 · Score: 3, Informative

But if the application syncs the file, the new data is written to disk.

This wastes time and performance, and for most files is un-needed.

There are not only "important" and "unimportant" files, there are also "typical" files.

We don't want to lose them, but who cares if recent changes are lost.

Take for example a KDE config file. I am willing to risk all changes made to it since boot (I generally leave my computer off at night, so this is 12 or so hours). I do not want to lose all of my changes since install (this is 10,000 hours).

The method of writing a temporary file and then renaming prevents the second from happening (in EXT3, XFS now, ReiserFS now, and soon EXT4) while still allowing for very aggressive write caching.

EXT4 currently allows for the second to happen unless a disk write is forced, which prevents either of the scenarios.

The loss of the file already synced to disk potentially years ago is the issue, not the loss of the relatively recent data.

EXT4 has essentially removed the option for having "typical" files, and forces them to be treated as "important".

So everything becomes either "every change forces a write" or "we don't care about this" (cache, for example). The typical stuff, where every change is not so critical (in the rare event of a crash) but it sure is nice to have something, becomes elevated to an "important" file that does all of those bad things you describe, and eliminates the ability to cache writes.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg

Re:LOL: Bug Report by Eskarel · 2009-03-19 20:53 · Score: 3, Informative

I did flip read and write, long day.
