Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
This is NOT a bug. Read the POSIX documents.
Filesystem metadata and file contents are NOT required to be synchronous, and a sync is needed to ensure they are synchronised.
It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).
RTFPS (Read The Fine POSIX Spec).
It's even WORSE than just being asynchronous:
EXT4 reproducibly delays write ops, but commits journal updates concerning this write.
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.
All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).
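If you want to check the "about 100 syncs/sec" figure on your own box, here is a rough sketch. The function name measure_fsync_rate and the sample count are made up for illustration, and the number you get depends heavily on the drive, the filesystem, and whether the hardware actually honours the flush.

```python
import os
import tempfile
import time


def measure_fsync_rate(count=50):
    """Return how many write+fsync pairs per second this system completes."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, b"x")   # tiny write, lands in the page cache
            os.fsync(fd)         # block until the kernel reports it flushed
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    # Guard against a zero interval on very fast (e.g. RAM-backed) storage.
    return count / max(elapsed, 1e-9)


if __name__ == "__main__":
    print("fsyncs per second: %.0f" % measure_fsync_rate())
```

A suspiciously high number on a plain magnetic disk is a hint that something in the stack is faking the sync.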
POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.
> Delayed writes should lose at most any data between commit and actual write to disk.
And that's exactly what ext4 does.
Application decides to update some file:
1) Reads the file
2) Modifies the buffer as needed
3) Truncates the file
4) Writes the buffer to the file
Now, if the filesystem commit happens right between 3 and 4, the truncation hits the disk, but the new content does not (yet). If a crash happens before the next commit, all that remains is the truncated file.
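The four steps above can be sketched roughly like this. It is only an illustration of the risky pattern, not what any particular application actually ships; the names unsafe_update and on_truncated are hypothetical, and the callback exists only so you can observe the file in the gap between steps 3 and 4.

```python
import os


def unsafe_update(path, transform, on_truncated=None):
    """Rewrite a file in place: read, modify, truncate, write (risky)."""
    # 1) Read the file
    with open(path, "rb") as f:
        data = f.read()
    # 2) Modify the buffer as needed
    new_data = transform(data)
    # 3) Truncate the file: O_TRUNC empties it as soon as it is opened
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        # If a commit records the truncation and the machine dies before
        # the write below reaches disk, this empty file is what survives.
        if on_truncated is not None:
            on_truncated(path)
        # 4) Write the buffer to the file
        view = memoryview(new_data)
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

Passing a callback that checks os.path.getsize(path) shows the file really is zero bytes long during that window.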
Nothing- except that it's not in the spec.
POSIX is like a contract. KDE is breaking the contract and then whining about it to ext4- which isn't breaking the contract. Just as in a court, KDE here doesn't have much of a leg to stand on.
Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
You seem to misunderstand that's *exactly* what is happening.
KDE is *DELETING* all of its config files, then writing them back out again in two operations.
Three states now exist, the 'old old' state, where the original file existed, the 'old' state, where it is empty, and the 'new' state where it is full again.
The problem is getting caught between step #2 and step #3, which on ext3 was mostly mitigated by the write delay being only 5 seconds.
KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.
What's wrong with "After a file is closed, it's synced to disk"?!?
What, you want people to have to delay/stagger/coordinate their file closes in order to avoid overloading the filesystem? That is the wrong approach. close() just means that the application is done with the file. The sync calls are not a joke; they are there precisely because close() already has entirely sensible but different semantics. Anybody that wants close also to sync can code it that way without problem. Anybody else probably does not want this behaviour in the first place.
This is not hidden in any way. A simple "man close" not only warns of this, it also refers the reader to the fsync call. Anybody getting bitten by this did not do their homework.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
You are quite simply wrong. The GP states the correct POSIX behaviour. If anything is a flaw it is a flaw in POSIX, *not* the filesystem.
This kind of crap coupled with the recent Active Directory question where the Slashdot community proved that it does not know what the hell group policies do is the reason that GNU/Linux/GNOME/KDE will not get a (significant) share of the enterprise desktop - Linux fucking weenies who don't know jack.
The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.
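For anyone who has never actually called it, a minimal sketch of "write, then fsync, then trust it" looks something like this. The helper name write_durably is made up; os.fsync is Python's wrapper around the fsync call being discussed, and the guarantee still assumes the drive or controller does not fake the flush.

```python
import os


def write_durably(path, data):
    """Write data to path and return only after fsync has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write less than requested
            view = view[written:]
        os.fsync(fd)  # raises OSError if the flush fails
    finally:
        os.close(fd)
```

Until os.fsync returns, a crash may leave any mix of old, empty, or partial contents; this sketch also truncates in place, so it only addresses durability, not atomic replacement.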
ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
I read the FA, and it actually really does look like the applications are simply using stupidly risky practices:
These applications are truncating the file before writing (i.e., opening with O_TRUNC), and then assuming that the truncation and any following write are atomic. That's obviously not true -- what happens if your system is very busy (not surprising in the startup flurry which is apparently where this stuff happens), the process doesn't get scheduled for a while after the truncate (but before the write), and the system happens to crash in that interval?
I'm as lazy as they get, but even I know enough not to do that kind of crap...
There's probably some way the FS could finesse this issue -- e.g., don't actually schedule truncation until you see the first write or close -- but it would be a workaround for buggy applications, not a FS bugfix.
mount -o sync. Enjoy your slow returns and strictly ordered writes.
It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific call, fsync, that flushes the buffers to prevent the problem.
In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html
Yes, if we had a perfect world we would have instant IO, but we do not. The flaw is in the application, plain and simple.
They didn't use the api properly and it really is just that simple.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Right... that way a single error can brick the whole system at once.
Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.
Much as I love to fall back on the "POSIX says that this could be the case so it is OK that it is the case" excuse, it really does not fly in this case. POSIX doesn't allow this sort of behavior because it is a "good" thing to do, it allows it because there are systems where this is an OK thing to do -- systems intended to manage databases, systems that are heavily verified and have backup power supplies, etc. This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth. EXT4 should not be used in a desktop system if it can cause data loss when the unexpected happens, regardless of the technical merits of writing to small configuration files.
change their applications because a new version of the file system breaks their stuff is madness
Their applications were already broken, committing everything every 5 seconds* regardless of what the applications had wanted was the workaround in ext3, but I guess it's only madness when street-makers demand that you drive with round wheels, not when you demand that street-makers accommodate your square ones.
* Unless you increased the commit time to reduce power usage (eg laptop_mode)
Quoting from the message Ts'o posted:
----
So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.
----
And indeed, the NOTES section of "man -S2 close" explicitly states what is not mentioned in the other sections. Up to this day I also lived under the assumption that a close implies an fsync. Now I have to change my programs where it matters. All the Idiots who scream here that the OS is doing something wrong: no, it's not. AFAIU it's following the defined behaviour, which is what I expect an OS to do. It should NOT try to magically guess where I forgot to fsync my files.
Just use fsync()
Problem solved. Read the Posix docs, or the clib docs and you will never run into this problem.
Who modded this up? Jane Q. Public is completely clueless on this topic, but she manages to sound like she has an idea to fellow clueless moderators. She should be called out for the karma whoring ignoramus she is.
Some choice quotes from her on this thread.
Delayed allocation is like leading a moving target when shooting.
BadAnalogyGuy would be proud. Probably also worth mentioning that without delayed allocation, the system would be unbearably slow.
The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything really -- going wrong
A kernel crash or power outage is certainly something that could go wrong. Modern journalling file-systems handle this gracefully by making sure the file-system is in a consistent state when it comes back up.
The filesystem is flawed, plain and simple.
You'll realize why that one is a gem when you read her next quote. As the discussion continues, she begins to realize how far off the mark she is and begins to correct...
It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault.
Still off the mark, but perhaps she is beginning to figure out what a file system should offer and what the issue being discussed is.
If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples.
Go ahead and do that. But if you want to make sure your data is written, in case of a kernel crash or power outage, then you had better understand what is going on at the FS level.
As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter.
No, but you should understand the API of the language you are dealing with. Since when does a compiler handle disk I/O anyway? As for your interpreter, it is free to call fsync whenever it wants, but what has that got to do with the FS again?
Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.
The purpose of this quote is to demonstrate that she both has no regard for TFA and also has no idea what this issue being discussed is. I encourage anyone looking to give her mod points actually RTFA and also do a bit of background reading on file systems and in particular delayed writes.
My point was and still is: if the data is not flushed to disk yet, it should either be accessible from the buffer, or not at all.
This sentence alone deserves a -1 Huh? If you do a write, and it is successful, then you can do a read on the same file and it will return what you wrote, whether or not it had been flushed to disk. This is the way it is supposed to work. Think about it for like 10 seconds and you'll begin to get it.
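If you don't believe it, try it. A quick sketch (the function name read_back_without_sync is made up) that writes through one descriptor, never calls fsync, and reads the file back through a second open:

```python
import os
import tempfile


def read_back_without_sync(payload):
    """Write payload, skip fsync entirely, and read it back via a new open."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)        # goes to the kernel, not necessarily to disk
        with open(path, "rb") as f:  # separate open of the same file
            return f.read()          # served from the kernel's cached copy
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(read_back_without_sync(b"not flushed yet"))
```

The read sees the new bytes whether or not they have reached the platter; delayed writes only matter when the machine goes down before the flush.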
not supposed to have to worry about OS-specific details
WE ARE TALKING ABOUT UNEXPECTED KERNEL CRASHES AND POWER OUTAGES. If you care about that situation then you should get a clue before you start coding. If not, then what is the problem, or was it fault... er, sorry limitation?
One should not have to know about syncing to do something like a few simple file writes
And one doesn't need to if she is not concerned with the rare possibility that the system CRASHES OR LOSES POWER in the next few minutes.
Anyway, I've never called out another poster like this before and now I feel dirty.
Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes. For example Plasma and some of the KDE core config files were reset. Also some of my MySQL databases were killed...
My EXT4 partitions all use the default settings with no performance tweaks. Barriers on, extents on, ordered data mode..
I used Ext3 for 2 years and I never had any problems after power losses or system crashes.
The crash was not caused by ext4 but by something else. The file system was in a consistent state because of the journal. Some data had not yet been written to disk, because of the delayed write and was thus lost.
Maybe you need to take a break, or have a coffee, or get some sleep or something. But you really are way off and posting way too much on this topic that you are not well informed of.
This is not a bug, not a flaw, not a limitation. You can write and then read regardless of whether or not actual disk commits take place. The file system takes care of that for you. If you're doing file I/O, and you want to call yourself half-way competent, then you should have some clue about the possibility that the underlying file-system will be doing delayed writes. If you are writing critical applications for which this may cause issues, then you might decide to throw in some fsync calls (or their equivalent on whatever platform you are using).
I know you have learnt something today. Glad to help out.
It most definitely is a filesystem limitation.
No, it's not. The file system is perfectly capable of making sure all your writes hit the disk as soon as possible.
Just mount it with the 'sync' option.
If you want the significant performance benefits of delayed writes, however, you should not use 'sync' and accept that, with Ext4, write() works the way the documentation says it does.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
They are referring to the case when the system isn't shut down cleanly. This means a kernel crash or a power outage. What is your point exactly? Seriously, and I really am doing my best to hold back on the personal insults (even when you say something as annoying as "And calm down !!"), what is so difficult that you fail to comprehend what the real issue being discussed here is?
If your battery-backed RAID controller ever fakes a fsync it is fundamentally broken or misconfigured. When the cache is filled with a write backlog and you try to write something else, that write will block until there is free space. Same as any other write cache that fills up.
When cache space is available to cache the write again, the data goes into there, and then a fsync request after it can then return success.
5 seconds might reduce the probability of problems, but it doesn't make the assumption a non-bug.
That's like saying if my code has a buffer overflow in it, but if it's only by 5 bytes, everything's ok, whereas if it's by 150 bytes, I should panic...
One way to test if your argument makes sense is to extend it to absurdity.
And the result has absolutely no bearing on the issue. Extending 5 seconds to infinity is nothing like extending 5 seconds to 150 seconds.
If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.
On some systems you may actually find this to be the case. On certain kernels, certain hard drives had a write cache, and sync() would not force the drive itself to flush its own cache; data could sit in there for minutes, to be lost in the event of an untimely power failure.
Most applications handle this reasonably; maintain transactional integrity, and sync() when it is critical that a write finish on a timely basis, and in event of a crash, revert to the last 'good' state.
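One way the "revert to the last 'good' state" part might look for a config file, as a sketch only: it assumes the application keeps a backup copy next to the primary (the ".bak" suffix and the name load_config are invented here) and treats a missing or zero-length primary as damaged.

```python
def load_config(path, backup_suffix=".bak"):
    """Return the primary file's bytes, or the backup's if the primary is unusable."""
    for candidate in (path, path + backup_suffix):
        try:
            with open(candidate, "rb") as f:
                data = f.read()
        except FileNotFoundError:
            continue      # nothing at this location, try the next candidate
        if data:          # assumption: an empty file means a lost write
            return data
    return None           # no usable copy; caller falls back to defaults
```

A real application would probably also validate the contents (a checksum or a parse), since a crash can leave garbage as well as emptiness.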
Transactional database software like PostgreSQL is exceptional at this, and it does use sync.
If you have a lot of critical data, the right place to put it is in a DBM, that will handle and manage syncing correctly and optimally for the OS.
If you have small amounts of critical data, then you write them to flatfiles, and sync. The small size of the files, and the small number of writes you do to them will make performance a non-issue.
Maintaining integrity of critical data requires a lot more than a good filesystem, and the ability to ensure data is sync'ed to disk.
Because even 5 seconds is non-zero, which is all the time in the world if you leave the files on disk in a state where they would be corrupt or inconsistent should the system crash at that moment.
Filesystems don't and never did totally relieve application developers of having to worry about what might (or might not) be written to disk by the OS.
Certainly it's unreasonable they make particular assumptions about the exact nature of the duration it takes, since there are so many filesystems available, including some unusual ones like NFS.
(void)sleep(5); after a write is not, and never was a substitute for fsync(); for assuring data is written before writing more.
Telling application developers to use a database is bullshit. The filesystem is a database, albeit not a relational one. A open-write-close-rename sequence merely asks for atomicity without durability, something that's perfectly reasonable. As other posters have mentioned in vain, all the application wants is for either the old version of a file or the entire new version to appear on a reboot. He doesn't care at the instant of the rename whether that replacement has been recorded on disk, just that eventually, when the filesystem does record that replacement, that it's recorded atomically.
You might want the open-write-fsync-close-rename behavior for a mailserver, in which you must acknowledge receipt (i.e., you need durability), but asking for that same durability in a multi-file configuration setup is just stupidly degrading performance.
open-write-close-rename is saying something fundamentally different from open-write-fsync-close-rename, and it's perfectly reasonable for a filesystem to act sanely in response to both kinds of request.
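To make the difference between the two sequences concrete, here is a sketch with the fsync as an optional step. The name replace_file, the durable flag and the ".tmp" suffix are all made up for illustration. What a crash leaves behind in the non-durable case depends on the filesystem, which is exactly what this thread is arguing about.

```python
import os


def replace_file(path, data, durable=False):
    """Replace path with data via a temp file and rename.

    durable=False: open-write-close-rename (asks for atomicity only).
    durable=True:  open-write-fsync-close-rename (also asks for durability).
    """
    tmp = path + ".tmp"  # hypothetical temp name in the same directory
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        if durable:
            os.fsync(fd)  # wait for the new contents before renaming
    finally:
        os.close(fd)
    # rename() atomically swaps the name for running processes; whether the
    # data is on disk when the rename is depends on the fsync above.
    os.rename(tmp, path)
```

The mailserver case would call it with durable=True; the config-file case is the one where people want to leave it False and still not end up with an empty file.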
All the Idiots who scream here that the OS is doing something wrong: no, it's not.
This is called "hiding behind the standard" (a disease very common among kernel developers). Just because the standard doesn't specify behaviour in a certain situation doesn't mean that any behaviour is equally okay. In this case, ext4's behaviour very much hurts the robustness of the system, which is rather important in unreliable environments like laptops.
In this case, what KDE does is certainly not unreasonable (and its developers are certainly not "idiots"). It doesn't overwrite configuration files in place, which would be bad even in the absence of system crashes, as doing it that way is not atomic. Instead it creates a new temporary file, writes the new contents, then renames the temporary file to the old one. This is an atomic operation on Unix: you either see the old contents or the new contents, but nothing in between. Now, the problem is that in case of a crash, ext4 gives you the worst possible outcome by reordering the operations: it will "recover" the rename for you, but not the actual write of the new data. So you end up with a 0-byte file - far from atomic. POSIX of course allows this, but POSIX allows just about anything: that doesn't mean it's reasonable. The only guaranteed solution - use an fsync/fdatasync - is something that almost nobody does because the performance is horrible (ext3 in fact will write the entire journal, IIRC, when doing an fsync() on a single file - this really hurt Firefox 3 performance). So the KDE developers can be excused for not doing that.
It's the job of a modern filesystem to ensure robustness and performance. If you don't use an fsync, you should expect that there is a time window during which transactions might become undone (not the end of the world for configuration files), but they should never be reordered. For instance, this is how Berkeley DB works if you disable fsync: it guarantees ACI but not ACID. For many desktop applications, that's good enough. Destroying every file that has been updated since the last fsync isn't. And your users aren't going to be impressed by the argument that POSIX allows it.
open-write-close-rename already asks for atomic but asynchronous rename under all sane systems
I'm not sure what you're saying here. Are you arguing that such a sequence should be treated specially by the OS? Why?
XFS and ext4 break that perfectly sane sequence of operations
It isn't sane. It's like replacing your tires with the engine running and your kid sitting behind the wheel. Sure it might work 9 out of 10 times, until your kid switches the car into gear.
KDE (and Gnome) are truncating critical system files without a backup available. How is that sane? Sure they will immediately rewrite the file, but who will guarantee that the system will not crash between the truncate and the write?
And finally, they aren't doing open-write-close-rename. They're doing truncate-write-close. What they should be doing is create-write-close-sync-rename, i.e. do not overwrite the old config file before the new content is safely stored on disk. And I think the reason that they did not go the correct way (assuming they were aware of the issue) is because the "safe" way sucked performance-wise. Well duh, if you write hundreds of 50-byte files, performance will suck, unless you skip safety protocol.
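For the "lots of tiny config files" case, create-write-close-sync-rename might be staged like this: write and fsync every new file first, rename them all afterwards, then fsync the directory so the renames themselves get recorded. The names save_configs and ".new" are invented for the sketch, and opening a directory to fsync it is a Unix-ism that I am assuming is available.

```python
import os


def save_configs(directory, configs):
    """Safely replace several small files; configs maps file name -> bytes."""
    staged = []
    for name, data in configs.items():
        final = os.path.join(directory, name)
        tmp = final + ".new"  # hypothetical temp name next to the target
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            view = memoryview(data)
            while len(view) > 0:
                written = os.write(fd, view)
                view = view[written:]
            os.fsync(fd)  # new content is on disk before any rename happens
        finally:
            os.close(fd)
        staged.append((tmp, final))
    # Old files are only replaced once every new file is safely stored.
    for tmp, final in staged:
        os.rename(tmp, final)
    # Flush the directory entry changes as well (assumes a Unix-like OS).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This still pays one fsync per file, which is the performance cost complained about above; it just never leaves a truncated file where the old one used to be.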
Well, it has to be said that fsync() on ext3 is slow because of an ext3 bug - fsync() is the same as sync() on ext3.
Re: "backup old file and write a new one" - A transactional copy-on-write filesystem such as Sun's ZFS is doing almost the same job, transparently.
I have little doubt that copy-on-write will eventually supersede overwrite-and-pray filesystems. The wins are numerous, including cheap snapshotting, etc, etc. Install OpenSolaris and give ZFS a try today!