Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
Don't worry guys, I read the summary this time, and it only affects the German version of ext4.
Real reason for the bug report: Someone's angry and wants his porn back.
Blaming it on the applications is a cop-out. The filesystem is flawed, plain and simple. The journal should not be written so far in advance of the records actually being stored. That is a recipe for disaster, no matter how much you try to explain it away.
The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.
Anyways, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody that develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power-failure.
Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
It is a trade-off between reliability and performance. In this case, older != better, either. A lot of OS design decisions are trade-offs.
It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.
Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.
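For readers who have not used it, here is a minimal sketch of what opening a file with O_SYNC semantics can look like, written in Python because its os module maps directly onto the POSIX calls. The function name write_synchronously and the file mode are assumptions made for illustration; nothing here is taken from KDE or from the bug report.

```python
import os


def write_synchronously(path: str, data: bytes) -> None:
    """Write data through a descriptor opened with O_SYNC (POSIX only).

    With O_SYNC, each write() is meant to return only after the kernel
    reports the data as written to the underlying device.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
    finally:
        os.close(fd)
```

Every write() on such a descriptor pays the cost of waiting for the device, which is the performance trade-off this thread keeps coming back to.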
Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.
Meh, this is crap that happens only when the system crashes, and is pretty much unavoidable if you're doing a lot of caching in memory -- which, coincidentally, is what you need to do to maximize performance. This doesn't sound like the filesystem's "fault" or the application's "fault;" it's just the way things are. Everybody knows that if you don't cleanly unmount, most bets are off.
So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.
But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.
EXT4 aims to replace EXT3 and become the next-gen de facto filesystem on the Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.
It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.
There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
I'll take "I didn't lose my data" over "ext4 runs 1.5x faster than ext3," thank you. What use is performance to me if I have to be absolutely certain that it won't crash, or I lose my (in my very high performance filesystem) data?
Also, ext4 is touted as having additional reliability checks to keep up with scalability, etc... not as being less reliable at the expense of performance.
Reliability
As file systems scale to the massive sizes possible with ext4, greater reliability concerns will certainly follow. Ext4 includes numerous self-protection and self-healing mechanisms to address this.
(from Anatomy of ext4)
I can only imagine the response if tests were done on Windows 7 beta that showed a crash after this or that resulted in loss of data. :)
The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.
True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.
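As an illustration of the explicit flush-to-disk being discussed, here is a small sketch in Python. The helper name append_record and its data_only parameter are made up for this example. Note the two separate layers: flush() empties the language runtime's own buffer into the kernel, and fsync() or fdatasync() then asks the kernel to push the data to the device.

```python
import os


def append_record(path: str, record: bytes, data_only: bool = False) -> None:
    """Append a record and ask the kernel to put it on stable storage."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()  # user-space buffer -> kernel page cache
        if data_only and hasattr(os, "fdatasync"):
            # fdatasync skips metadata that is not needed to read the data
            # back (e.g. timestamps); it is not available on every platform.
            os.fdatasync(f.fileno())
        else:
            os.fsync(f.fileno())  # kernel page cache -> storage device
```

Whether the data truly reaches the platters still depends on the drive and controller honoring the flush, so this is a sketch of the application's side of the contract only.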
What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.
Thing is that ext3 is using the same strategy on a smaller scale. The same argument could be made to say that 3 seconds is far too long to be out of date. How many instructions are you going to run in 3 seconds? Defects run at 5-8 per/kloc on average. Certainly not all are fatal, but how long of a delay is too long to avoid a potentially fatal defect? Obviously the delay they have chosen is too long, but is the performance hit that ext3 takes for having a 3 second delay rather than a 5 or 10 or 15 second delay worth it?
When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion of the overall file system activity, or you might as well run on a ramdisk.
While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.
BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.
We use techniques that show great performance so people can see we beat ext3 and other filesystems.
Oh shit, as a tradeoff we lose more data in case of a crash. But it's not our fault.
Honestly, you cannot eat your cake and have it too.
As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.
*No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out unless the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason": most actually guarantee only that the file system will recover itself, not the data). It is universally true - for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.
The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.
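A minimal sketch of that ordering point, in Python: the payload is written and synced first, and only then is a small marker file written and synced. The names write_in_order, is_committed and the marker contents are hypothetical, chosen only for this illustration.

```python
import os


def _write_and_sync(path: str, data: bytes) -> None:
    """Write a file and wait until the kernel reports it as on disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def write_in_order(data_path: str, marker_path: str, payload: bytes) -> None:
    # Step 1: the payload must be durable before anything refers to it.
    _write_and_sync(data_path, payload)
    # Step 2: the marker is written only after step 1's fsync returned,
    # so a marker found after a crash implies the payload was synced first.
    _write_and_sync(marker_path, b"committed\n")


def is_committed(marker_path: str) -> bool:
    """A reader treats the payload as valid only if the marker exists."""
    return os.path.exists(marker_path)
```

A stricter version would also fsync the containing directory so that the new directory entries themselves are durable; that step is left out here to keep the sketch short.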
This is an issue of great sensitivity for databases. See for example:
That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.
-- Sig down
Nothing- except that it's not in the spec.
POSIX is like a contract. KDE is breaking the contract and then whining about it to ext4- which isn't breaking the contract. Just as in a court, KDE here doesn't have much of a leg to stand on.
The ringing of the division bell has begun... -PF
What's wrong with "After a file is closed, it's synced to disk"?!?
What, you want people to have to delay/stagger/coordinate their file closes in order to avoid overloading the filesystem? That is the wrong approach. close() just means that the application is done with the file. The sync calls are not a joke; they are there precisely because close() already has entirely sensible but different semantics. Anybody that wants close also to sync can code it that way without problem. Anybody else probably does not want this behaviour in the first place.
This is not hidden in any way. A simple "man close" not only warns of this, it also refers the reader to the fsync call. Anybody getting bitten by this did not do their homework.
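For anyone who does want a close that also syncs, here is one way it could be coded, as a sketch in Python. The name open_synced is hypothetical; it is simply a context manager that flushes and fsyncs before closing.

```python
import os
from contextlib import contextmanager


@contextmanager
def open_synced(path: str, mode: str = "wb"):
    """Open a file whose normal exit path syncs to disk before closing."""
    f = open(path, mode)
    try:
        yield f
        # Reached only if the body raised no exception.
        f.flush()             # runtime buffer -> kernel
        os.fsync(f.fileno())  # kernel -> storage device
    finally:
        f.close()
```

Usage looks like an ordinary open: `with open_synced("settings.conf") as f: f.write(b"key=value\n")`. Applications that close many files would pay for one fsync per file, which is exactly why this is opt-in rather than what close() does.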
The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.
"And lets face it: fsync/fdatasync are not really a secret to any competent developer."
I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.
And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.
I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.
mount -o sync. Enjoy your slow returns and strictly ordered writes.
Bah. Maybe all computers should come with a single-cell battery, for a couple of minutes of backup power.
As soon as power fails to the system and it resorts to battery, all calls to write() should also call fsync(), even if that slows the system down.
Never mind an option that implicitly calls fsync() if it hasn't been called in the past 3 seconds, for a minimal performance hit. If you have a specific application that doesn't want fsync() then you can disable that feature, but clearly on a consumer box, no UPS, potentially dodgy hardware and drivers, it makes sense. 150 seconds without a sync, just dumping into a buffer for writing ... sheesh.
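The idea of an implicit fsync() when none has happened in the past 3 seconds is proposed here as a system feature. As a rough application-level approximation only, it could be sketched in Python like this. The class name PeriodicSyncWriter, its parameters and the sync_count attribute are all invented for illustration, and the check only runs when a write arrives; a real implementation would need a timer to cover idle periods.

```python
import os
import time


class PeriodicSyncWriter:
    """Append-only writer that fsyncs once the last sync is too old."""

    def __init__(self, path, max_unsynced_seconds=3.0, clock=time.monotonic):
        self._file = open(path, "ab")
        self._limit = max_unsynced_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_sync = clock()
        self.sync_count = 0

    def write(self, data: bytes) -> None:
        self._file.write(data)
        # Sync only if the previous sync is older than the limit.
        if self._clock() - self._last_sync >= self._limit:
            self.sync()

    def sync(self) -> None:
        self._file.flush()
        os.fsync(self._file.fileno())
        self._last_sync = self._clock()
        self.sync_count += 1

    def close(self) -> None:
        self.sync()
        self._file.close()
```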
Unix philosophy is to make configuration files user- and script-editable. NOT to create hundreds of files per app making it utterly unmanageable.
-Billco, Fnarg.com
You are an idiot. The design of the POSIX API dictates that fsync (or equivalent) is required to ensure data is flushed to disk. This has been true forever. If an abstraction in an i/o library is not using the API correctly, it is the fault of the library.
You are correct that the user of the abstraction should not care, but you are putting the blame in the wrong place. The whole point of using an abstraction is to hide details such as this. If the library author is too stupid to learn the API he is abstracting that is HIS fault.
Optimize the reads all you want, but those writes better damn well happen before the calls that say data is written return.
And this is where most of the confusion comes from. There is a difference between a logical write and a physical write. When your write call completes, it says the logical write has completed. It says nothing about the physical write. Depending on file system semantics, your physical write may have already completed too, or may complete shortly after. If you must ensure the physical write is complete, then you must explicitly ensure it via code; otherwise the physical write can only be assumed. And this is where the less informed seem confused by their own poor expectations and ignorance. Unless they are actually following their write with some sort of file system synchronization call, whatever their expectation, they have no right whatsoever to assume the data will still be there in the face of a system crash. It's a very poor coder who falls into that trap.
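The logical/physical distinction can be seen in a few lines. In this Python sketch (the function name logical_then_physical is made up for illustration), the data is readable immediately after write() returns, before any sync has been requested; the fsync() afterwards is the separate step that asks for the physical write.

```python
import os


def logical_then_physical(path: str, data: bytes) -> bytes:
    """Write, read back before syncing, then request the physical write."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)             # logical write: kernel now has the data
        os.lseek(fd, 0, os.SEEK_SET)
        seen = os.read(fd, len(data))  # served whether or not it hit the disk
        os.fsync(fd)                   # physical write: wait for the device
        return seen
    finally:
        os.close(fd)
```

The read-back succeeding says nothing about what would survive a crash at that moment; only the return of fsync() does, and even that assumes the hardware honors the flush. For brevity the sketch assumes a small payload that write() accepts in one call.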
Good programmers know this and have known it for tens of years. Good database programmers know this. Good file system developers know this. Those that are outraged by their own ignorance are either not programmers or are not good programmers.
And lastly, I'll point out, which is exactly why Ts'o pointed it out - use a solution whose foundation is built by coders who already understand the proper way to ensure data is safe on the file system - for example, use a database. While I don't consider the use of a database to be an ideal solution here, it does a wonderful job of highlighting the crappy design both KDE and GNOME have used to store configuration data - and how unconcerned they are about data loss and data corruption. If the developers of KDE and GNOME don't give a crap about your configuration data then how on earth can you possibly be upset at the file system for doing what it's supposed to do?
In short, both KDE and GNOME need to give a crap about how, when, and why they write configuration data. Since they don't care about data integrity, you now know who you should be angry at. Here's a hint, and it doesn't have anything to do with the file system.
Much as I love to fall back on the "POSIX says that this could be the case so it is OK that it is the case" excuse, it really does not fly in this case. POSIX doesn't allow this sort of behavior because it is a "good" thing to do, it allows it because there are systems where this is an OK thing to do -- systems intended to manage databases, systems that are heavily verified and have backup power supplies, etc. This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth. EXT4 should not be used in a desktop system if it can cause data loss when the unexpected happens, regardless of the technical merits of writing to small configuration files.
Palm trees and 8
It isn't a file system limitation. And here is why.
1. The POSIX standard specifies that writes may be delayed. Every modern file system may delay writes.
2. The POSIX standard then gives you a way to flush the buffer at the time of the program's choosing. It is called fsync(). If the programmer had called that well-documented function then all would have been well.
You have the best performance possible and you can ensure that the file is flushed before you do something else.
The file system didn't cause this bug. The posix spec didn't cause this bug. The programmer that didn't use the tools as documented caused his own bug.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
change their applications because a new version of the file system breaks their stuff is madness
Their applications were already broken, committing everything every 5 seconds* regardless of what the applications had wanted was the workaround in ext3, but I guess it's only madness when street-makers demand that you drive with round wheels, not when you demand that street-makers accommodate your square ones.
* Unless you increased the commit time to reduce power usage (eg laptop_mode)
If I have been able to see further than others, it is because I bought a pair of binoculars.
Citing from the message Ts'o posted:
----
So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.
----
And indeed, the NOTES section of "man -S2 close" explicitly notes what is not mentioned in the other sections. Up to this day I also lived under the assumption that a close implies an fsync. Now I have to change my programs where it matters. All the idiots who scream here that the OS is doing something wrong: no, it's not. AFAIU it's following the defined behaviour, which is what I expect an OS to do. It should NOT try to magically guess where I forgot to fsync my files.
People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.
Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:
Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.
It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.
It's the bad outcomes that we're trying to mitigate by having a spec in the first place!
So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.
The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.
STOP . AMERICA . NOW
The old defaults were: 5 seconds in ext3; in NTFS, metadata is always journaled and data is flushed ASAP, but with no guarantees. In practice, people don't lose huge amounts of work.
Actually, I've lost multi-gigabyte files on NTFS; in one particular case I left IE downloading a game installer overnight, heard it beep around 8am to tell me it had completed, and then the power went out a couple of hours later before I got up. The file system was magically 'consistent' after the power came back and it rebooted, but it achieved that by deleting over two gigabytes of my data.
Modern file systems may be a bit faster than FAT32, but they're shit when it comes to reliably storing data.
In this case, yes, the KDE developers are retarded, but if the ext4 developers want ext4 to become the default filesystem for Linux, they need to make it work with retarded developers. 'But POSIX says we can do this' is worthless if it loses large amounts of user data; heck, you can easily guarantee 'file system consistency' by simply reformatting the disk on every reboot, but your users would be pretty damn pissed.
Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes. For example Plasma and some of the KDE core config files were reset. Also some of my MySQL databases were killed...
My EXT4 partitions all use the default settings with no performance tweaks. Barriers on, extents on, ordered data mode..
I used Ext3 for 2 years and I never had any problems after power losses or system crashes.
The crash was not caused by ext4 but by something else. The file system was in a consistent state because of the journal. Some data had not yet been written to disk, because of the delayed write and was thus lost.
Maybe you need to take a break, or have a coffee, or get some sleep or something. But you really are way off and posting way too much on this topic that you are not well informed of.
This is not a bug, not a flaw, not a limitation. You can write and then read regardless of whether or not actual disk commits take place. The file system takes care of that for you. If you're doing file I/O, and you want to call yourself half-way competent, then you should have some clue about the possibility that the underlying file system will be doing delayed writes. If you are writing critical applications for which this may cause issues then you might decide to throw in some fsync calls (or their equivalent on whatever platform you are using).
I know you have learnt something today. Glad to help out.
As it turns out, the point is probably moot. As someone else has pointed out, the bug report itself (not TFA) makes it clear that the trashed data was, in fact, caused by a system crash and not by filesystem access per se. TFA and the headline both strongly implied otherwise, but as it turns out, this is a non-issue.
It most definitely is a filesystem limitation.
No, it's not. The file system is perfectly capable of making sure all your writes hit the disk as soon as possible.
Just mount it with the 'sync' option.
If you want the significant performance benefits of delayed writes, however, you should not use 'sync' and accept that, with Ext4, write() works the way the documentation says it does.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
My core2quad machine with 3 SATA disk RAID runs for about 20 minutes on a tiny APC UPS I bought from newegg for less than $100.
Sure, but that's assuming you can save your work in all open applications without power to your display. Me, I like a UPS with a little more juice so I can reap the fullness of my 52" plasma while cleaning up and shutting down.
I am literally 3000 tokens away from the chaotic crossbow --Stephen
"Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. "
Two things are happening:
(1) KDE is writing a new inode.
(2) KDE is renaming the directory entry for the inode, replacing an existing inode in the process.
KDE never calls fsync(2), so the data from step one is not committed to be on disk. Thus, KDE is atomically replacing the old file with an uncommitted file. If the system crashes before it gets around to writing the data, too bad.
EXT4 isn't "broken" for doing this, as endless people have pointed out. The spec says if you don't call fsync(2) you're taking your chances. In this case, you gambled and lost.
KDE isn't "broken" for doing this unless KDE promised never to leave the disk in an inconsistent state during a crash. That's a hard promise to keep, so I doubt KDE ever made it.
A system crash means loss of data not committed to disk. A system crash frequently means loss of lots of other things, too. Unsaved application data in memory which never even made it to write(2). Process state. Service availability. Jobs. Money. System crashes are bad; this should not be news.
The database suggestion some are making comes from the fact that if you want on-disk consistency *and* good performance, you have to do a lot of implementation work, and do things like batching your updates into calls to write(2) and fsync(2). Otherwise, performance will stink. This is a big part of what databases do.
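To make the batching point concrete, here is a sketch in Python of appending several records with a single fsync() at the end instead of one per record. The name append_batch is hypothetical, and this shows only the batching idea, not how any particular database implements it.

```python
import os
from typing import Iterable


def append_batch(path: str, records: Iterable[bytes]) -> None:
    """Append many records, paying for one fsync instead of one per record."""
    with open(path, "ab") as f:
        for record in records:
            f.write(record)   # accumulates in buffers, no disk wait yet
        f.flush()             # hand the whole batch to the kernel
        os.fsync(f.fileno())  # single wait for the storage device
```

The trade-off is that a crash before the fsync can lose the whole batch, so the batch size bounds how much unsynced work is at risk.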
As someone else suggested, it's perfectly easy to make writes atomic in most filesystems. Mount with the "sync" option. Write performance will absolutely suck, but you get a never-loses-uncommitted-data filesystem.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
If your battery-backed RAID controller ever fakes a fsync it is fundamentally broken or misconfigured. When the cache is filled with a write backlog and you try to write something else, that write will block until there is free space. Same as any other write cache that fills up.
When cache space is available to cache the write again, the data goes into there, and then a fsync request after it can then return success.
All the idiots who scream here that the OS is doing something wrong: no, it's not.
This is called "hiding behind the standard" (a disease very common among kernel developers). Just because the standard doesn't specify behaviour in a certain situation doesn't mean that any behaviour is equally okay. In this case, ext4's behaviour very much hurts the robustness of the system, which is rather important in unreliable environments like laptops.
In this case, what KDE does is certainly not unreasonable (and its developers are certainly not "idiots"). It doesn't overwrite configuration files in place, which would be bad even in the absence of system crashes, as doing it that way is not atomic. Instead it creates a new temporary file, writes the new contents, then renames the temporary file to the old one. This is an atomic operation on Unix: you either see the old contents or the new contents, but nothing in between. Now, the problem is that in case of a crash, ext4 gives you the worst possible outcome by reordering the operations: it will "recover" the rename for you, but not the actual write of the new data. So you end up with a 0-byte file - far from atomic. POSIX of course allows this, but POSIX allows just about anything: that doesn't mean it's reasonable. The only guaranteed solution - use an fsync/fdatasync - is something that almost nobody does because the performance is horrible (ext3 in fact will write the entire journal, IIRC, when doing an fsync() on a single file - this really hurt Firefox 3 performance). So the KDE developers can be excused for not doing that.
It's the job of a modern filesystem to ensure robustness and performance. If you don't use an fsync, you should expect that there is a time window during which transactions might become undone (not the end of the world for configuration files), but they should never be reordered. For instance, this is how Berkeley DB works if you disable fsync: it guarantees ACI but not ACID. For many desktop applications, that's good enough. Destroying every file that has been updated since the last fsync isn't. And your users aren't going to be impressed by the argument that POSIX allows it.
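For reference, the write-a-temporary-file-then-rename pattern described above, with the fsync() that the thread is arguing about added before the rename, might look like the following Python sketch. The function name atomic_write and the ".tmp-" prefix are assumptions for illustration; this is not KDE's actual code. It is POSIX-oriented: the final directory fsync is an extra step that makes the rename itself durable.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path so a crash leaves either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # new data is on disk BEFORE the rename
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # do not leave a stray temporary file
        raise
    # Sync the directory so the new entry survives a crash as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Dropping the fsync() on the temporary file gives the sequence discussed in this thread: the rename can reach the disk before the data does, which is how the 0-byte files come about. Note that mkstemp creates the file with owner-only permissions, so a real implementation may need to copy the old file's mode.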
Well, it has to be said that fsync() on ext3 is slow because of an ext3 bug - fsync() is the same as sync() on ext3.
Watch this Heartland Institute video
If the guys writing the FS can't figure out how to properly write a cache that's not the problem of the application writers.
If I save a file via an OS call and the OS tells me it didn't fail then if I can't immediately reread it then the OS is broken.
Data loss from write caching is not a new problem either. Guess this year's crop of programmers can't figure out how to use google to find out about past problems or they just figure they're smarter than everyone else that came before them.
-- Programming with boost is like building a house with lego. It's cool, but I wouldn't want to live in it
Re: "backup old file and write a new one" - A transactional copy-on-write filesystem such as Sun's ZFS is doing almost the same job, transparently.
I have little doubt that copy-on-write will eventually supersede overwrite-and-pray filesystems. The wins are numerous, including cheap snapshotting, etc, etc. Install OpenSolaris and give ZFS a try today!
you had me at #!