Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
He advises that "this is really more of an application design problem more than anything else."
On the Oregon Coast born and raised, On the beach is where I spent most of my days
Don't worry guys, I read the summary this time, and it only affects the German version of ext4.
Newer !== better
"The problem with socialism is eventually you run out of other people's money" - Thatcher.
Real reason for the bug report: Someone's angry and wants his porn back.
Blaming it on the applications is a cop-out. The filesystem is flawed, plain and simple. The journal should not be written so far in advance of the records actually being stored. That is a recipe for disaster, no matter how much you try to explain it away.
The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.
Anyways, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody that develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power-failure.
Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
No problem, just run Firefox and it'll make sure your disks are synch'd all the time ;)
not with ext4, but with xfs. Last month, I created an xfs logical volume and exported it with nfs (with fsync). I chose xfs because this was for large files (videos). After copying a couple files, the xfs volume developed errors and was unrecoverable. I've never seen a file system so fucked up so easily without hardware problems.
If Microsoft hadn't written this crappy code, and they'd used Linux instead, this wouldn't have happened.
It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.
Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.
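As a rough illustration of the O_SYNC suggestion, here is a minimal Python sketch. It assumes a POSIX system where `os.O_SYNC` is available; the helper name `write_synchronously` and the file name `settings.conf` are made up for the example.

```python
import os
import tempfile


def write_synchronously(path, data):
    """Write bytes to path through a descriptor opened with O_SYNC.

    With O_SYNC, each write is expected to return only after the kernel
    has pushed the data to the storage device (assumption: the platform
    and the hardware actually honor the flag).
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_synchronously(os.path.join(tmp, "settings.conf"), b"key=value\n")
```

The trade-off is the one discussed throughout this thread: every write on such a descriptor waits for the device, so this is usually reserved for data that really must survive a crash.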
Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.
Meh, this is crap that happens only when the system crashes, and is pretty much unavoidable if you're doing a lot of caching in memory -- which, coincidentally, is what you need to do to maximize performance. This doesn't sound like the filesystem's "fault" or the application's "fault;" it's just the way things are. Everybody knows that if you don't cleanly unmount, most bets are off.
So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.
But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.
EXT4 aims to replace EXT3 and become the next-gen de facto filesystem on the Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.
It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.
There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.
True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.
What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.
and I don't know what's going on behind the curtain, nor do I care. I can't recall losing any data in such a manner since, well, ever. even given the fat32 and fat16 days. there was that one time I managed to destroy someone's data with doublespace... anyways, the important thing is that I had an onion tied to my belt, which was the style at the time...
Lack of data loss during unexpected power outages or shutdowns was the primary reason people adopted ext3. Journaling was supposed to fix exactly this.
When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion of the overall file system activity, or you might as well run on a ramdisk.
While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.
BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.
News at 11.
True, POSIX says that unless you do an fsync(), the file might never be written to disk before the system crashes. But Whiskey-tango-Foxtrot?
What's wrong with "After a file is closed, it's synced to disk"?!?
Test your net with Netalyzr
If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application's. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples.
Yeah, but it's not just ext4, it's any modern filesystem. If the application writes thousands of individual files (without fsync()) and there is a power failure or system crash then data loss is possible. This isn't ext4's 'fault' any more than it's the application's 'fault'. It isn't a bug or a bad design decision either; it's just how things are.
Your comment scares me. See my comment below. Are you sure you're not me?
No, not really. Journalling is done so that after a crash the filesystem is in a consistent state, and that does *not* include the no-data-loss requirement you are talking about.
We use techniques that show great performance so people can see we beat ext3 and other filesystems.
Oh shit, as a tradeoff we lose more data in case of a crash. But it's not our fault.
Honestly, you cannot eat your cake and have it too.
Ts'o may argue that we can't "have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories", but isn't UNIX philosophy all about having just that? And isn't it the filesystem's job to handle the files? Fix EXT4!
Wait, are you saying the crashing of an alpha level OS could cause data loss? I find this unfathomable.
Lots of small files isn't bad on its own. In fact, it's downright common. Ext4's design does consider this case and makes these operations efficient.
The problem with small files is data consistency. If the application requires a file hierarchy and associated buffers to be on disk before continuing, then a call to fsync() is required (even on ext3). Implicitly syncing on every small file will kill performance, so don't do that.
So is this a new trend to design systems? Make them reliable from top to bottom? Designing an upper-layer part of the system to work around the flaws of a lower-layer system component is often necessary but is not the right thing to do. Telling application developers to change their applications because a new version of the file system breaks their stuff is madness. No matter what POSIX standards say: it worked before, it is broken now: go fix it.
As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.
I thought everybody knew that you don't use a new filesystem until it's stable enough for even Debian to use it as their default.
*No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out except if the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason" - actually most guarantee that the file system will recover itself, not the data). It is universally true - for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.
The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.
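The ordering point can be sketched in Python. This is only an illustration: the two-file layout (a payload plus a small marker file) and the names `write_and_fsync`, `write_in_order`, `payload.dat` and `payload.ok` are assumptions made for the example, not anything taken from the bug report.

```python
import os
import tempfile


def write_and_fsync(path, data):
    """Write bytes to path and block until fsync() reports completion."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may be partial, so keep writing the remainder.
            view = view[os.write(fd, view):]
        os.fsync(fd)
    finally:
        os.close(fd)


def write_in_order(directory, payload):
    """Write the payload first, then a marker that says it is complete.

    The fsync() inside the first call acts as the barrier: the marker
    file is not even created until the payload has been flushed.
    """
    write_and_fsync(os.path.join(directory, "payload.dat"), payload)
    write_and_fsync(os.path.join(directory, "payload.ok"), b"done\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_in_order(tmp, b"important bytes\n")
```

A reader that finds `payload.ok` after a crash can then reasonably treat `payload.dat` as complete. Without the first fsync() the filesystem would be free to persist the two files in either order.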
This is an issue of great sensitivity for databases. See for example:
That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.
-- Sig down
They are.
The only calls that say data is written are the fsync() family (or files opened with O_SYNC).
Those calls do not lie.
Unfortunately, application developers are assuming that other calls say something they do not, in fact, say. This is where the problem comes in. close() does not guarantee that data has been written yet. fsync();close(); does.
The ringing of the division bell has begun... -PF
There's no magic in ZFS Intent Log. It slows down writes greatly, but gives back data integrity. Google "zil_disable" to find a bunch of people that are surprised by the slow write performance of ZFS.
No thanks. Been there. Suffered through that. Give me lots of little files.
DON'T hide them in an all-or-nothing database!
Can you please buy a brain and then return here? I wonder how you even get modded up; the amount of crap you are posting here is just amazing.
Go read the specs or even just man fsync
It has not changed in *ages*.
The fact is that with ext3 delayed writes were only 5 seconds apart, so by *SHEER LUCK* any application that didn't use fsync *MOST OF THE TIME* did not have problems.
Now if you think that a properly written application should keep relying on *LUCK* instead of properly using the POSIX interfaces Linux relies on, then go troll elsewhere. Probably a Visual Basic forum is right about your level of knowledge ...
The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.
The job of the filesystem is to provide system calls whose behavior has been clearly specified. Now point to me where SUS or POSIX or your favorite Unix standard says, e.g., that write(2) ensures data has been flushed to disk upon return.
The write(2) manpage on my Linux system says
And the situation is similar for many other filesystem-related system calls.
If the applications are making wrong assumptions about what the system calls they use provide, they are indeed "poorly written". And the filesystem shouldn't get the blame for that.
Score: i, Imaginary
They use battery-backed cache. The data is written and stored, just not on disk yet. The battery is supposed to last a couple days. If you need to shut a server down for longer than that ... well just don't yank the power cord, perform a clean shutdown.
That is the point: Ext4 greatly increases the delays, and thereby increases the risk of something going wrong. Sure, it is a tradeoff... but it is beginning to appear that Ext4 traded off a bit too much.
"And lets face it: fsync/fdatasync are not really a secret to any competent developer."
I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.
AFAIK, nobody's calling for ext4 as default. Just calling for people to test it. This is a report in a development version of Ubuntu, after all.
I Browse at +4 Flamebait
Open Source Sysadmin
If I understand things correctly, while there is a significant hit when writing lots of small files and fsyncing after each of them, you take a hit except when you're journalling data. But in that case you take a hit when writing big files, since data has to be written twice (first in the journal, then when the journal is flushed).
This is a classic case of bad defaults. Yes, you will always have a trade off between performance and security, but going for either extreme is bad usability!
People expect that, without explicit syncing, the data is safe after a short period of time, measured in seconds. The old defaults were: 5 seconds in ext3; in NTFS, metadata is always flushed and data is flushed ASAP, but with no guarantees. In practice, people don't lose huge amounts of work.
What happened is that the ext4 team thought waiting up to a *minute* to reorder writes is a good idea - choosing for the extreme end of performance.
My question is: WHY? Does it really matter to home users that KDE or Firefox starts 0.005 seconds faster? Apparently, the wait period is long enough to have real-life consequences even with a limited number of testers; imagine what happens when it gets rolled out to everyone. On servers, it's redundant. Data is worth much, much more than anything you hope to gain, and SSDs, battery-backed write caches on controllers and SANs have taken care of fsync() calls already. If you run databases, those sync their disks anyway, so you just traded a huge chunk of reliability for "performance" on stuff like /home, /var/mail and /etc.
The "solution" of mounting the volume with the sync everything flag is just stupid. Yay, let's go for the other extreme - sync every bit moving to the disk. Isn't it already obvious that either extreme is silly?
Just set innodb^W ext4_flush_log_at_trx_commit on something less stupid already, flushing once every second shouldn't kill any disk. Copy Microsoft for config options:
* Disable flush metadata on write -> "This setting improves disk performance, but a power outage or equipment failure might result in data loss".
* Enable "advanced performance" disk write cache -> "Recommended only for disks with a battery backup power supply" etc etc.
* Enable cache stuff in RAM for 60s -> "Just don't do it okay, it's stupid."
This AC is spot on! Well said sir/madam!
"... the standards are particularly clear about what is guaranteed and what is not."
That still does not make it any less of a filesystem limitation! Are we speaking the same language?
"And lets face it: fsync/fdatasync are not really a secret to any competent developer."
I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.
And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.
I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.
AFAIK, nobody's calling for ext4 as default. Just calling for people to test it. This is a report in a development version of Ubuntu, after all.
Ah, ok. Then the story should perhaps have been called "Experimental version of Ubuntu is not totally reliable"?
Read the POSIX specification. Any developer of applications that need on-disk data integrity already knows this! The KDE/GNOME developers (and many others) have just gotten lazy about doing things properly. They now need to fix the Gnome/KDE libraries and applications to NOT do stupid things!
If you don't understand this, you really need to refrain from talking about things you don't know anything about.
Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
Bah. Maybe all computers should come with a single-cell battery, for a couple of minutes of backup power.
As soon as power fails to the system and it resorts to battery, all calls to write() should also call fsync(), even if that slows the system down.
Never mind an option that implicitly calls fsync() if it hasn't been called in the past 3 seconds, for a minimal performance hit. If you have a specific application that doesn't want fsync() then you can disable that feature, but clearly on a consumer box, no UPS, potentially dodgy hardware and drivers, it makes sense. 150 seconds without a sync, just dumping into a buffer for writing ... sheesh.
The problem isn't that a pending file UPDATE is lost, it's that the ORIGINAL is lost, too.
As the 2nd post points out (and I think that's about the only post so far that got it right) the FS shouldn't record the journal update before the actual file update -- else the original file is lost!
Ext3's commit interval was one of its best features.
Sure, it doesn't have to make guarantees when the app doesn't explicitly sync, but losing data 1% of the time in an outage is better than losing data 99% of the time in those cases.
Whenever I saw people complaining of losses in XFS that wouldn't have happened in ext3, the "doesn't have to guarantee unless synced" thing was brought up as an excuse.
Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.
Perhaps you would benefit from reading TFA? Then maybe you would know what we are discussing and see the point after all.
You are an idiot. The design of the POSIX API dictates that fsync (or equivalent) is required to ensure data is flushed to disk. This has been true forever. If an abstraction in an i/o library is not using the API correctly, it is the fault of the library.
You are correct that the user of the abstraction should not care, but you are putting the blame in the wrong place. The whole point of using an abstraction is to hide details such as this. If the library author is too stupid to learn the API he is abstracting that is HIS fault.
If you can't be bothered learning how the API works, then how about you use a library which takes care of it for you?
Just because you're using a high level language doesn't mean you can ignore learning its API.
For example, from the Python docs:
If you're writing applications robustly, this is something you need to be aware of.
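A small sketch of what that awareness looks like in practice, assuming a buffered Python file object. The helper name `durable_write` is hypothetical; `flush()` and `os.fsync()` are the standard calls.

```python
import os
import tempfile


def durable_write(path, text):
    """Write text to path and return only after fsync() has completed."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        # Step 1: move data from Python's userspace buffer to the kernel.
        f.flush()
        # Step 2: ask the kernel to push its cached pages to the device.
        os.fsync(f.fileno())
    # Leaving the with-block closes the file. close() flushes Python's
    # buffer, but on its own it does not force the data onto the disk.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        durable_write(os.path.join(tmp, "notes.txt"), "saved\n")
```

Both steps matter: `flush()` alone only hands the data to the operating system, which may still hold it in its cache for a while.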
Optimize the reads all you want, but those writes better damn well happen before the calls that say data is written return.
And this is where most of the confusion comes from. There is a difference between a logical write and a physical write. When your write call completes, it says the logical write has completed. It says nothing about the physical write. Depending on file system semantics, your physical write may have already completed too, or may complete shortly after. If you must ensure the physical write is complete, then you must ensure it explicitly in code; otherwise the physical write can only be assumed. And this is where the less informed seem confused by their own poor expectations and ignorance. Ignorant expectations aside, unless they are actually following their write with some sort of file system synchronization call, they have no right whatsoever to assume the data will still be there in the face of a system crash. It's a very poor coder who falls into that trap.
Good programmers know this and have known it for tens of years. Good database programmers know this. Good file system developers know this. Those that are outraged by their own ignorance are either not programmers or are not good programmers.
And lastly, I'll point out, which is exactly why Ts'o pointed it out - use a solution whose foundation is built by coders who already understand the proper way to ensure data is safe on the file system - for example, use a database. While I don't consider the use of a database to be an ideal solution here, it does a wonderful job of highlighting the crappy design both KDE and GNOME have used to store configuration data - and how unconcerned they are about data loss and data corruption. If the developers of KDE and GNOME don't give a crap about your configuration data then how on earth can you possibly be upset at the file system for doing what it's supposed to do?
In short, both KDE and GNOME need to give a crap about how, when, and why they write configuration data. Since they don't care about data integrity, you now know who you should be angry at. Here's a hint, and it doesn't have anything to do with the file system.
So what you're saying is: It's not a bug, it's a feature?
Where have I heard that before? Hmm.....
And I asked my buddy who writes *nix disk drivers at a very well-known outfit. He was a little shocked that someone would measure commit time in minutes. He writes mostly RAID drivers now, for server hardware, and thinks that even single-digit seconds is chancing it, even with battery-backed cache (which his hardware does NOT have, BTW). He is of the opinion that this is a terrible mistake, and someone should change these defaults and issue the patch, quietly, so no one gets hurt more than they already have. He says he wouldn't have done what was done, but then again, he spends his days troubleshooting race conditions and interrupt conflicts, what does he know... And he is getting old before his time. I tell him he oughta go into display drivers and save his life...
But this reminds me of the problems of networked drives - delayed writes on Windows servers often lead to corruption and lost data if the network connection broke and then the server borked. Some legendary fiascos I presided over, and very unhappy people who didn't understand the concepts of networking and Microsoft's brain dead implementations. Lots of lost sleep.
So does this also potentially affect NFS and SAMBA shares? Add in the possibility of network connection dropouts, and this sounds worse than ever.
Are we making progress yet?
deleting the extra space after periods so i can stay relevant, yeah.
Much as I love to fall back on the "POSIX says that this could be the case so it is OK that it is the case" excuse, it really does not fly in this case. POSIX doesn't allow this sort of behavior because it is a "good" thing to do, it allows it because there are systems where this is an OK thing to do -- systems intended to manage databases, systems that are heavily verified and have backup power supplies, etc. This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth. EXT4 should not be used in a desktop system if it can cause data loss when the unexpected happens, regardless of the technical merits of writing to small configuration files.
Palm trees and 8
Thank you for playing, but, unfortunately, you are the idiot. You do not understand the purpose of the file-system, what journaling is, nor do you understand proper use of an API.
Please, SHUT THE FUCK UP!
Well, the dev version's slated for release shortly. But even then, ext4 will not be default. On the other hand, I have heard calls to make ext4 default for the next Fedora, whenever that is. Please direct your anger and abuse their way, please ;-)
at least partially. The commit interval of ext4 seems too long to me. The developers seem to have sacrificed reliability for speed.
If you want a quick file system that does not write to disk, use a RAM disk.
If you want persistent data, write to a disk. This should not be a matter of how the apps are written.
Citing from the message Ts'o posted:
----
So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.
----
And indeed, reading the NOTES section of "man -S2 close" shows that it explicitly notes what is not mentioned in the other sections. Up to this day I also lived under the assumption that a close implies an fsync. Now I have to change my programs where it matters. All the idiots who scream here that the OS is doing something wrong: no, it's not. AFAIU it's following the defined behaviour, which is what I expect an OS to do. It should NOT try to magically guess where I forgot to fsync my files.
those writes better damn well happen before the calls that say data is written return.
The manpages agree with what you are saying. The problem is that application developers forget to, or don't want to, use fsync(), and the *sync() functions are the only ones that say anything has been written; they do in fact stop and wait until the data really is written.
If I have been able to see further than others, it is because I bought a pair of binoculars.
and definitely don't hire to write safety-critical software...
"Not an actor, but he plays one on TV."
1) You can adjust your commit interval and 2) Look into PC-BSD/Solaris, ZFS is fairly solid, from what experimentation I've done. Wish Linux could use it properly. Loving the BSD implementation of ZFS.
Yet the whole point of journaling filesystem is to protect against data loss.
No, it isn't!
According to TFA:
Ts'o says that the application should be fixed so it does not write and rewrite small files. He advises that "this is really more of an application design problem more than anything else."
Unbelievable that this guy is the main author of a file system. To paraphrase him:
"My file system is awesome. Just don't expect that when you issue a file write command that the file system will ensure that the file will be written."
Cool.
Where are we going and why are we in a handbasket?
People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.
Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:
Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.
It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.
It's the bad outcomes that we're trying to mitigate by having a spec in the first place!
So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.
The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.
STOP . AMERICA . NOW
Why not just accumulate all disk changes in cached RAM and wait until the next shutdown to sync it all? The maximum time spent writing to the hard disk no matter what the computer did while it was on or how long it was on would then be O(1) (no more than the total size of the disk) and write performance would be astronomical!
Of course, reliability would suffer...
Those that do not are simply not good developers.
Because blaming the user is a strategy that has been used with complete success for generations?
Seriously - who cares whose fault it is? Quit attributing blame, and think about ways to mitigate the symptom of said widespread failure to code nicely.
No, it was about the user's data getting trashed because the application wrote the data to disk in orders that weren't stable if the system crashed for whatever reason, including power loss, and about the change in file system behaviour increasing the potential delay between writes and therefore increasing the risk that the badly written application would lose data by about 30x.
TFA shows examples of applications that work by
and comparing them to applications that rename the old version, write the new version, and do their fsyncs in orders that will always leave the disk with a correct old version, at least until the new version is stably written. In the latter case, if the new version didn't get written, the application can use the old version, and it'll be fine.
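A closely related variant of that safe ordering is to write the new version to a temporary file, fsync it, and only then rename it over the old one. The sketch below shows that variant in Python; it is not code from TFA, the function name `atomic_replace` is made up, and the final directory fsync assumes a POSIX-style system where a directory can be opened and fsynced.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the file at path with data, keeping the old version
    intact until the new one has been flushed."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Swap the new version into place in a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if the rename never happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry as well, so the rename itself is on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "app.conf")
        atomic_replace(target, b"theme=dark\n")
```

If the machine dies before the rename, the old file is still there untouched; if it dies after, the new file is complete. One caveat of this sketch: `mkstemp` creates the file with owner-only permissions, so a real application may want to copy the old file's mode.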
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
High-level languages usually provide buffered stream APIs. All of these APIs contain sync/flush calls. It's bloody obvious that if you have buffers then shit is in memory, and not necessarily where you sent it. That should be the first hint to an incompetent monkey who thinks "Hey my high-level language abstracts me from knowing what the fuck I'm doing!"
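The buffering is easy to observe. Here is a small Python sketch, assuming CPython's default buffered file objects; the helper name `visible_sizes` is hypothetical. It compares the file size the OS reports before and after `flush()`.

```python
import os
import tempfile


def visible_sizes(path, payload):
    """Return the file size the OS reports before and after flush()."""
    with open(path, "wb") as f:
        f.write(payload)
        # A small write normally sits in the userspace buffer, so the
        # operating system has not seen any of it yet.
        before = os.path.getsize(path)
        f.flush()
        # After flush() the kernel has the data. It is still not
        # necessarily on the physical disk; that takes os.fsync().
        after = os.path.getsize(path)
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(visible_sizes(os.path.join(tmp, "demo.bin"), b"hello"))
```

For a payload much smaller than the buffer, the first number is expected to be zero and the second the payload length, which is the "shit is in memory" situation described above, one layer above the kernel's own write cache.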
3laws: No freebies, no backsies, GTFO.
And i was going to ask if this issue had any relation to why KDE 4 (in my Mandriva 2009 Free system) NEVER remembers what opened apps and folders i had open. I NEVER had KDE 3.x "forget" to remember my settings across sessions once i checked the box for it to do so. KDE 4, no matter what i try, keeps returning me to a blank/no previous session items desktop. Making changes in KDE 3 messes around with KDE 4, and that's a shame. Certain settings in KDE 4 are grayed out, and that's annoying.
But, i suppose someone will say my comment is off-topic, or not related. But, thought I'd mention this anyway...
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
No!...it does not work as expected, or did I misunderstand something?
I don't mind the file system losing the last 2 minutes of work, as long as it's left in a consistent state. But if you mean the metadata and data can be committed at different times, which means the file size could change without the actual data going in or vice versa, then it's a big no!
Data and metadata being committed to the disk out of order would also be a big concern.
It has to be consistent, please. Losing 5 minutes of work is OK, but if it loses 50% of the work and keeps the other 50% of the state, then it's really bad.
Could someone please clarify whether the above is the case? If it is, I wonder whether the database applications around are doing the right thing.
And you fix it in GNOME/KDE. All these advocates of ext4 shooting itself in the foot to work around bad practice on the part of the UI developers need to fuck off. They are advocating MS-style back-compatibility kludges... for the fkn filesystem.
Even MS has managed to stop their UI people kludging up their core shit like NTFS.
3laws: No freebies, no backsies, GTFO.
Is there a way to simply change the delay to what it had been in ext3?
If you are a .NET developer, FileOptions.WriteThrough is what you are looking for if you need your shit to get written out to the filesystem right away.
// needs System.IO and System.Security.AccessControl
using (FileStream fWrite = new FileStream("test.txt", FileMode.Create, FileSystemRights.Modify, FileShare.None, 8, FileOptions.WriteThrough)) {
    // do shit....
} // this will be written out
using (FileStream fWrite = new FileStream("test2.txt", FileMode.Create, FileSystemRights.Modify, FileShare.None, 8, FileOptions.None)) {
    // do shit....
    // flush the shit to disk, kinda like fsync(); plain Flush() only hands the data to the OS, Flush(true) (.NET 4 and later) also pushes it to the disk
    fWrite.Flush(true);
}
It's been discussed for about an hour in the corridor at FOSDEM after Ted Ts'o's talk about ext4... :P
Basically he said app writers are to blame for abusing fs-specific behaviour
My core2quad machine with 3 SATA disk RAID runs for about 20 minutes on a tiny APC UPS I bought from newegg for less than $100.
Sure, but that's assuming you can save your work in all open applications without power to your display. Me, I like a UPS with a little more juice so I can reap the fullness of my 52" plasma while cleaning up and shutting down.
I am literally 3000 tokens away from the chaotic crossbow --Stephen
It's at least a very poor implementation decision.
They basically chose performance (and/or being technically correct) over safety.
Create a new fopen flag for delayed write, create a fsync'er daemon, add tunable parameters to (disable) fsync on file close or on app close, I don't care, but as is the "new" behavior is pretty disturbing.
"Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. "
Two things are happening:
(1) KDE is writing a new inode.
(2) KDE is renaming the directory entry for the inode, replacing an existing inode in the process.
KDE never calls fsync(2), so the data from step one is not committed to be on disk. Thus, KDE is atomically replacing the old file with an uncommitted file. If the system crashes before it gets around to writing the data, too bad.
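To make the missing step concrete, here is a rough Python sketch of the same write-then-rename sequence with the fsync added between steps (1) and (2). It assumes a POSIX-like system; `atomic_replace` and the `.tmp-` prefix are names invented for this example, and this is not KDE's actual code.

```python
import os
import tempfile

def atomic_replace(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: new inode, created in the same directory so the rename
    # stays on one filesystem. mkstemp creates it with mode 0600.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # The call the parent says KDE skips: commit the data
            # before the name is switched over.
            os.fsync(f.fileno())
        # Step 2: atomically point the old name at the new inode.
        os.replace(tmp, path)
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Also flush the directory so the rename itself is durable.
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

With the fsync in place, a crash leaves either the complete old file or the complete new file under `path`. The cost is one forced flush per save, which is exactly the performance complaint raised elsewhere in this thread.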
EXT4 isn't "broken" for doing this, as endless people have pointed out. The spec says if you don't call fsync(2) you're taking your chances. In this case, you gambled and lost.
KDE isn't "broken" for doing this unless KDE promised never to leave the disk in an inconsistent state during a crash. That's a hard promise to keep, so I doubt KDE ever made it.
A system crash means loss of data not committed to disk. A system crash frequently means loss of lots of other things, too. Unsaved application data in memory which never even made it to write(2). Process state. Service availability. Jobs. Money. System crashes are bad; this should not be news.
The database suggestion some are making comes from the fact that if you want on-disk consistency *and* good performance, you have to do a lot of implementation work, and do things like batching your updates into calls to write(2) and fsync(2). Otherwise, performance will stink. This is a big part of what databases do.
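A toy illustration of that batching idea, in Python: many small updates are collected in memory and handed to the kernel in one go, followed by a single fsync. `BatchedLog` and its `fsync_calls` counter are hypothetical names for this sketch; a real database does far more (checksums, recovery, concurrent writers).

```python
import os

class BatchedLog:
    """Append-only log that pays for one fsync per batch, not per record."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._pending = []
        self.fsync_calls = 0  # only here so the saving can be observed

    def add(self, record):
        # Nothing touches the disk yet; the record just waits in memory.
        self._pending.append(record + b"\n")

    def commit(self):
        if not self._pending:
            return
        view = memoryview(b"".join(self._pending))
        while len(view) > 0:
            written = os.write(self._fd, view)  # may be a partial write
            view = view[written:]
        os.fsync(self._fd)  # one flush covers the whole batch
        self.fsync_calls += 1
        self._pending.clear()

    def close(self):
        self.commit()
        os.close(self._fd)
```

Records added but not yet committed are lost in a crash, which is the trade being made: the caller decides where the durability points are instead of paying for one on every write.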
As someone else suggested, it's perfectly easy to make writes atomic in most filesystems. Mount with the "sync" option. Write performance will absolutely suck, but you get a never-loses-uncommitted-data filesystem.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Having file system metadata not match file system data is a pretty big bug. Ext3 defaulted to having everything mounted such that the writes to the disk were "ordered" ie: (data=ordered). Ext4 does not force "ordered".
Userland can solve this problem by calling fsync() all over the place, like before every close. However, that completely defeats the purpose of having a buffered write-back file system. If the new rule is to change every userland program to force all data to be flushed to disk after every close, then we might as well mount the filesystem with "-o sync", and flush our performance down the toilet. (Pun intended.)
The problem here is that no call exists to force writes to disk to be "ordered". fsync() is not a substitute for ordered writes to disk. There are just too many ways an application can get into trouble if writes to disk aren't ordered. Having situations where neither the backup file nor the new file is valid is just the beginning of the problems.
I write data acquisition applications that write lots of data in many files to disk. I don't care if my newest file is blank. This "bug" could mean that I have a pseudo-random number of blank files, and they might not even be ordered. My only solution, fsync(), will tank the application's performance, by causing huge amounts of disk activity. fsync() is not a substitute for ordered writes.
POSIX or no, this kind of crap won't cut it. Losing a file or two every few years was acceptable, but zeroing or orphaning hundreds of the most recently written files every time there's a crash is NOT, I don't care whether you're in a home or production environment.
Whenever you are talking anything with some kind of delay, you need to think about what is reasonable for the situation. For something like disk writes, a few seconds is probably the most that is reasonable. It is ok to say that no, a write isn't going to happen right away and put everything on hold, but the expectation is that it will be serviced ASAP, not put off for minutes. That is just waaay too long.
To me this would be like depositing a check at the bank and then not having it show up in your account for two months. Sure, there's always a delay between the deposit and when it posts since it needs to clear, but for that days are reasonable, not months.
Performance for file systems is great and all, but not if it comes at the severe expense of reliability. I just don't see minutes as being an acceptable delay for writes, no matter what the case. You can argue all you like about the theory, the fact of the matter is I don't know of any widely used OS/FS combo that does this. Windows/NTFS doesn't, Linux/EXT3 doesn't, etc.
As an application programmer, one of the more common filesystem operations I want to do is "replace this file atomically; and feel free to delay commit of the replace for power/performance reasons as long as it happens atomically." The POSIX API provides no documented way to express this, so a common POSIX call sequence is used to express this semantic (write-new-file, rename on top of old-file).
The problem is that EXT4 now interprets that common calling sequence which traditionally has useful semantics on most filesystems in a way that is both useless and harmful to data integrity. And furthermore it leaves application programmers no way to express the "atomic replace defer ok" semantics. So in pursuit of filesystem performance EXT4 has broken a performance-optimizing semantic. If applications are changed to fsync when it's completely unnecessary (only sequence preservation is needed), we will all pay the performance cost.
So EXT4 may comply with POSIX, but it does so in a way that is harmful to overall system performance, harmful to data integrity and harmful to performance optimization of application file operations.
As an application developer highly concerned with optimal performance, my response will be to refuse to support EXT4, and to discourage use of EXT4+workaround as it has suboptimal performance. The correct fix is to make EXT4 guarantee to commit the rename after the data write operations, but for performance, it should delay both commits until the next flush interval. If I replace the same file twice within a flush interval, I'd prefer the intermediate version never be written to disk.
Until an "atomic replace" operation is added to POSIX, I want the filesystem to interpret that common sequence of calls with the sensible and rational interpretation.
If a file is closed, queue an fsync within a few seconds. If many files are closed by the same application before the first fsync, bump the SHORT delay each time. This means the last fsync will only be 1 delay period away from the write.
An fsync caused by a close should be less than 2 seconds away.
DONE.
Problem stays OUTSIDE of the application space where it sure as fuck does not belong in the first place.
From what I can see, this was a case of KDE writing to the primary config files in /usr/kde. If this is the case, then it's definitely a screwup by the KDE devs, as there is absolutely no reason for KDE to be writing to those files after it's installed, simply because the base installation has no way of knowing if that directory is mounted read-only (ro), as I do for /usr. If the data loss occurred in the ~/.kde folder, then that indicates a problem with such a long write delay. I'm sorry, but although I use Ext3 with a longer write delay (15 seconds), I also have a UPS connected to the system to reduce the probability of data loss, and I will never extend the cache delay to more than 30 seconds unless it's a laptop.
Mod me up/Mod me down: I wont frown as I've no crown
fflush() just flushes the user mode buffer in FILE to the kernel buffer via _write (WriteFile).
If you want fsync then you need to call _commit (FlushFileBuffers) or call fopen with open mode with a "c" (MS specific) and then call fflush. See documentation for fopen:
const char *mode:
c
Enable the commit flag for the associated filename so that the contents of the file buffer are written directly to disk if either fflush or _flushall is called.
If the disk isn't otherwise busy, why not flush dirty buffers, regardless of their age? This one simple heuristic would minimize a lot of these data-loss issues on forgotten fsyncs in typical desktop usage patterns.
In a typical desktop, there is a lot of activity for a short time, and then the disk sits idle. There is no reason to hold on to the dirty data in the hopes of combining more writes, because the disk isn't doing anything right now anyway. So flush things out. Do them one at a time so that if a sudden burst of traffic comes in it doesn't sit behind a long queue.
If the disk *is* busy, then the usual delayed algorithm applies. And the apps still should use their fsyncs. But the ones that don't, in the typical case of the computer doing a flurry of work and then sitting idle, it'll minimize the exposure should a power failure hit.
The only exception to this heuristic might be for flash drives, where you wouldn't want to re-write the same block soon afterwards.
In Vista and Win7, NTFS supports atomic transactions. With TxNTFS KDE could do all those config file updates as a transaction and have guaranteed atomicity. No need for an extra registry-like database.
How ironic. :-)
Right. Atomic rename is a special case of ordered write, really. Atomic-but-asynchronous-rename is great, but something more powerful would be nice too.
What we really need is a user-level fbarrier. I'm not the first person to think of this syscall.
Also,
When you think about it, that's a very powerful guarantee. (Personally, though, I'd rather have fbarrier.)
All the idiots who scream here that the OS is doing something wrong: no, it's not.
This is called "hiding behind the standard" (a disease very common among kernel developers). Just because the standard doesn't specify behaviour in a certain situation doesn't mean that any behaviour is equally okay. In this case, ext4's behaviour very much hurts the robustness of the system, which is rather important in unreliable environments like laptops.
In this case, what KDE does is certainly not unreasonable (and its developers are certainly not "idiots"). It doesn't overwrite configuration files in place, which would be bad even in the absence of system crashes, as doing it that way is not atomic. Instead it creates a new temporary file, writes the new contents, then renames the temporary file to the old one. This is an atomic operation on Unix: you either see the old contents or the new contents, but nothing in between. Now, the problem is that in case of a crash, ext4 gives you the worst possible outcome by reordering the operations: it will "recover" the rename for you, but not the actual write of the new data. So you end up with a 0-byte file - far from atomic. POSIX of course allows this, but POSIX allows just about anything: that doesn't mean it's reasonable. The only guaranteed solution - use an fsync/fdatasync - is something that almost nobody does because the performance is horrible (ext3 in fact will write the entire journal, IIRC, when doing an fsync() on a single file - this really hurt Firefox 3 performance). So the KDE developers can be excused for not doing that.
It's the job of a modern filesystem to ensure robustness and performance. If you don't use an fsync, you should expect that there is a time window during which transactions might become undone (not the end of the world for configuration files), but they should never be reordered. For instance, this is how Berkeley DB works if you disable fsync: it guarantees ACI but not ACID. For many desktop applications, that's good enough. Destroying every file that has been updated since the last fsync isn't. And your users aren't going to be impressed by the argument that POSIX allows it.
High end disks and modern FS (such as ext{3,4}) support write barriers to enforce ordering in critical sections.
I suggest using high end disks on critical systems.
In databases such as PostgreSQL, nothing is guaranteed to be recorded until a transaction has been committed *and* the DB has replied positively to the commit request. The application should not assume that the operation is successful until then.
There is a nice tuning option in PG (and other high end DBs I suppose) where you can tell it to wait a number of milliseconds on every query so that it has a chance to do just one fsync for several transactions. It slows down sequential operations with no load, so you might want to disable this when doing certain maintenance operations (or you can arrange for those operations to be part of one large transaction instead of several small ones, such as with auto commit). In production and under load, however, this improves overall throughput dramatically.
I hate filesystems with a passion, as far as I'm concerned they all workaround a hardware limitation.
Every hd should come with a tiny little battery, so if you write something to the hd (that hits the buffer) you can still be 100% sure it'll hit the platter.
Would increase performance a lot, especially when dealing with software like databases that syncs a lot. No need for hardware syncs and barriers.
Hell if a new sata standard came out, with that in the spec, and maybe even allowing the hd to use a configurable amount of system memory as a buffer... would be brilliant (yes, I realize that last part mixed with the first part would be considerably more complicated, and require a battery on the system memory).
The old data was presumably synced before. Perhaps it was even written using an excellent editor that fsynced nicely, or perhaps half an hour had passed. But that old data is gone, because of failing to fsync on replacement. If you think that's okay, you shouldn't be programming filesystems. Consider the implications. Totally unnecessary data loss risk. Forced to do an fsync on changing config files, files that may change often and don't really need to be flushed to disk right away as long as the old data does not disappear. Now the thing is, I've read the POSIX spec and it doesn't clearly foresee this particular situation. Which is peculiar, considering the UNIX habit of keeping config data like this. Reading it one way, one could come to the conclusion that ext4 doesn't conform to the spec. Reading it another way, one could come to the conclusion that we need a new spec, let's say POSIX 2. I'm already looking forward to POSIX 3.1. :-)
Most people have a UPS these days anyway (laptops), so data loss due to power failure is very rare.
If you're writing important stuff to disk, using fsync() has been the rule for decades.
The reason it's not the default is because most applications write large amounts of useless junk to disk (caches of network data, scratch space etc.) which makes disk access very slow.
The KDE devs have no excuse and should know better.
...and that is all I have to say about that.
http://jessta.id.au
JUST USE FAT32 OR NTFS
If you just write the file and rename without syncing and CHECKING(!) whether the sync worked you can get into the case where the file does not have what you thought might be in when a crash occurs before the file is completely written. If a crash occurs then you could learn that the rename did not act as a barrier to write. If no crash occurs then things will be as you expect (you won't see no/half the data in the file as the OS can present a "finished" view by showing you its buffers) while the OS is still writing the data to disk.
When you don't check whether your data was synced to disk all bets are off as to what the files you are writing will contain (different filesystems will show different behaviour - e.g. XFS is good at showing applications with this problem much to the chagrin of unwary users). Apps need to either arrange it so that they don't care or do an explicit fsync and check the result before going further (or use O_DIRECT I guess). As a user you can arrange for your filesystem to be mounted in strict "sync" mode (which ensures everything is being written out all the time) but you'll pay a heavy speed price for doing so. I guess users could also force a sync and wait for it to finish before doing any crashes/abrupt losses of power (but this requires future seeing abilities to work every time)...
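Here is roughly what "do an explicit fsync and check the result before going further" can look like in Python, where a failed fsync shows up as an OSError instead of a return code. `write_and_confirm` is a name invented for this sketch, and returning a plain bool is a simplification; real code would probably want to know which step failed.

```python
import os

def write_and_confirm(path, data):
    """Return True only once the kernel reports the data as synced."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError:
        return False
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may be a partial write
            view = view[written:]
        os.fsync(fd)  # raises OSError if the flush fails
        return True
    except OSError:
        # Without a successful fsync, all bets are off about what is on disk.
        return False
    finally:
        os.close(fd)
```

The caller should only go on to the rename (or tell the user "saved") when this returns True.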
I think that most of the comments so far have missed the point. The problem is not caused by delayed flushing of buffers, but in delaying allocating the space for the new (or the data in a truncated) file. ext4 still flushes with a (default) 5s delay, the same as ext3, but it only does so for blocks which have been allocated disk space.
There are a number of reasons for having delayed allocation. First it (together with ext4's use of extents) helps to lower file fragmentation where data is being written 'slowly' (eg when downloading from the internet). Secondly, in situations where files are being used as temporary scratchpads, it can remove the requirement to write the data to disk at all in cases where the file is unlinked before it is committed to disk.
If the guys writing the FS can't figure out how to properly write a cache that's not the problem of the application writers.
If I save a file via an OS call and the OS tells me it didn't fail then if I can't immediately reread it then the OS is broken.
Data loss from write caching is not a new problem either. Guess this year's crop of programmers can't figure out how to use google to find out about past problems or they just figure they're smarter than everyone else that came before them.
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
Some high-level languages (e.g. PHP) have no built-in fsync. Also fsync() is not part of the C standard, it's a POSIX extension. What you have in C is fflush(), but that will not fsync(). So books about programming in C usually don't cover fsync(), as it's not part of the language. I know that sounds like nitpicking, but truth is, if you've learned programming from books about some language, chances are you've never heard about fsync().
Basically, what happens is that you need to understand OS design in order to program in a high-level language, and nobody (at least none of the books) tells you so. This is a WTF on more than one level... either make it part of the language, or make sure it isn't needed.
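The same two layers exist in Python, which makes the distinction easy to show. In this small sketch, `save` and its `durable` parameter are made-up names; `flush()` plays the role of C's fflush() and `os.fsync()` is the POSIX call the books tend to leave out.

```python
import os

def save(path, text, durable=True):
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)   # lands in Python's own userspace buffer
        f.flush()       # like fflush(): userspace buffer -> kernel page cache
        if durable:
            # like fsync(): kernel page cache -> storage device
            os.fsync(f.fileno())
```

With `durable=False` the file reads back fine on a running system, because reads are served from the kernel's cache, but nothing says the bytes have reached the disk yet.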
Data and metadata are never flushed at the same time. To do so would require that both go through the journal, and are committed at the same time. In ext3, data was written *before* metadata, which avoids the problem of empty files. However, you will just get the opposite problem - data being written, but the file size not being adjusted, so part of the new data is lost because the file size wasn't updated.
File content being inconsistent after a crash is not something the filesystem can solve. It requires some kind of transaction support (fsync being kinda half a transaction - commit without begin). Which again means that the application needs to make use of the transaction feature for the data to be safe.
The problem here is applications NOT using fsync. Even if there was better transaction support, they would just not use that in the same way as they are not using what we have now.
The bug is an out-of-order sequencing issue. The application sequence is CreateFile, WriteData, RenameFile. What is actually happening on disk is CreateFile, RenameFile, WriteData. If the crash happens between RenameFile and WriteData, you lose the data written to disk and have a zero length file. This is a filesystem / kernel issue.
The length of time between disk writes exacerbates the problem. sync() forces a write and reduces the window when the filesystem is susceptible, but the bug is still there.
This is a common bug when designing caches, because the sequence of writes of interdependent data must be in-order to maintain integrity.
-- Hiten
Re: "backup old file and write a new one" - A transactional copy-on-write filesystem such as Sun's ZFS is doing almost the same job, transparently.
I have little doubt that copy-on-write will eventually supersede overwrite-and-pray filesystems. The wins are numerous, including cheap snapshotting, etc, etc. Install OpenSolaris and give ZFS a try today!
you had me at #!
I do mostly app development, not system, but, as I understand it, many apps including KDE and Gnome are doing a bunch of small truncate-and-writes in style (a) or (b), presumably because style (c) would be too expensive due to the overhead of fsync().
Am I missing something, or couldn't they just do the writes in style (c), except not do the fsync() each time, but rather call fsync() every five seconds or so in a separate thread? Wouldn't that allow for the reasonably fast writes without the risk of corruption?
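One way to read that suggestion, sketched in Python under some loud assumptions: writes go to temporary files right away, and a periodic `flush()` fsyncs the whole batch and only then renames the files into place. That is not quite what the parent describes (here the rename is deferred too, so the ordering problem cannot occur), and `DeferredReplacer` with its `.pending-` prefix is a hypothetical name. It is not thread-safe; a timer thread calling `flush()` every few seconds would need a lock around both methods.

```python
import os
import tempfile

class DeferredReplacer:
    def __init__(self):
        self._pending = {}  # final path -> temp file holding the new contents

    def write(self, path, data):
        path = os.path.abspath(path)
        superseded = self._pending.pop(path, None)
        if superseded is not None:
            os.unlink(superseded)  # intermediate version never reaches `path`
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), prefix=".pending-")
        with os.fdopen(fd, "wb") as f:
            f.write(data)  # cheap: no fsync here
        self._pending[path] = tmp

    def flush(self):
        pending, self._pending = self._pending, {}
        # First make every new file's data durable...
        for tmp in pending.values():
            fd = os.open(tmp, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        # ...and only then switch the names over.
        dirs = set()
        for path, tmp in pending.items():
            os.replace(tmp, path)
            dirs.add(os.path.dirname(path))
        for d in dirs:
            dfd = os.open(d, os.O_RDONLY)
            try:
                os.fsync(dfd)
            finally:
                os.close(dfd)
        return len(pending)
```

The catch is that other programs keep seeing the old contents until the next `flush()`, so this only suits data like session state where a few seconds of staleness is acceptable.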
Nonaggression works!
A filesystem that takes something to the extreme will hunt down and kill bugs in programs that make assumptions.
This is why porting a program to different OS's, trying it out on different architectures is great.
I've run into this kind of bug before when you write to a file and expect the file to be there right away. It worked on one setup where the fileserver was the same as the application server, but when we moved the appserver it started failing. NFS didn't report the file there right away, it took a little while.
The fewer assumptions, the better the software.
In fairness to XFS, they finally accepted that binary NULLs were a problem and fixed it in the spring of 2007.
OptiPNG apparently doesn't care about my PNG files either, then. Firefox doesn't care enough about my downloads to write them fully to disk before saying the download is done; 'tar z' doesn't care enough about my backups to write them fully to disk before I can use the backup tarball, etc.
And this is where I state that the programs a user uses do not know the intent of said user in all cases. Imagine if the 'tar' utility called fsync on each file when I restored a .tar.gz file containing 1500 small files. The disk would thrash, unless there was some sort of read-ahead done on the .tar.gz before... but then the filesystem metadata for the extracted files would need to be written too, which means that the disk would thrash on writes alone, never mind interwoven with reads.
However, fsyncing a zip file which I'm only creating to send over my LAN and then deleting places unnecessary strain on my hard drive.
Azureus, the well-known Java BitTorrent client, does fsync calls (actually via Java's FileChannel.force(), but that's another story), and I hate that. My connection is liable to filling up the hard drive's seek queue due to metadata updates while downloading, thereby giving less I/O time to other applications and starving them. I would rather see it fsync once at the end of the download, before the hash check, or do data-only fsyncs that need to seek less. I don't care that the file's last-modification time is wrong while I'm downloading.
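What I have in mind looks roughly like this Python sketch (not how Azureus actually works): pieces are written at their offsets as they arrive, and a single data-only sync happens at the end, before any hash check. `store_pieces` and its parameters are invented for the example, and it assumes a Unix-like system for `os.pwrite`; where `os.fdatasync` is missing it falls back to `os.fsync`.

```python
import os

# fdatasync skips metadata that isn't needed to read the data back
# (such as the modification time); fall back to fsync where it is absent.
_datasync = getattr(os, "fdatasync", os.fsync)

def store_pieces(path, pieces, piece_size):
    """Write (index, data) pieces in arrival order, then sync once."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for index, data in pieces:
            offset = index * piece_size
            view = memoryview(data)
            while len(view) > 0:
                written = os.pwrite(fd, view, offset)  # may be partial
                view = view[written:]
                offset += written
        _datasync(fd)  # one sync for the whole download
    finally:
        os.close(fd)
```

A crash mid-download can lose pieces, but a BitTorrent client can re-verify hashes and fetch the missing ones again, so the per-piece fsync buys very little.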
If all programs will now start to fsync files because of this POSIX rule and the ext4 filesystem, then I will use laptop_mode even on my desktop, because it drops fsyncs to delay writes up to its configured interval. The last thing I need as a desktop user is GNOME or KDE starting slower, which it will if it takes Ts'o's advice to heart... No more grouping writes across these hundreds of files!
The infamous XFS binary NULLs problem was fixed in 2007.
It *was* a problem, despite the XFS developers saying before 2007 exactly what the ext4 developers are saying now: "We're following spec, so it's your problem if you lose data."
Sooner or later, ext4 will be fixed, just like XFS was, once the developers realize that "omg my data is gone" is filesystem publicity death, no matter how on-spec they are.
You might be interested in this whitepaper from Intel. What they find is that the Windows CIFS client write pattern creates serious fragmentation problems for ext3. The problems are mitigated (though probably not completely solved) in XFS precisely by what you mention - delayed allocation.
Exactly. If I had mod points, I'd give them to you.
Basically, what happens is that you need to understand OS design in order to program in a high-level language, and nobody (at least none of the books) tells you so. This is a WTF on more than one level... either make it part of the language, or make sure it isn't needed.
Maybe this is why some computer scientists write software: They typically have a mandatory OS course and were taught these things. Still a lot out there that do not remember.
Face it: Writing good code is hard. Most code-writers cannot do it. Blaming the language for it is the wrong approach, as the language cannot fix all problems. Simply not possible. Well-written language books will also tell you that you need to understand the system you are programming for, not only the language.
Probably the line between a programmer and a software engineer runs somewhere here, between those that do understand the environment they are creating software in and those that do not.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
There are some very intelligent posters above who have pointed out that the desired result is not achievable with EXT4. fsync fixes the symptom but is NOT the solution! fbarrier is also not right. And POSIX was not written by omnipotent oracles and can be ignored when it is obviously wrong. EXT4 is broken and needs to be fixed!
I'm going to try my own explanation:
If a program starts the call rename(A,B) and then the system crashes, and you reboot and you look at B (not A!!!!), you should either see the contents of A and A does not exist, or the contents of B and YOU DO NOT CARE WHAT IS IN A!!!!!
However as EXT4 is implemented currently, you can get another result: B will contain a partially-written version of A (typically an empty file) and A does not exist. This is VERY BAD, as most likely there was interesting information in B that was copied to A but now both are lost.
fsync as suggested will ensure that the contents of A are correct before the rename(). This is a stronger requirement. It is also dreadfully slow because it has to flush the disk.
fbarrier would be similar. Not as strong as fsync, but it does guarantee that the contents of A are correct before the rename() is committed to disk.
As a side effect of A being correct, fsync and fbarrier reduce the possible results in B of the rename on EXT4 to the desired set, so it does fix the symptoms.
However there are problems with this. By far the most important one is that there are probably ten million programs (including ones written in scripting languages) that assume rename() works as stated above and that it is IMPOSSIBLE to add the fsync/fbarrier call to every one of them. And fsync is dreadfully slow as it really forces the disk to be written right then. fbarrier is better but it is enforcing a slightly stricter limit and thus can still be slower for no actual useful gain.
A pile of locking options on files is not what is needed, despite Windows and VMS doing this. The Unix designers got it RIGHT, 30 years ago, with a correct and very small set of file operations that do what is really wanted (though I would add some sort of atomic create where the file does not appear until closed). Modern designers in OSS and at Microsoft should start showing a little humility and stop throwing all their highschool-level ideas into the designs without some research.
Hey, this situation reminds me of reiserfs. That filesystem also used to just randomly zero out some data when it crashed while writing.
I find it pretty convenient that ext3 leaves files in some "useful" state even when it crashes in the middle of a write.
I guess the POSIX lawyers haven't defined exactly what "useful" is yet, and thus this feature is about to get lost in the next release of the filesystem. Too bad, actually...
"It's a consequence of not writing software properly."
When I write a file, I expect it to be stored quickly and reliably. Any operating system that doesn't do that is faulty. It's nice if an operating system manages to get a bit of extra performance through some clever caching, but that is secondary.
POSIX implies that rename() (well actually link()) is atomic. This breaks that assumption, as far as most programs are concerned.
Yes, you can redefine what "atomic" means in order to somehow imply that EXT4 is obeying it. I mean, we could say that each letter is individually changed, and thus that it is OK if the crash leaves only the first letter of the file name changed.
This violates POSIX for all practical understandings of the text. It has nothing to do with write(), it is rename()/link() that is at fault.
I cite from ANSI C:
----
The fclose function causes the stream pointed to by stream to be
flushed and the associated file to be closed. Any unwritten buffered
data for the stream are delivered to the host environment to be
written to the file; any unread buffered data are discarded. The
stream is disassociated from the file. If the associated buffer was
automatically allocated, it is deallocated.
----
As you see, ANSI C says only 'delivered to the host environment to be
written to the file', not 'return once the host environment has completed the write'.
Well, again something I did not realize before. I was always under the impression that an fflush also does the underlying synchronization (however, I usually don't use streams, because I am aware that the additional buffering is unnecessary, and using unneeded libs is always a source of error). But the documentation says, sad as I am about it, that it is impossible to write a program in ANSI C that can determine at any point during its runtime whether a specific file has been written to the disk. While this makes me strongly doubt what exactly the people writing the standard were thinking, it is hardly the fault of the file system if a standard library that is built on system calls, which are themselves implemented *correctly* according to another standard, has undefined behaviour in a certain situation...
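To make the flush-versus-sync distinction concrete, here is a minimal sketch in Python rather than C (Python's file objects buffer in user space much like C streams do). The helper name write_durably and the file name settings.conf are made up for illustration. It assumes a POSIX-like system, and even fsync can only request the write; whether the drive's own cache honours it is outside the program's control.

```python
import os
import tempfile

def write_durably(path, data):
    """Write bytes to path, then ask the OS to push them to the device."""
    with open(path, "wb") as f:
        f.write(data)          # data may still sit in the user-space buffer
        f.flush()              # like fflush: hands the data to the OS cache
        os.fsync(f.fileno())   # like POSIX fsync: asks the OS to write it out

# Hypothetical usage in a scratch directory.
scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "settings.conf")
write_durably(target, b"theme=dark\n")
```

Without the os.fsync line the function still returns normally and the file reads back fine while the system is up; the difference only shows after a crash or power loss.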
And as hard as your mental trauma with XFS may be, I don't believe that it contributes to this discussion, beyond the point that new filesystems should be treated with care if you don't need them. I started to use ext3 1.5 years after it entered the stable kernel, and I don't plan to use ext4 before 2010.
"Telling application developers to use a database is bullshit."
I'm not telling application developers to use a database; I'm explaining what's driving a remark others have made. Application developers can use whatever suits their need. If a database is what they want, then sure, use one. If something else is better, use that.
"A open-write-close-rename sequence merely asks for atomicity without durability, something that's perfectly reasonable."
You may think it's perfectly reasonable, but you're asking for atomicity across multiple operations. So really what you want is transactions. To the best of my knowledge, neither Linux, nor Windows, nor Mac OS X, nor any of the BSDs, offer transactions in the filesystem layer. I've always thought such would be a good idea, but I don't think it exists.
Further, even if filesystem transactions did exist, the application would have to request it. There's no way for the OS to magically divine what an application considers a filesystem transaction to be; the application has to tell it. So the order of operations would need to be begin-open-write-close-rename-commit.
"all the application wants is for either the old version of a file or the entire new version to appear on a reboot"
Then the application should call fsync on the new file before removing the old one. That's the only mechanism the POSIX specification provides to guarantee something has been committed to disk. It may be more than the application really wants or needs, but it's all POSIX provides. One can argue POSIX should do more, of course. More on that below.
"He doesn't care at the instant of the rename whether that replacement has been recorded on disk ..."
Actually, yes he does, because the operation he's requesting is to destroy the only known-good file. It's not the OS's fault that the programmer didn't actually make sure his new copy was good before he destroyed the old one.
The programmer may have intended for the OS to make sure the new copy was good, but he never asked it to do so (i.e., with fsync).
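For reference, the open-write-fsync-close-rename sequence being argued about can be sketched like this in Python. This is only an illustration under assumptions: the function name replace_file_safely and the file name app.conf are invented here, the code assumes a POSIX system where a directory can be opened and fsync'ed, and the temporary file gets mkstemp's default permissions rather than those of the file it replaces.

```python
import os
import tempfile

def replace_file_safely(path, data):
    """Replace path with data so a crash leaves either the old or new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory, so rename() stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # new contents requested onto disk first
        os.rename(tmp_path, path)    # only now is the old file given up
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also sync the directory so the rename itself is recorded (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

# Hypothetical usage.
demo_dir = tempfile.mkdtemp()
config = os.path.join(demo_dir, "app.conf")
replace_file_safely(config, b"version=1\n")
replace_file_safely(config, b"version=2\n")
```

Dropping the os.fsync call on the temp file gives the open-write-close-rename variant: still atomic with respect to other running processes, but with no request that the new data reach the disk before the old name is replaced.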
"asking for that same durability in a multi-file configuration setup is just stupidly degrading performance."
So, barring new system calls for filesystem transactions, what should the filesystem do, then? Serialize all I/O operations? Now you're destroying the I/O scheduler and killing multitasking performance.
Maybe there's another option here that I'm not seeing.
"open-write-close-rename is saying something fundamentally different from open-write-fsync-close-rename"
Yes, one is safe, the other is unsafe.
I think the problem here is you're implying semantics which don't actually exist in the OS or its interface specification. Programming by "gee I really wish things worked this way" is a bad way to do things.
Now, maybe you want to make the argument that the OS should provide transactions. I'd even agree with you. But one doesn't write code based on a feature request; you write code based on what the system actually does.
Or am I missing something else?
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
"POSIX may allow it, but I was under the impression that filesystems should try and remain in a sane state."
You're asking for all I/O operations to be done serially. Linux doesn't do this today, and I don't think it has for more than a decade. Most OSes don't do this. The reason is performance. If you've got a bunch of writes to do in one part of the disk, you do them all there, and then do all the other writes for another part of the disk. Thus writes can be done out-of-order. This is called "I/O scheduling" or "elevator algorithm". If you've got multiple tasks doing serious I/O to the disk, you really want it.
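A toy sketch of the elevator idea, in Python: pending writes are identified by made-up block numbers, and the scheduler services everything ahead of the head in one sweep before coming back for the rest. The names elevator_order and head_travel are invented for illustration; real kernel I/O schedulers are far more involved than this.

```python
def elevator_order(pending_blocks, head_position):
    """Service blocks at or above the head going up, then the rest going down."""
    ahead = sorted(b for b in pending_blocks if b >= head_position)
    behind = sorted((b for b in pending_blocks if b < head_position),
                    reverse=True)
    return ahead + behind

def head_travel(order, head_position):
    """Total distance the head moves when servicing blocks in this order."""
    distance = 0
    for block in order:
        distance += abs(block - head_position)
        head_position = block
    return distance

arrival_order = [95, 12, 60, 11, 98, 58]   # order the applications issued them
service_order = elevator_order(arrival_order, head_position=50)
# service_order is [58, 60, 95, 98, 12, 11]
```

The point for this thread is that the order writes reach the platter need not match the order the application issued them, which is exactly why a later rename can land before the earlier data.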
If you want a way for an application to request a group of operations to be done atomically, that's called a transaction. I wrote about that in my cousin post.
This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth.
First off, PC hardware is not robust against sudden power-losses: it's literally possible for a write to be `half-done' inside the HDD, and no amount of higher-level `protection' can do anything about that.
Secondly, the atomicity of rename() (or any other operation) isn't contagious: rename() does an unlink() and a link() in `one operation' (and that's the only amount of atomicity rename() claims--it's even specifically not atomic in that it's possible to see the original link and the new link at the same time), but that doesn't have any impact on--or even any relation to--other `nearby' operations.
write() and rename() aren't even operating on the same object--write() is operating on the file, and rename() is operating on the directory (or directories that do/will contain links to the file). I don't quite get the `metadata' arguments, because `filenames' aren't `file metadata'--they're directory data, which is what allows you to have any number of links to the same file from any number of directories. File-metadata are things like timestamps, ownership, permissions....
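A quick way to see that names belong to directories and not to the file, sketched in Python with made-up file names, assuming a filesystem that supports hard links:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
first_name = os.path.join(workdir, "report.txt")
second_name = os.path.join(workdir, "report-alias.txt")

with open(first_name, "w") as f:
    f.write("hello\n")

# Add a second directory entry pointing at the same file.
os.link(first_name, second_name)

first_stat = os.stat(first_name)
second_stat = os.stat(second_name)
same_file = (first_stat.st_ino == second_stat.st_ino
             and first_stat.st_dev == second_stat.st_dev)
link_count = first_stat.st_nlink   # number of names, stored with the file

# Removing one name only edits the directory; the file's data is untouched.
os.unlink(first_name)
with open(second_name) as f:
    remaining = f.read()
```

Both names resolve to one inode, and removing one name leaves the contents readable through the other.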
Lastly, having said all that: the reason that we go through the write-close-rename sequence is to prevent a race-condition while the system is running, and (to a lesser extent) to guard against failure of the acting process itself, not failure of the system as a whole.
-rozzin.
The apps are doing this:
1. Open the file.
2. Truncate the file to zero length (O_TRUNC).
3. Write data to the file.
Writing steps two and three to the disk is not generally bunched. It's certainly not an atomic operation. The fact that this ever worked is nigh miraculous.
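The three steps above look roughly like this in Python (the file name kderc is just a stand-in). The sketch only shows the window on a running system: right after the open with O_TRUNC the file is already empty, and the new data arrives in a separate step.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "kderc")   # hypothetical config file

with open(path, "wb") as f:
    f.write(b"old settings\n")

# Steps 1 and 2: open with O_TRUNC, which throws the old contents away at once.
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
size_after_open = os.path.getsize(path)    # old data is already gone here

# Step 3: the new contents are written as a separate operation.
os.write(fd, b"new settings\n")
os.close(fd)
size_after_write = os.path.getsize(path)
```

If the machine dies between those two operations reaching the disk, a zero-length file is what is left behind.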
I'm reminded of the transition from bash to dash for the default /bin/sh in Ubuntu; people relied on nonstandard behavior for convenience, and when that was taken away, dash was blamed, people were going to Leave! Linux! Forever!, and so on.
(This example was extraordinarily poorly handled; it should have been done like Debian Lenny did it, with a lot of lead time and making sure that everything worked as it should.)
Of course ext4 shouldn't be released without the workaround, but applications need to actually handle their I/O, not chuck a bunch of stuff at the disk and act surprised when it's not guaranteed to be properly transaction-y. If this is "fixed" in the filesystem (as the current patches do), the fix works by making the entire filesystem be careful about what gets written to disk immediately. The filesystem can't know what's vital to write atomically; the app must tell it.
Laws do not persuade just because they threaten. --Seneca
Umm.. Isn't the entire idea of RAID that if a disk fails in your array, it does not cause catastrophic failure?
Every time I've ever had a disk failure, I find out about it in an email, and think to myself, "Hmmm... really should get to the store soon to buy a new HD, eh? ..."
Certainly never had to wipe an array as a result of one measly disk failure. A single disk failure should never be an emergency.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock