Ext4 Data Losses Explained, Worked Around
ddfall writes "H-Online has a follow-up on the Ext4 file system — Last week's news about data loss with the Linux Ext4 file system is explained and new solutions have been provided by Ted Ts'o to allow Ext4 to behave more like Ext3."
User: My data, it's gone!
EXT4:"Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations."
Solution: WORKS AS DESIGNED
Sent from your iPad.
FTFA, this is the problem:
Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.
And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? Before you get to write any filesystem code, you should have to study how other people have done it, including all the change history. Seriously.
Those who fail to learn the lessons of [change] history are doomed to repeat it.
My blog
Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.
I couldn't disagree more:
When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename(). [...] Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until [up to 60 seconds later].
Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write. It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it. The workaround is laughable -- "call fsync(), and then wait(), wait(), wait(), for the Wizard to see you." How about writing a filesystem that actually does journaling in a reliable fashion, instead of finger-pointing after the user loses data due to your snazzy new optimization and saying "The developer did it! It wasn't us, honest." Microsoft does it and we tar and feather them, but the guys making the "latest and greatest" Linux feature, we salute them?
We let our own off with heinous mistakes, while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.
#fuckbeta #iamslashdot #dicemustdie
That's General Ts'o to you!
This post climbed Mt. Washington.
I wish to suggest that this is the immediate solution. The complete solution involves a truckload of pissed-off users storming a POSIX committee meeting and bashing the committee members over the head with clue sticks.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
This is the problem with new features - the users have problems using them until they fully understand and appreciate the advantages and disadvantages.
And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2. If you want a really stupid filesystem go FAT and prepare for a patent attack.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
If an application decides to check the name of the file system and if the name is "ext4" it erases everything in your home directory, should that be considered a file system bug too?
...does that make it ext4-, ext3.99, ext4less?
Is it just me, or would you expect that the change would only be committed once the data was written to disk under all circumstances?
To me, it sounds like somebody screwed up a part of the POSIX specification. I should look for the line that says "During a crash, lose the user's recently changed file data and wipe out the old data too."
IMarv
Trusting software vendors is no smarter than trus
The workaround (flushing everything to disk before the rename) is a disaster for laptops or anything else which might wish to spin down a disk drive.
The write-replace idiom is used when a program is updating a file and can tolerate the update being lost in a crash, but wants either the old or the new to be intact and uncorrupted. The proposed sync solution accomplishes this, but at the cost of spinning up the drive and writing the blocks at each write-replace. How often does your browser update a file while you surf? Every cache entry? Every history entry? What about your music player? Desktop manager? All of these will spin up your disk drive.
Hiding behind POSIX is not the solution. There needs to be a solution that supports write-replace without spinning up the disk drive.
The ext4 people have kindly illuminated the problem. Now it is time to define a solution. Maybe it will be some sort of barrier logic, maybe a new kind of sync syscall. But it needs to be done.
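For reference, the sync-then-rename version of the write-replace idiom described above looks roughly like this. It is a minimal Python sketch, not anything taken from TFA: the function name `replace_file_durably` and the `.tmp-` prefix are made up for illustration, and the final directory fsync is an extra step that is assumed to be wanted (and to work) on Linux.

```python
import os
import tempfile


def replace_file_durably(path, data, sync_directory=True):
    """Replace `path` with `data` so a crash leaves the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target, because
    # rename() is only atomic within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # user-space buffer -> kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the data out
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if sync_directory:
        # Assumption: fsync on a directory descriptor persists the new
        # directory entry; this works on Linux.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The fsync before the rename is exactly the step that spins up the disk, which is the cost this comment is complaining about. Dropping it brings back the ordering problem under delayed allocation.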
but if you want a write later file system shouldn't it be restricted to hardware that can preserve it?
I understand that doing writes immediately when requested leads to performance degradation but that is why business systems which defer writes to disk only do so when the hardware can guarantee it. In other words, we have a battery backed cache, if the battery is low or nearing end of life the cache is turned off and all writes are made when the data changes.
Trying to make performance gains to overcome limitations of the hardware never wins out.
* Winners compare their achievements to their goals, losers compare theirs to that of others.
That is the issue. Ext3 generally gives me a consistent previous point in time in power failure or crash. I would expect ext4 to too. I used XFS and had a power cable get yanked accidentally in the middle of a project. Everything was gone. I immediately dumped XFS over this.
This is unacceptable behavior. Open files should not be zeroed by design. They should be at the last point in time. I understand HW issues of a power failure, but that is different than it doing it on purpose. Any system dev. that thinks it's acceptable is a fool.
This problem is just something that slipped through the cracks and I'm sure the originator of this bug is kicking himself in the ass for being so "stupid".
Rubbish. Sorry, if the syncs were implicit, app developers would just be demanding a way to turn them off most of the time because they were killing performance.
Yes. All new kernel features should do anything it takes to ensure they work with popular applications. If a new kernel feature breaks an application, even if it is because the developers made incorrect assumptions about how things work, then the new kernel feature should be discarded. This is simple common sense, and something that even Microsoft gets right.
And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2.
QFT
The filesystem was first released sometime towards the end of December 2008. The Linux distros that incorporated it, gave it as an option, but the default for /root and /home was always EXT3.
In addition, this problem is not a week old like the article states. People have been discussing this problem on forums ever since mid-January, when the benchmarks for EXT4 were published and several people decided to try it out to see how it fares. I have been using EXT4 for my /root partition since January. Fortunately I haven't had any data loss, but if I do end up losing some data, I'd understand that since I have been using a brand new file-system which has not been thoroughly tested by users, nor has it been used on any servers that I know of.
Face your daemons!
Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.
Mr Ts'o is mistaken about this. When he introduces optimisation features that other filesystems (Reiser, for example) have already tried and undone because they don't work, he is not fit to write filing systems. First learn how others did it, then do it better.
With Ext4 now proven unstable, the only viable new filesystem is ZFS. Or just stick with ext3 or UFS.
Before you get to write any filesystem code, you should have to study how other people have done it...
No. Being innovative means being original, and that means taking new and different paths. Once you have seen somebody else's path, it is difficult to go out on your own original path. That is why there are alpha and beta stages to a project, so that watchful eyes can find the mistakes that you will undoubtedly make, even those that have been made before you.
It is dangerous to be right when the government is wrong.
Pwn2Own 2009 Day 1 - Safari, Internet Explorer, and Firefox Taken Down by Four Zero-Day Exploits
Charlie Miller got the luck of the draw, and had the first time slot for the browser competition. His target- Safari on Mac OS X. Before I could even pull my camera out, it was over within 2 minutes- and Charlie (coincidentally also last year's first winner of the day) is now the proud owner of yet another MacBook, and $5,000 from the Zero Day Initiative.
Next up, Nils. Just Nils- you know, like "Prince" or "Madonna". With a little tweaking, he ran a sleek exploit against IE8, defying Microsoft's latest built in protection technologies- DEP (Data Execution Prevention) as well as ASLR (Address Space Layout Randomization) to take home the Sony Vaio and $5,000 from ZDI.
davecb5620@gmail.com
If you mount your ext4 partitions with nodelalloc you should be fine. You will of course no longer benefit from the performance enhancements that delayed allocation bring, but at least you'll have all of your freaking data. I'm running Debian on Linux 2.6.29-rc8-git4, and so far my limited testing has shown this to be very effective.
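A quick way to see which ext4 mounts are still running with delayed allocation is to look at the mount table. A rough Python sketch follows; the helper name `ext4_mounts_with_delalloc` is hypothetical, and it assumes the kernel lists `nodelalloc` among the options in `/proc/mounts` when the option is set.

```python
def ext4_mounts_with_delalloc(mount_table):
    """Return mount points of ext4 filesystems not mounted with nodelalloc."""
    result = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        # Options are a comma-separated list such as "rw,relatime,nodelalloc".
        if fs_type == "ext4" and "nodelalloc" not in options.split(","):
            result.append(mount_point)
    return result
```

To use it, read `/proc/mounts` and pass the text in, e.g. `ext4_mounts_with_delalloc(open("/proc/mounts").read())`. Anything it returns is still using delayed allocation.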
Never eat more than you can lift -- Miss Piggy
Standing on the shoulders of giants is usually the best way to make progress.
Stick Men
Someone above says that the POSIX standard is fine, but that ext4 violates it. Here is his quote:
"When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename()."
It seems that ext4 renames the file first, and then writes the file up to 60 seconds later.
All the stuff with Ext4 strikes me as amazingly arrogant, and ignorant of the past. The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing. In the case of a file system, that means that it reliably stores data on the drive. So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec" what matters is that it isn't usable.
I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.
I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not. Linux/Ext3 doesn't, Windows/NTFS doesn't, OS-X/HFS+ doesn't, Solaris/ZFS doesn't, etc. Well that tells me something. That says that the way they are doing things isn't a good idea. If it is causing problems AND it is something else nobody else does, then probably you ought not do it.
This is just bad design, in my opinion.
The funny thing is Theodore claimed "all modern filesystems" suffered from this issue, when in reality, ZFS and others do not :-)
Making the same mistakes someone else made is NOT being innovative, it's being stupid or ignorant... or a number of other predicate adjectives.
Innovation is using something in a new way, not making the same mistake in a new way. That's still considered a mistake, and if it can be shown that you should have known about the mistake from someone else making it, you're still "making the same mistake" and not "innovating." Not to say you're not going to make mistakes and not know everything, but it's still a valid criticism.
Any system dev. that thinks it's acceptable is a fool.
Yes, fools are the ones who actually understand the POSIX specification and plan accordingly. Those foolish admins who experience excellent performance and no data-loss. Those fools!
Surely some day they will see the error of their ways by refusing to understand the job for which they are paid! Damn them! Damn their intelligence! Damn their comprehension abilities. Damn them to hell!
This is the problem with new features - the users have problems using them until they fully understand and appreciate the advantages and disadvantages.
Advantages: Filesystem benchmarks improve. Real performance... I guess that improves, too. Does anybody know?
Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.
Ext4 might be great for servers (where crucial data is stored in databases, which are presumably written by storage experts who read the Posix spec), but what is the rationale for using it on the desktop? Ext4 has been coming for years, and everyone assumed it was the natural successor to ext3 for *all* contexts where ext3 is used, including desktops. I hope distros don't start using or recommending ext4 by default until they figure out how to configure it for safe usage on the desktop. (That will happen long before the apps are rewritten.) Filesystem benchmarks be damned.
As explained in the article - he hasn't made a mistake. The behaviour of ext4 is perfectly compatible with the POSIX standard.
man fsync
If an application decides to check the name of the file system and if the name is "ext4" it erases everything in your home directory, should that be considered a file system bug too?
No, I'd call that malice.
"Most people, I think, don't even know what a rootkit is, so why should they care about it?"
Read the comment. A sys dev who believes it is acceptable to go from a system like ext3, where after a "crash" you generally have either the old data or the new data, to ext4 (whiz-bang new and improved), where you have no data at all, is a fool. It is a regression. You want to set it as a laptop mode? Fine. Better give warnings, though.
Sorry, not as a default behavior. This is the difference between theory and practice. Also called the REAL WORLD. I can't guarantee that every app works according to spec. It seems there is some debate over whether POSIX addresses this.
What is at issue is that ext3 was very good in this respect and ext4 not so much. This is a step backward for the vast majority of systems, esp. servers and desktops.
I don't care what the excuse is. If I have a crash or power cable failure etc., I expect that the FS hasn't trashed a bunch of open files; at least it's not DESIGNED to.
You sir are an idiot.
You need to look at your competition or those you are following and see what they've done so you don't repeat their mistakes. Then you can ask yourself "Do we need to look at that too?" or "Do we need to change that too?"
If they had spent just a couple of hours reviewing the change logs of those file systems, this probably would never have happened, as it might have been fixed long ago along with whatever else is new and extremely immature with EXT4.
Even if you are creating something new, or think you are (EXT4 isn't something new, it's just another file system so people creating file systems need to review the history of all other file systems, whether what you are doing is "new" or not). You still need to look at history. You don't need to go through their code in this case with a fine tooth comb and pick it apart but just reviewing what you or your competition has done or changed in the past will make your product a better product.
ZFS isn't all that viable for Linux users. ZFS-FUSE is too slow.
With that said, I think someone should just go ahead and put ZFS in the Linux kernel and release a patch only. This will get around the GPL issues. All it would mean is that you couldn't redistribute a kernel binary or source with ZFS stuff in it. Anyone wanting ZFS would have to patch and compile their own kernel, not that big a deal. If it's internal use only then GPL is compatible with the ZFS license.
Personally I have lost a lot of data with all the ext filesystems (and Reiser3 too). I still use it on OS and boot partitions but all my important big data partitions are XFS. I have run for years on failing hardware with XFS. I have never lost data with XFS except for the sectors that were physically damaged and even then I never lost anything important. XFS has been fairly bulletproof for me, whereas I have lost entire ext2/3 partitions due to corruption that wasn't even a hardware failure.
I'm a hobbyist, and I don't program system level stuff, essentially, at all anymore, but way back when I did do C programming on Linux (~10 years ago), ISTR that this (from Ts'o in TFA) was advice you couldn't go anywhere without getting hit repeatedly over the head with:
Is this really something that is often missed in serious applications?
Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.
Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.
If that is true, then to the extent that is true, POSIX is "broken". Related changes to a file system really need to take place in an orderly way. Creating a file, writing its data, and renaming it, are related. Letting the latter change persist while the former change is lost, is just wrong. Does POSIX really require this behavior, or just allow it? If it requires it, then IMHO, POSIX is indeed broken. And if POSIX is broken, then companies like Microsoft are vindicated in their non-conformance.
now we need to go OSS in diesel cars
if it was a default FS on the latest version of the $OS_Shipped_On_95_Percent_Of_Desktops and had this bug, sure. If it is a relatively new and untested file system on an OS with choices of stable FS like Reiser, Ext2/3, JFS, XFS, OCFS2, etc, then no, not as big a deal....
Sounds like you'd fit right in at Microsoft. Ignoring technology "not invented here" isn't innovation, it's reinventing the wheel, aka a stupid waste of time.
For those of us who are not so familiar with the data loss issues surrounding EXT4, can someone please explain this? The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?" I.e. if I ask OpenOffice to save a file, it should do that the exact same way whether I ask it to save that file to an ext2 partition, an ext3 partition, a reiserfs partition, etc. What would make ext4 an exception? Isn't abstraction of lower-level filesystem details a good thing?
It is a miracle that curiosity survives formal education. - Einstein
So is this why we can't have voting (where correctness is paramount over performance) systems developed on Linux?
now we need to go OSS in diesel cars
Anyhow, ZFS is raid, lvm, and fs rolled up into one, so keeping the patch up to date with linux changes could be a bit of work.
Do you even lift?
These aren't the 'roids you're looking for.
Ext4 is still alpha-ish, and declared as such.
Any *user* who trusts production data to an experimental filesystem is already too stupid to have the right to gripe about losing said data.
Why not just make the actual "flushing" process work primarily on memory cache data - including any "renames", "deletes", etc.?
If any "writes" are pending, then the other operations should be done in the strict order in which they were requested. There should be no pattern possible where cache and file metadata can be out of sync with one another.
No. Being innovative means being original, and that means taking new and different paths.
Yeah, but you still have to get on the road so you can blaze your own trail off of it. That means knowing how other people have done things.
Otherwise, how far do you go with this? First principles? Hell, really, the only way to ensure a totally creative being is to have a baby and hand it over to wolves for rearing. You can be sure that that kid's ideas will be totally uncorrupted by the ideas of other humans. Of course, the kid's ideas will also be useless, but that's the price you pay for creativity.
Interestingly enough NTFS is probably one of the best things about Windows. It has most of the modern features, is incredibly resilient, and has existed for a LONG time.
Technophile
Ok...
A) Data loss is due to corrupting/interruption in the time it takes for the file-system to write pending items to the disk. We know that.
B) The time it takes to write items, that are not specifically (in code) told to write to disk NOW, is longer than in previous incarnations. We know that.
C) The main reason no one complained about this feature in ext3 was that the pending time was about 5secs and often times it was never noticed. We know that.
Honestly, any distro that would make this default on install may be brain-dead... The average user is more concerned with data retention than performance. However, having a mechanism to scale the pending write times variably is a good option and scalable to anyone's needs (home -> large data centre).
I say don't drink and drive, you might spill your drink. Before you get behind the wheel just stop and think.
I like the saying, "Working as designed, too bad it's a shitty design." I understand it complies with POSIX, but isn't the goal to make something that is perceived as better? In any case I think the semi-arrogance of the authors is the real issue here, not the behavior of the fs.
Technophile
So as expected, there is a veritable army of people demanding the old behavior restored; also, most probably a lot of them will "downgrade" or stay with using EXT3.
Of course, the things at fault are really the buggy applications. But even deeper than that, the *paradigm* of having a lot of generated files (that store important user data) that are rewritten unconditionally at each program startup is wrong. What the hell is up with that?
Can't they come up with a method where you rewrite a file only when absolutely necessary? Why must all icon locations, thumbnails and other such GUI desktop bullshit be written and rewritten zillions of times?
Not to mention that EXT3 is just one file system out of many, and arguably not even a very good one. It's rather weird that it was chosen as a default option for so many "popular" distributions (maybe out of some misguided desire to be backwards compatible?). If your application (or again, *paradigm*) works well on only one file system, then it's most probably not the file system's fault.
we discovered a new way to think.
...under ext4.
"Not an actor, but he plays one on TV."
Actually
Solution: an update to the code to behave as idiot application programmers require with a simple mount option.
I hope you're careful in all your C or C++ programming to never run, even for test, a program with behavior undefined according to the appropriate standard. (Something like accidentally getting two modifications of the same thing in without an intervening sequence point, or accidentally dereferencing a null pointer, or missing the new-line character after the last line of your program, or writing an integer literal that doesn't fit in long int (for C++, anyway).)
If not, I hope you don't mind if it emails a nasty letter of resignation to your boss, and your porn collection to your mother. That's perfectly compatible with both the C and C++ standards.
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"
They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.
The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Basically, the spec was written one way, but the actual behavior was slightly different. Even though the standard didn't guarantee something to be written, most filesystems did it anyway. When EXT4 didn't write things immediately to improve performance, the applications that depended on filesystems writing data ASAP (even though it wasn't required behavior) started risking data loss whenever a crash hit before data they had not explicitly synced was written out.
The mechanism (fsync) has been around for ages, it's just that most apps didn't use it when they should because there wasn't a "need" to until EXT4, and other systems like XFS which are less popular and tend to be run by people who know what behavior to expect.
My blog. Good stuff (when I remember to update it). Read it.
I would be surprised to see an example of anyone criticizing Windows developers for following established standards.
while : ; do sync ; sleep 5 ; done
Ext4 *is* better, and probably because it benefits from the wiggle room provided by the specifications. The question is if you accept the tradeoff between performance and security. I choose performance, because my system doesn't crash that often.
My reading is that applications have been relying on an undocumented feature of the old filesystems instead of being implemented in an fs independent way. Ext4 removed this "feature" and exposed the already existing dependence of these applications. Thus to be fs independent, applications should call fsync to force data be physically written to disk.
The problem is they weren't. Instead they are relying on an (undocumented) feature of ext2/3 to do the fsync for them.
Even though the standard didn't guarantee something to be written, most filesystems did it anyway
No they didn't - ext3 was quite atypical. Even on Windoze, NTFS requires a fsync. (Mind you, vista introduces a transactional API like reiser was on about for linux before he turned all murdery...)
Clue sticks? Why not chairs?
Tyranny isn't the worst enemy of a democracy. Cynicism is.
Yes. All new kernel features should do anything it takes to ensure they work with popular applications. If a new kernel feature breaks an application, even if it is because the developers made incorrect assumptions about how things work, then the new kernel feature should be discarded. This is simple common sense, and something that even Microsoft gets right.
Then why bother coming up with standards? If Firefox doesn't properly render IE specific pages, is firefox at fault? Or is the webserver that isn't following W3C standards? Sorry, but following the standard always trumps everything else.
I don't particularly like the standard, but I've known about the need to make sure that writes hit the disk for years. And since it is a standard, it is documented. If the KDE folks (and whomever else) don't bother learning how to follow standards, then the egg is on their face.
Channel your indignation to change (ie "fix") the POSIX standard. But as someone else posted, a bunch of people will immediately scream for a way to defer the writes because the way you think it should work is too slow.
- doug
Part of the problem, as I understand it, was that ext3 performed horribly if you did do things according to the spec: fsync wrote everything pending, not just the data for the file descriptor you gave it.
I think it will be a great idea to use it for desktops as it might force applications to be written correctly, those that are really worried about it can put off upgrading to a new ubuntu until the dust settles.
I don't expect my OS to crash often enough for it to be a concern anyway, and the places where it's really important (like document based apps like emacs/vi/ooffice) had better have been using fsync already.
Losing a max 2 minutes of recent data changes for extra performance only when an App isn't written to spec, I think I can live with that.
If using 'C' IO, does fclose() alone guarantee that your data is written to disk, or must one do fsync() and then fclose()?
fsync is not defined for 'C' IO, it is a UNIX system call.
I think most code does
fopen()
fwrite()
fclose()
If this is buggy code, then this must affect about every 'C' program ever written.
If this is about cases where fclose() does not get called because of a crash, then it is definitely an application bug.
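To make the layering concrete: closing a buffered file hands the data to the kernel, but as far as I know it does not by itself force the kernel to write it to disk; in C you would call fflush(fp) and then fsync(fileno(fp)) before fclose(fp). Here is a small Python sketch of the same sequence, since Python's file objects are layered the same way; the helper name `write_and_sync` is made up for illustration.

```python
import os


def write_and_sync(path, text):
    """Write `text` to `path` and ask the kernel to push it to disk before closing."""
    with open(path, "w") as f:   # roughly fopen()
        f.write(text)            # roughly fwrite(): may sit in a user-space buffer
        f.flush()                # roughly fflush(): hand the bytes to the kernel
        os.fsync(f.fileno())     # fsync(fileno(fp)) in C: ask for a write to disk
    # Leaving the with-block closes the file, like fclose().
```

Without the flush and fsync lines this is the plain fopen/fwrite/fclose pattern from the comment above. That pattern is fine for most files, and only becomes a problem when the program relies on the data surviving a crash right after the close.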
If I had wanted POSIX-compliant behavior, I could have gotten Windows NT! (Windows was just POSIX-compliant enough to be certified, but the POSIX implementation was so half-assed that it was unusable in practice.) Just because Ext4 complies with the minimum requirements of the spec doesn't make it right, especially if it trashes your data.
Many married men would consider this to be a feature, not a bug.
Seven puppies were harmed during the making of this post.
"No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.
It turns out that all the modern operating systems work exactly like that. In ALL of them you need to use explicit synchronization (fsync and friends) to get a notification that your data has really been written to disk (and that's all you get, a notification, because the system could oops before fsync finishes). You can also mount your filesystem as "sync", which sucks.
Journaling and COW/transaction-based filesystems like ZFS only guarantee the integrity, not that your data is safe. It turns out that Ext3 has the same problem, it's just that the window is smaller (5 seconds). And I wouldn't bet that HFS and ZFS don't have the same problem (btrfs is COW and transaction based, like ZFS, and has the same problem).
Welcome to the real world...
I disagree. The mentality of backwards compatibility, even if the old app doesn't follow spec, is what keeps systems from moving forward. I mean, just think of how much further behind web standards would be if FF, Opera, and Safari thought it was paramount to emulate every quirk of IE6 for the sake of backward compatibility.
The right way to do it is more or less what they are doing, implement the new system to the spec, roll it out as an option or beta, and give all the app developers a chance to realise and correct their mistakes and flawed assumptions before the new tech gets widely adopted.
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"
They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.
The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.
Thanks for explaining that. In that case, I salute Mr. Tso and others for telling the truth and not caving in to pressure when they are in fact correctly following the specification. Too often people who are correct don't have the fortitude to value that more than immediate convenience, so this is a refreshing thing to see. Perhaps this will become the sort of history with which developers are expected to be familiar.
I imagine it will take a lot of work but at least with Free Software this can be fixed. That's definitely what should happen, anyway. There are times when things just go wrong no matter how correct your effort was; in those cases, it makes sense to just deal with the problem in the most hassle-free manner possible. This, however, is not one of those times. Thinking that you can selectively adhere to a standard and then claim that you are compliant with that standard is just the sort of thing that really should cause problems. Correcting the applications that made faulty assumptions is therefore the right way to deal with this, daunting and inconvenient though that may be.
Removing this delayed-allocation feature from ext4 or placing limits on it that are not required by the POSIX standard is definitely the wrong way to deal with this. To do so would surely invite more of the same. It would only encourage developers to believe that the standards aren't really important, that they'll just be "bailed out" if they fail to implement them. You don't need any sort of programming or system design expertise to understand that, just an understanding of how human beings operate and what they do with precedents that are set.
It is a miracle that curiosity survives formal education. - Einstein
Microsoft Patent
You have a separate partition for /root ? How large can the home folder of the root user be?
The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync().
So, in principle, the filesystem could just throw away the data unless the application explicitly calls fsync?
This seems to be a little bit...hmmm....stupid?
And I really don't understand why the data access isn't "virtualized", i.e. data access before commit will access the new data in the write cache and not the old stuff. Yes, yes, commit and blah. But a filesystem is not a database (Microsoft failed with that idea) and you are really weighing data loss/inconsistency (not fs, just stored data!) on hard disk/power failure versus constant data loss in lots of programs versus paranoid system calls which break all power saving options and kill performance.
They tried to, but history was just a 0-byte file.
When was this that you tried XFS? Originally it did have problems but it got very stable several years ago.
ZFS is just a filesystem with lots of features. Hell, it runs in userspace via FUSE. There is nothing magical or difficult about it.
1) Modern filesystems are expected to behave better than POSIX demands.
2) POSIX does not cover what should happen in a system crash at all.
3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.
4) fsync is not a solution, because it forces the operation to complete *now*, which is bad for write performance, cache coherence, laptop battery life, SSD wear and a bunch of other things.
We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.
Using chairs is patented by Microsoft, with Steve Ballmer listed as "inventor".
Sleep your way to a whiter smile...date a dentist!
XFS does the exact same thing, for what it's worth.
Don't thank God, thank a doctor!
A few percent performance difference will be easily wiped away when the filesystem erases an important file that one time a year when a snowstorm knocks your power out.
He probably means /
How much did you pay for your Linux distro?
I've never understood this "it's free so you have no right to complain" bollocks.
A filesystem could erase files from disk every time read() is done and still be perfectly POSIX compliant (if it put them back on an fsync() call); however, that would also be retarded and an outright disregard for valuable user data.
Same here - the bug shows utter disregard for user data. POSIX compliance here is just as irrelevant as whether the code is indented according to C coding guidelines or not. It is still a regression from ext3 and a data loss under common usage scenarios.
Absolutely correct.
And that's the way it should be done.
Stability by default, increased performance by request.
Let's be realistic: how many applications benefit from this delayed write? Not many, I guess. Now, on the other hand, if you have an extremely I/O heavy app, disable the auto syncs and do it manually.
The POSIX standard is just fine. The problem is application assumptions that aren't up to snuff.
Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, he calls fsync.
If you don't, you risk losing data. Plain and simple.
- Michael T. Babcock (Yes, I blog)
From the explanations I received and some reading I've done, I don't think the data is just getting "thrown away" so that isn't really a valid question. The issue seems to be that unless fsync is called, the changes requested by the application may happen in a sequence that is other than what the application programmer expected. The example I saw in this discussion involved first writing data to a file and then renaming it soon afterwards. If I understand this correctly, the application is assuming that the rename cannot possibly happen before the writing of the data is done even though the specification has no such requirement. If the application needs this to happen in the order in which it was requested, it needs to write the data, then call fsync, then rename the file. You could probably fill a library with what I don't know about low-level filesystem details, so please correct me if I have misunderstood this.
The example I found in the Wikipedia entry on ext4 was different. That one involved data loss because the application updates/overwrites an existing file and does not call fsync and then the system crashes. The Wiki article states that this leads to undefined behavior (which, afaik, is correct per the spec). The article also states that a typical result is that the file was set to zero-length in preparation for being overwritten but because of the crash, the new data was never written so it remains zero-length, causing the loss of the old version of the file. Under ext3 you would usually find either the old version of the file or the new version.
What I don't understand and hope that a more knowledgeable person could explain is why this can't be done a slightly different way. This is where I can apply reason to come up with something that sounds preferable to me but I simply don't have the background knowledge of filesystems to understand the "why". If the overwrite of the file is delayed, why isn't the truncation of the file to zero-length also delayed? That is, instead of doing it this way:
Step 1: Truncate file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data for performance reasons.
Step 3: After the delay has elapsed, actually write the data to the disk.
Why can't it be done this way instead?
Step 1: Delay the truncation of the file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data.
Step 3: After the delay has elapsed, set the file length to zero and immediately write the new data, as a single operation if that is possible, or as one operation immediately followed by the other.
That way if there is a crash, you'd still get either the old version or the new one and not a zero-length file where data used to be. The only disadvantage I can see is that this might continue to enable developers to make assumptions that are not found in the standard because the buggy behavior ext4 is now exposing may continue to work. If there's no technical reason why it cannot be done that way, perhaps the bad precedent alone is a good reason to either not handle it this way or to change the spec.
It is a miracle that curiosity survives formal education. - Einstein
The right thing is to code a reliable filesystem, despite POSIX not demanding such a feat. POSIX was written in the 80s, when system-level code was crap. Now we can do better than that.
fsync() is not an answer - we do not care whether the old data or the new data is saved after the crash. Ext4 loses both. If we start using fsync() that will hurt performance by a ton, cripple write caching, destroy laptop battery life, wear out SSDs much faster than really necessary and cause a bunch of other side effects without any actual gain except hiding this filesystem regression.
It is a filesystem bug - a regression with major data loss in common usage scenarios.
You don't risk any data loss, ever, if you shut down your system properly. The system will sync the data to disk as expected and everything will be peachy. You risk data loss if you lose power or otherwise shut down at an inopportune time and the data hasn't been sync'd to disk yet.
That is to say, 99% of people who use their computers properly won't have a problem.
Also note, the software you use should be doing something like:
loop: write some data, write some more data, finish writing data, fsync the data.
The problem here is that the program is doing the "writing" part and because of how caching and delayed writes work (without which, your computer would crawl), the data isn't written to disk _yet_ but will be, eventually.
Old software assumed the data would be written soon. With Ext4 it's possible it won't be written until much, much later, for performance and power benefits.
PS you can just open a terminal window and type "sync" at any time to flush the data to disk on your system. I'm sure someone could write a tray icon that does the same in 30 seconds.
- Michael T. Babcock (Yes, I blog)
There are 2 'new' things about ext4 that are contributing to data loss:
1. The filesystem doesn't flush to disk as often as ext2/3 did.
2. When it *does* flush, the order of operations is such that it's possible to crash and end up with neither the old nor the new version of a file.
Number 1 is definitely a performance enhancer, and may be okay (certainly would be easy to tune down for desktop systems if you want to).
Number 2 is the real problem. A lot of apps were written assuming that if their data didn't get flushed out, it's no big deal. For example, if you write to a temp file and then rename, you were always guaranteed to either have a good copy of the old file or the new file. That being true, it's not 'wrong' to do your update without an fsync(). If you don't really care about losing the changes (and *only* the changes), then you don't need to force the disk to spin up just to guarantee you don't lose them.
But ext4 is a game changer here. No guarantees at all, and no way to guarantee a good file other than to do an fsync().
In fact, if order of operations makes it possible to end up with a corrupt file after a crash, it may well be possible that this could happen even if you do an fsync(). The system can still crash in the middle of your fsync(), and if at any time, the filesystem produces something inconsistent on disk, you can end up with a problem. No filesystem should ever be coded that intentionally creates inconsistent data on disk, however transient. Imagine a DBMS doing that.
I don't know how much of a performance gain you get by the order of operations change, but I suspect it's not so much. And if it opens up a window for data corruption, IMHO, it's not worth it.
Posted from my Android phone. Oh, I can change this? There, that's better...
Is writing a new file, and then renaming it over an existing file really a 'typical workload'???
It is not the application's job to understand how the hardware works. That is the reason an operating system like Linux (a monolithic kernel) exists. Without an operating system such as Linux, OpenBSD, FreeBSD, NT or XNU, every system program and application would have to be written to control the hardware itself: how to move the disk drive head, how to move data to and from RAM, how to show data on the screen, and in general how to drive the I/O functions.
These days you have very complex operating systems (like the Linux kernel, which is a monolithic operating system) and even more complex software systems (like Ubuntu, Windows 7, Mac OS X) that include the operating system itself (Linux, NT, XNU), lots of middleware (system programs, like the GNU project's applications), and then all kinds of other layers and platforms like Qt, Java, GTK+ etc. Most software developers do not need to know anything below the layer they use to develop their application. A Java developer does not need to know how the operating system controls the hardware, just what Java does, what it is connected to, and how it talks to the operating system.
Ext4 is just one part of the operating system that normal developers should not need to know about.
Let's be realistic: how many applications benefit from this delayed write? Not many, I guess.
The guess is wrong.
By delaying writes, ext4 has a bigger window to determine how it will allocate stuff.
Also, this seems like just what the doctor ordered for flash drives.
Save your wrists today - switch to Dvorak
Innovation is using something in a new way,...
Stop right there. Anything more than that, and you involve observers.
The inventor thinks he is innovating.
The observer, knowing precedents to the invention, does not.
Context is implicit in "a new way". Moving things by wheeled sledges (that is, carts) would have been an innovation to the Incas, but not to the Sumerians.
Saying "should have" is simply a justification for blame.
There is no "should". There is only "do", or "do not".
They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.
Yup, and the problem has existed with KDE startup for years. I remember the startup files getting trashed when Mandrake first came out and I tried KDE for long enough to get hooked, and it's happened to me a few times a year ever since with every filesystem I've used. I just make my own backups of the .kde directory and fix this manually when it happens. I'm pretty good at this restore by now. Hopefully this bug in KDE will get fixed now that it is causing the KDE project such great embarrassment. I had a silent wish Ts'o would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.
PS I use a lot of experimental graphics drivers for work, hence lockups during startup are common enough that I probably see this KDE bug more than most KDE users. But they really violate every rule of using config files: 1st. open with minimum permission needed, in this case read only, unless a write is absolutely necessary. 2nd. only update a file when it needs updating. 3rd. when updating a config file make a copy, commit it to disk, and then replace the original, making sure file permissions and ownership are unchanged, then commit the rename if necessary.
PS2 Those computer users saying an fsync will kill performance need to get a cluebat applied to them by the nearest programmer. 1st. There will be no fsyncs of config files at startup once the KDE startup is fixed. 2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be unnoticeable when you apply a settings change. 3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.
It's a fairly typical way of trying to achieve something loosely approximating transactional behavior with respect to updates to the file in question without relying on transactional file system semantics.
And I used to make my living repairing NTFS filesystems back in the 90's. Back then the smart folks had their boot drive formatted FAT for a reason. Of course, NTFS is much more mature now than back then. The same argument applies here, EXT4 was just released for general use. We should all give it, and Ted and company, a break.
I've followed Ted's work for many years on the FOSS front. I fully expect him to make EXT4 work best in both scenarios (data safety and performance optimized).
Kirk McKusick spent a lot of time working out the right order to write metadata and file data in FFS and the resulting file system, FFS with Soft Updates, gets high performance and high reliability... even after a crash.
It's a nice safety net. If you have to reformat your OS partition for some reason, you won't lose all of your stuff in /home.
Doesn't XFS have a sort of rewinding capability to restart a write operation if a crash occurs? I used XFS on my laptop for about a year, and the laptop had various thermal-related issues, so it locked up a lot. I don't think I ever lost any data with XFS, even with delayed allocation enabled..
... and he had an interesting take on it.
First, he said he got one of his vendor codemonkeys (emphasis on monkey here) to say that he understood why people did what they did; it always annoyed him to have to wait for data to write so his application could get on to the important stuff. His application is an inventory management system that runs on RPG midrange machines.
My buddy would howl at this. Um, excuse me, but the data *is* the important stuff. One of many reasons my bud ended up re-writing much of the canned software he was saddled with a few years ago when he took his current position. Some stuff he just 'tweaks', he says.
And he then related many a story of older systems and newer systems, from PDP-11s through the whole IBM System3x range and E-Series, and the infamous Windows servers he had on those processor cards and all, and the flaky stuff he saw.
He thoroughly understands the temptation to cache writes, and considers it pure poison. He says, "If your data isn't important enough to write out, it isn't important. Send it to /dev/null, that'll improve performance too!"
Of course, /dev/null isn't an option. But he recognizes the OS is not always going to optimize your app.
And he didn't joke much about this EXT4/EXT3 issue. Something about being there before, or something. But he's weirderer than I am anyways.
deleting the extra space after periods so i can stay relevant, yeah.
Depending on how much non-system software root has installed in his home directory... pretty large.
Care to elaborate?
The only reason for this delayed write is the performance gain, unless I'm overlooking something crucial here.
Ok, so how many (especially desktop) apps will be faster by an amount observable by the user? I still think this won't be many.
As to the "this seems to be what the doctor ordered for flash drives". Does this imply the number of reads and writes to the drive is reduced? As far as I know only the number of writes is relevant to the longevity of flash drives, not their timing.
The application programmers aren't at fault here, the POSIX spec is. A filesystem is essentially a hierarchical database, yet POSIX doesn't include a way to make atomic updates to it. The only tool provided is fsync, which kills performance if used. And even with fsync some things - such as rewriting a configuration file - are either outright impossible or complex and fragile.
The real solution is to come up with a transactional API for filesystem. Until that's done, problems like this will persist. Calling fsync - which forces a disk write - or playing around with temporary files isn't reasonable when all you want to do is make sure that the file will be updated properly or left alone.
The alternative is to have every program call fsync constantly, which not only kills performance, but ironically enough also negates some of Ext4's advantages, such as delayed block allocation, since it essentially disables write caching. And it doesn't work if you are doing more complex things, such as, say, mass renaming files in a directory; you have no way of ensuring that either they are all renamed, or none are.
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Then why is it marked as stable in 2.6.28? Unless there's some strange definition of "stable" they use there...
Ok, so how many (especially desktop) apps will be faster by an amount observable by the user? I still think this won't be many.
Depends how much the apps depend on the file system (both reading and writing). Many desktop apps depend on reading a lot, and they benefit from better ordering of the data in file system.
Of course you only need to care about this if you care about file system performance in the first place - I don't think ext3 is going anywhere soon.
As to the "this seems to be what the doctor ordered for flash drives". Does this imply the number of reads and writes to the drive is reduced?
Yes - if you delay the writes more, you have a better idea what actually needs to hit the disk in the end, so you can cut away unnecessary writes. Since you can also combine several writes, you need to erase fewer blocks (though this one is more speculation than actual knowledge).
Save your wrists today - switch to Dvorak
A filesystem is not a Database Management System. Its purpose is to store files. If you want transactions, use a DBMS. There are plenty out there which use fsync correctly. Try SQLite.
1) Modern filesystems are expected to behave better than POSIX demands.
2) POSIX does not cover what should happen in a system crash at all.
3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.
4) fsync is not a solution, because it forces the operation to complete *now*, which is bad for write performance, cache coherence, laptop battery life, SSD wear and a bunch of other things.
We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.
1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.
2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.
3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.
4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.
We don't need a filesystem that sledgehammers each and every byte of data to the hard drive just in case there is a crash. What we DO need is a filesystem that can flexibly handle important data when told it is important, and less important data very efficiently.
What you are asking is that the filesystem be some kind of sentient all knowing being that can tell when data is important or not and then can write important data immediately and non-important data efficiently. I think that it is a little better to have the application be the one that knows when it's dealing with important data or not.
If the filesystem is a few percent faster but then your disk sits idle half of the time and then you have a crash and lose a file that takes two hours to recreate, have you actually gained any performance?
A big limitation with flash drives is that repeated reads and writes to a given sector of storage "wear it out" and cause failure more quickly than the same amount of reads and writes to a given sector on a traditional disk device. The generally accepted solution is to use an algorithmic approach to distribute reads and writes evenly throughout the disk (note: transparent to software developers, at least above the kernel level), and this is what the GP is talking about--more time between physical disk writes means that there is more opportunity for an algorithm to decide intelligently where different pieces of the written data should go.
It sounds like the correct solution is for the file system to implement transactional semantics. That is what the applications need and were incidentally getting, despite it not being in the spec.
Why isn't this being considered as the solution? Other major OSes have implemented basic atomic transactions in their filesystems successfully, so why not Linux?
Disadvantages: You risk data loss with 95% of the apps you use on a daily basis.
Wooo. A whole 30 seconds. Horrifying. Here's a workaround for you. Put it in your .bashrc
while true
do
sync
done &
You can't be too careful.
That's what I meant :) Thanks for clarifying.
Face your daemons!
Sounds to me like EXT4 kinda fails at this minor detail, eh?
I mean, if it isn't committing the changes to the file system in the right order, it isn't exactly storing files.
Or do you care to amend your definition to read "Its purpose is to attempt to store files. It does not promise to actually store files"?
So at what point is "sentience" achieved then? :3
ReiserFS is pretty dead, while btrfs may actually be better though I can't believe it, so perhaps when Tux himself, in his 3rd incarnation, finally is my secretary and is managing my files will my computer finally have "sentience".
Promote true freedom - support standards and interoperability.
3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.
You completely missed the point. The new data isn't important; it could be lost and nobody would care. The troublesome part is that you lose the old data too. If you lost the last 5 minutes of changes in your KDE config that would be a non-issue. What happens, however, is that you lose not just the last few changes but your complete config: it ends up as 0-byte files, which is a state that the filesystem never had.
fsyncs have other nasty side effects other than performance. For example, in Firefox 3, places.sqlite is fsynced after every page is loaded. For a laptop user, this behavior is unacceptable as it prevents the disks from staying spun down (not to mention the infuriating whine it creates to spin the disk up after every or nearly every page load). The use of fsync in Firefox 3 has actually caused some people (myself included), to mount ~/.mozilla as tmpfs and just write a cron job to write changed files back to disk once every 10 minutes.
So, while I'm all for applications using fsync when it's really needed, the last thing I'd like to see is every application on the planet sprinkling its code with fsync "just to be sure".
Ok, so how many (especially desktop) apps will be faster by an amount observable by the user? I still think this won't be many.
Depends how much the apps depend on the file system (both reading and writing). Many desktop apps depend on reading a lot, and they benefit from better ordering of the data in file system.
Ok, I haven't read through the source for ext4 so I don't know how this "magic" ordering works, but it sounds a bit like "defragmentation before writing". I'll assume that instead of looking for a large enough consecutive block for the next file it instead looks for a large enough block for the files in the "cache".
While this can obviously be advantageous at times, there still is no way the file system can know in which order the app will read these files on next startup (the app could even store the files in the reverse of the order it reads them, e.g. read global configs first, then specific, but write in reverse order), making this ordering even detrimental to performance.
But I'm sure you are right about ext3. It won't be going away anytime soon.
As to the "this seems to be what the doctor ordered for flash drives". Does this imply the number of reads and writes to the drive is reduced?
Yes - if you delay the writes more, you have a better idea what actually needs to hit the disk in the end, so you can cut away unnecessary writes.
Right. But that's exactly the problem. After a power outage or system crash these omitted writes can't be recovered.
Just spin this idea to the maximum. Delay all writes until the system is out of RAM. This maximises the ability to do an intelligent sync with the drive. Nice in theory, but prone to lose ALL data written since the last system start in the event of a crash.
Since you can also combine several writes, you need to erase fewer blocks (though this one is more speculation than actual knowledge).
Might well be true for some flash drives and not for others. And a good reason wear leveling algorithms should be in the device controller, not a filesystem driver.
Yes. It provides, according to the old semantics, that the file change will appear atomic. It is in effect a slightly modified Read-Modify-Write.
Looks like this (crappy C pseudocode, so don't get all pedantic):
fh = open(originalpath);
olddata = parsefile(fh);
close(fh);
newdata = modify_file(olddata);
fh = open(tmpfilepath);
write(fh, newdata);
close(fh);
unlink(originalpath);
rename(tmpfilepath, originalpath);
Note that this provides atomicity for the data. Either the file is existent and contains consistent data, or it doesn't exist.
It looks like ext4 broke this by reordering some of the operations. It committed the metadata before it actually wrote the data to tmpfile, thus leaving us with a zero-sized file if the system crashes before the data commit.
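Here is a rough, compilable sketch of the same write-to-temp-then-rename pattern with an fsync() placed before the rename(), which is the ordering others in this thread say the spec requires. Two things in it are my own assumptions rather than anything from the pseudocode above: the helper name replace_file is invented, and the unlink() of the original is dropped, since POSIX rename() replaces an existing target atomically.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>    /* rename() */
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` via a temp file.
 * Returns 0 on success, -1 on failure (the old file is left alone). */
int replace_file(const char *path, const char *tmppath, const char *data)
{
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    size_t done = 0;
    while (done < len) {                 /* write() may be partial */
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) {
            close(fd);
            unlink(tmppath);
            return -1;
        }
        done += (size_t)n;
    }

    /* Force the new contents out before the rename can be committed. */
    if (fsync(fd) != 0) { close(fd); unlink(tmppath); return -1; }
    if (close(fd) != 0) { unlink(tmppath); return -1; }

    /* rename() atomically replaces path; no unlink() of the original. */
    if (rename(tmppath, path) != 0) { unlink(tmppath); return -1; }
    return 0;
}
```

A call might look like replace_file("config", "config.new", new_contents). The sketch ignores EINTR and does not sync the containing directory, so treat it as an illustration of the ordering and not a complete recipe.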
KDE isn't fixed right now. Additionally, KDE is not the only application that generates lots of write activity. I work with real-time systems, and write performance on data collection systems is important.
I did some benchmarks on the ext3 file system, the ext4 system without the patch, and the ext4 system with the patch. Code that followed the open(), write(), close() sequence was 76% faster than the same code with fsync(). Code that followed the open(), write(), close(), rename() sequence was 28% faster than code that followed the open(), write(), fsync(), close(), rename() sequence. Additionally, the benchmarks were not significantly affected by which file system was used (ext3, ext4, or ext4 patched). You can look up the spreadsheet and the discussion in the Launchpad discussion.
Major Linux file backup utilities, like tar, gzip, and rsync, don't use fsync as part of normal operations. The only one of the three that uses fsync, tar, only uses it when verifying that data is physically written to disk. In that situation, it writes the data, calls fsync, calls ioctl(FDFLUSH), and then reads the data back. Strictly speaking, that is the only way to make sure the file is written to disk and is readable.
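For anyone who wants to see what that write, fsync, read-back sequence looks like, here is a rough C sketch. It is not tar's actual code: the function name write_and_verify is made up for illustration, the ioctl(FDFLUSH) step is left out, and the read-back may well be served from the page cache rather than from the platter, so treat it as a demonstration of the sequence and not as proof the bits are on disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path`, fsync, then read them
 * back and compare. Returns 0 if the data matches, 1 on mismatch, -1 on
 * I/O error. */
int write_and_verify(const char *path, const char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < len) {                       /* handle short writes */
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) { close(fd); return -1; }
        off += (size_t)n;
    }

    /* Ask the kernel to push the data out, then rewind for the read-back. */
    if (fsync(fd) != 0 || lseek(fd, 0, SEEK_SET) < 0) {
        close(fd);
        return -1;
    }

    char buf[4096];
    int result = 0;
    off = 0;
    while (off < len) {
        size_t want = len - off < sizeof buf ? len - off : sizeof buf;
        ssize_t n = read(fd, buf, want);
        if (n < 0) { result = -1; break; }
        if (n == 0 || memcmp(buf, data + off, (size_t)n) != 0) {
            result = 1;                       /* short file or wrong bytes */
            break;
        }
        off += (size_t)n;
    }

    if (close(fd) != 0 && result == 0)
        result = -1;
    return result;
}
```

Calling write_and_verify("/tmp/example.bin", "hello", 5) should return 0 on a healthy file system.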
Finally, as Theodore Ts'o has pointed out, if you really want to make sure the file is saved to disk, you also have to fsync() the directory. I have never seen anyone do that as part of a normal file save. Most C programming textbooks simply have fopen, fwrite, fclose as the recommended way to save files. Calling fsync this often is unusual for most C programmers.
I would hate to be in your programming class. You're enforcing programming standards that aren't followed by key Linux utilities, aren't in most textbooks, and aren't portable to non-Linux file systems.
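A rough sketch of what fsyncing the directory means in practice, assuming a Linux-like system where a directory can be opened read-only and passed to fsync(). The helper name fsync_parent_dir is invented for this example; it is not from any utility mentioned here.

```c
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync the directory that contains `path`, so the
 * directory entry itself (not just the file data) is pushed to disk.
 * Returns 0 on success, -1 on failure. */
int fsync_parent_dir(const char *path)
{
    assert(path != NULL);

    char copy[4096];
    if (strlen(path) >= sizeof copy)
        return -1;
    strcpy(copy, path);               /* dirname() may modify its argument */

    const char *dir = dirname(copy);
    int fd = open(dir, O_RDONLY);     /* directories can be opened read-only */
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    close(fd);
    return rc;
}
```

You would call this after the file's own fsync() and close(), for example fsync_parent_dir("/home/me/.config/app.conf").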
If you require your students to fsync() the file and the directory, as part of a normal assignment, you are requiring them to do things that aren't done by any Linux utility out there. Further, if you are that paranoid, you better follow the example from the tar utility, and after the fsync completes, read all the data back to verify it was successfully written.
Glossing over some details, what is happening is closer to this:
The goal is to replace config with a new version. The programmer is essentially doing this:
The goal is that when you replace config, you're replacing it with a guaranteed complete version, config.new. Assuming it happens in this order (and that step 3 is atomic; it happens or doesn't, never partially) if you crash midway through, you'll either end up with the old config or the new config, but never a partial config. Unfortunately the operating system tries to speed things up, and for a variety of good reasons delaying step 2 makes sense. Doing so is allowed by the standards specifically for these good reasons. So what actually happens is this:
This works fine... unless something happens between steps 3 and 2. If we stop there, we have a new, empty file in place of "config." With ext4, the window between 3 and 2 could be as long as a minute, a window during which you can lose data.
The correct solution is for the program, not the operating system, to take care with files it cares about:
Now it's not possible to move 2a after 3, so you're guaranteed safe behavior. But you lose the speed benefits of reordering. For data you care about, this is a good idea. For data you don't care about (your web browser cache leaps to mind), it's overkill and makes you slower.
ext3 (and the new ext4 option) essentially adds 2b automatically. It's good in that it's safer for everyone involved, but it's bad in that everyone takes a speed hit, even in cases where speed is more important than safety.
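The program-side sequence described above (write the new file, force it to disk, then rename it over the old one) can be sketched in C roughly like this. The function name replace_file_safely and the "<path>.new" naming are assumptions for illustration, and EINTR handling is left out to keep it short.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes of `data` using
 * write, fsync, rename. Returns 0 on success, -1 on failure. */
int replace_file_safely(const char *path, const char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    /* Assumed convention: "<path>.new" in the same directory, so the
     * rename() stays within one file system. */
    char tmp[4096];
    if (snprintf(tmp, sizeof tmp, "%s.new", path) >= (int)sizeof tmp)
        return -1;

    /* Step 1: create the new, empty file. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Step 2: write the new contents, handling short writes. */
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) { close(fd); unlink(tmp); return -1; }
        off += (size_t)n;
    }

    /* Step 2a: force the data out before the rename can be committed. */
    if (fsync(fd) != 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* Step 3: atomically put the new file in place of the old one. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}
```

After a crash at any point, a reader of path should see either the complete old contents or the complete new contents, at the cost of one forced write per save.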
Search 2010 Gen Con events
Is writing a new file, and then renaming it over an existing file really a 'typical workload'???
YES!!!!!!
Good idea, wrong place.
...
...
Stuff like that belongs in the device controller, not the file system.
Or do you really fancy something like this in a filesystem?
switch(VENDOR_ID){
case VENDOR1: switch(PRODUCT_ID){
case...
case...
given that they have similar concurrency and parallel access issues - at least the principles?
and why cannot those principles be applied?
and if POSIX does not guarantee data integrity, then maybe it is time for a POSIX1.1 or POSIX++ ?
(retrospective disclaimer: i am not a hacker or file system programmer, but issues seem similar in principle...)
Just like KDE4 then.
You don't understand the problem.
You are wrong when you say EXT3 has this problem. It does not have it. If the EXT3 system crashes during those 5 seconds, you either get the old file or the new one. For EXT4, if it crashes, you can get a zero-length file, with both the old and new data lost.
The long delay is irrelevant and is confusing people about this bug. In fact the long delay is very nice in EXT4 as it means it is much more efficient and will use less power. I don't really mind if a crash during this time means I lose the new version of a file. But PLEASE don't lose the old one as well!!! That is inexcusable, and I don't care if the delay is .1 second.
Almost correct, but what actually happens with ext3 is the following:
* 1. Create config.new. (Should be empty, because it's new)
* 2. Write the new contents into config.new (cached)
* 3. Move config.new onto config (cached)
(time passes)
* 3b. Filesystem decides that it is time to commit cache to disk and tries to commit metadata first. All commits are written to a journal.
* 2b. Metadata commit is determined to be dependent on file data, so file data is written first.
* 3c. Metadata is written to disk.
If a crash happens at any point before 3c, you get the old file after the crash; if it happens after 3c, you get the new file.
Arrgh it is annoying to keep having to fix people here.
KDE is already fixed exactly as you suggest. It writes a temp file and then renames it over the original. The problem with EXT4 is that this produces a totally unexpected result if it crashes (i.e. the result is that the destination is neither the old nor the new file, but garbage).
The people saying "fsync!" do need the cluebat. They are saying this should be done even if the rename is done exactly like you suggest. That is really slow, just like you say.
I do agree KDE should be fixed to not attempt to write any of its files except when they really change. Rewriting all of them on startup is stupid!
If they cared about data integrity, they'd have a mount option to turn it *on*. Then in the manual, put a nice fat warning about "if you set this flag, there is a chance we will trash your filesystem, but do it really fast :-)".
I judge a program by shit like this. PostgreSQL comes out of the box with all kinds of integrity improving, but performance hurting options enabled by default and has nice fat comments about the lines where you can turn off stuff like fsync (which I'd never consider).
Bottom line, if they want to restore faith in their file system, the default, flag-free option would be the most stable but worst performing. Let us decide when to run the risk of trashing our file system, not find out the defaults sucked after the file system is hosed.
> this space reserved for nitwits who will claim I should have read the docs before doing anything so I'd know why I should always set this "dont-trash-my-filesystem" flag.... piss off, I dont read the manual before mounting a NTFS formatted USB drive, why should I have to read it before mounting your shitty filesystem?
It's a strange question coming from someone with an @debian.org email address. Are you for real?
> Every time the author wants to assure himself that data has been written to the disk, it calls fsync.
The problem is, the application developers are not complaining about lacking the strong guarantee that the data is on the disk, but about losing a weaker consistency guarantee, as Matthew Garrett explained quite aptly.
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
ARRGH! This has nothing to do with the data being written "soon".
The problem with EXT4 is that people expect the data to be written before the rename!
Fsync() is not the solution. We don't want it written now. It is ok if the data and rename are delayed until next week, as long as the rename happens after the data is in the file!
I imagine it will take a lot of work but at least with Free Software this can be fixed.
The applications aren't broken, what they do is perfectly normal and taught in pretty much any C programming book out there. Adding fsync() all over the place wouldn't fix anything. For one thing it would mean inserting platform specific code into every application that might otherwise be completely portable ANSI-C or ISO-C++, which would be really ugly, but it would also make the filesystem extremely slow, since now everything gets written to disk instantly and can't be cached. If you want to have fast and secure file writing there is only one place where you can fix that and that is the filesystem.
The problem with using POSIX for anything is that the specifications are so loose as to be nearly useless. This happened because the specs were written by committees manned by the major UNIX vendors. Those vendors all made sure that the specification covered their implementations' ugly edge cases, so they wouldn't have to update anything in their OS.
In the end, POSIX is basically useless for writing any kind of application code. If every application developer out there tried to deal with every edge case in the POSIX specifications they would never get any real code written*.
On the other hand, kernel and OS developers love POSIX because they can write just about anything, and it conforms to the specification. They can write all kinds of broken threading implementations, messed up communication and filesystem crap, and it conforms. When something unexpected happens, it's the application developer's problem because they failed to account for some strange case that doesn't happen in 99% of situations or cannot happen on the platform the application might have originally been written on.
Not only that, but just about any non-trivial application ends up writing big chunks of code to deal with huge swaths of platform interactions that aren't covered by POSIX (or any other standard, for example SUS). Everything from playing sounds and configuring a network interface to rewinding a tape.
*Ok, how about an example: Did you know that close(2) can fail? What does this mean for real applications? Well, you have to sit in a loop doing closes on a handle until it returns EBADF. Ok, simple! Now what happens if you have a threaded application where the threads are doing opens/closes? Open and close are marked as thread safe, but because close can fail you need a loop. Now what happens if your loop successfully closes a file handle, and another open somewhere else successfully opens and reuses your file handle? That is right, the close loop will close it, leaving you trying to use a file handle that has actually been closed! What's the solution? Well, now you have to write an open/close wrapper that pthread_mutex_locks() a global lock to assure that you aren't getting open/closes running at the same time. This can get worse in some other cases, but I've proven my point. POSIX is a rabbit hole.
I can already visualize the colorful, profanity laced comments your idea will produce. Something like
/* Bugfix #30534: Workaround for fucking EXT4, who is so fucking stupid it cannot even write out our rename in the proper sequence and might trash the users filesystem. Since the EXT4 guys refuse to fix their shitty ass filesystem, we have to hack around their busted shit by fucking fsyncing any god damn changes we might have done before doing any kind of directory operations. bite me assclowns. */
QFT
What does quantum field theory have to do with it?
You don't risk any data loss, ever, if you shut down your system properly.
That's meaningless, in that you can't completely eliminate the risk of a kernel panic or similar bug.
The POSIX standard is just fine.
Using POSIX semantics, how does the operating system distinguish between the following two requests?
http://outcampaign.org/
The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"
They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.
The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.
It may also mean that application devs are not testing their software on more than one platform. Just because it works on ext3, doesn't mean it works well on BSD FFS, UFS, ext4, btrfs, ZFS, etc.
Just compiling and running your code on more than one processor family may show bugs in your data structures, etc., running on more than one OS may show you bugs or bad assumptions about APIs and behaviour.
With all the virtualization software out there you don't even need multiple machines anymore.
Step 1: Truncate file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data for performance reasons.
Step 3: After the delay has elapsed, actually write the data to the disk.
1. open("myconfig.new", O_CREAT|O_TRUNC)
2. write("myconfig.new")
3. fsync("myconfig.new")
4. rename("myconfig.new","myconfig")
What's the big fucking deal? You either get the old data or the new data, and your code will be good on any present or future POSIX system.
Why are you truncating files that you want to keep data on? What happens if you're on a mount point where the "sync" option is on in fstab so all operations are synchronous? Is that in one of your test cases?
This isn't rocket surgery.
2. This bug does NOT exist in ext3.
Ext3 writes out data before metadata is written (at least in the default mode, data=ordered), thus there is no window of opportunity where a crash could cause data loss on ext3. On ext4 there is a 60 second window of opportunity. Or was, before this bug was fixed by patches pending for 2.6.30.
The new data is NOT important, it can be thrown away in a crash and no one will complain. The problem is that ext4 managed to destroy data that was already on the disk. That is unacceptable.
The OS partition you are thinking of is 'root' or '/' - not '/root'.
It's an easy mistake to make, and one that Aigarius was having some fun with.
I'm willing to bet that a ton of people complaining about the standards compliance not being an issue, and "user's needs first..." etc are the same people who rip on Microsoft and IE for not being standards compliant on the web. It's funny how people are so inconsistent with their evangelism.
FHS specifies to use /usr/local or /opt for that.
Personally, I think we need a user level API standard that has those guarantees. This would most likely contain a wrapper around the POSIX compliant API, but may use a different approach for different filesystems that provide different guarantees.
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Just because it's stable (i.e., compilable, doesn't spew warnings, has survived poking) doesn't mean it's mature enough for major distros to rely upon.
it is experimental at best.
Whether the kernel folks issued the appropriate disclaimers on it or not, it still lies upon the distros not to include code that is unproven/brand new/reasonably suspect.
This code may be stable, but it is definitely green.
everybody rehash the exact same arguments as when the original article appeared!
Ah, too late. You already did.
The application programmers aren't at fault here, the POSIX spec is. A filesystem is essentially a hierarchical database, yet POSIX doesn't include a way to make atomic updates to it.
The idea of what a file system is or is not has changed over the twenty years that POSIX has been around. The reason there have been no atomic updates with POSIX is that there have been no file systems able to do it, except maybe BSD's LFS, and more recently ZFS.
These use COW, so you always have consistent on-disk structures and so you can be atomic. Ext1, not atomic; ext2, not atomic; ext3 and 4, not atomic; UFS, not atomic; FFS, not atomic; XFS, not atomic; NFS, not atomic.
The reason why POSIX does not specify "atomicity" is that until recently it wasn't available.
Now, with all of these new COW file systems coming up (ZFS, btrfs) we can discuss the possibility of a new POSIX API.
Well, I must concede that I've only gotten as far as the POSIX standard in my computer science curriculum, so I'm not as familiar as I could be with system workings at the operating system level. I certainly agree with you that placing hardware specific code in a part of the operating system meant to generalize the algorithmic interaction with mass storage devices makes very little sense.
My understanding is that there is a logical representation of the bytes available on a physical disk (at which level the file system operates), and device drivers and hardware components translate that in some fashion into physical bytes, possibly generating this translation on the fly rather than as a simple bijection of "fixed logical byte maps to fixed physical byte". Wouldn't an algorithm implemented at these lower levels still be able to use the fact that more data is being written at a time to make more intelligent decisions about where to physically place that data?
instead, because it's an article about Ext4, there's a load of Ext4-bashing and jokes. it would be quite unfair if it was full of Microsoft-bashing.
No, both of those are, implicitly, expected to be world readable, and at least usually for software that any user can run (to some degree of success). /root is the only place for root to put a local application (or any other files) that he doesn't want a user to be able to see at all.
In which case if you do not want a zero length file then you NEED to sync the file before renaming over the old one or not be dumb and truncate/write to a single file.
POSIX makes no guarantees about file data and metadata being in sync so you need to make sure they are with an fsync() before you commit to using the new file with a rename().
Yes I would like that as well. It would remove the annoying need to figure out a temp filename and to do the rename.
One suggestion was to add a new flag to open. I think it might also work to change O_CREAT|O_TRUNC|O_WRONLY to work this way, as I believe this behavior is exactly what any program using that is assuming.
f = creat(filename) would result in an open file that is completely hidden to any process. Anybody else attempting to open filename will either get the old file or no file. This should be easy to implement as the result should be similar to unlinking an already-opened file.
close(f) would then atomically rename the hidden file to filename. Anything that already has filename open would keep seeing the old file, anything that opens it afterwards will see the new file.
If the program crashes without closing the file then the hidden file goes away with no side effects. It might also be useful to have a call that does this, so a program could abandon a write. Not sure what call to use for that.
Calling fsync(f) would act like close() and force the rename, so after fsync it is exactly like current creat().
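Nothing like this exists in the kernel as far as this thread is concerned, but the proposed semantics can be roughly approximated in user space today with mkstemp() and rename(). The sketch below is only that: the names atomic_create, atomic_commit and atomic_abort are invented, the temporary file is visible under its temporary name (unlike the truly hidden file proposed), and a crash would leave a stale temp file behind.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical handle for a file that only appears under its real name
 * when committed. */
struct atomic_file {
    int  fd;           /* descriptor to write the new contents to */
    char tmp[4096];    /* temporary name, same directory as dest */
    char dest[4096];   /* final name */
};

/* Plays the role of the proposed creat(): readers of `dest` keep seeing
 * the old file (or no file) while we write. */
int atomic_create(struct atomic_file *af, const char *dest)
{
    assert(af != NULL && dest != NULL);
    if (snprintf(af->dest, sizeof af->dest, "%s", dest) >= (int)sizeof af->dest)
        return -1;
    if (snprintf(af->tmp, sizeof af->tmp, "%s.XXXXXX", dest) >= (int)sizeof af->tmp)
        return -1;
    af->fd = mkstemp(af->tmp);   /* replaces XXXXXX with a unique suffix */
    return af->fd < 0 ? -1 : 0;
}

/* Plays the role of the proposed close(): flush, then rename into place. */
int atomic_commit(struct atomic_file *af)
{
    int ok = fsync(af->fd) == 0;
    ok = (close(af->fd) == 0) && ok;
    if (ok && rename(af->tmp, af->dest) == 0)
        return 0;
    unlink(af->tmp);
    return -1;
}

/* Abandon the write: the old file is left untouched. */
int atomic_abort(struct atomic_file *af)
{
    close(af->fd);
    return unlink(af->tmp);
}
```

A caller would do atomic_create(&af, "config"), write() to af.fd, and then either atomic_commit(&af) or atomic_abort(&af). Note that mkstemp() creates the file with mode 0600, so a real implementation would probably want to copy the old file's permissions.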
Completely aside from your point, with which I agree, I'd like to mod you up just for spelling "lose" correctly.
You are in a maze of twisty little passages, all alike.
And how would one make renaming all files (an arbitrary number) in a directory atomic? POSIX does support an atomic way of making changes. It might be somewhat rudimentary but it does work no matter if you are using local filesystems or something like NFS. The behaviour would actually work very well with NFSv4 with the parallel data storage and separate metadata options. Forcing POSIX to be atomic in the general case would preclude options such as this and drag performance of high-end systems down just because 16 year-old developers can't be arsed to code properly.
As about 6000 people have tried to point out to all you clueless "fsync!" posters, fsync() will kill performance unacceptably. It forces far more to happen than the programmer wants. We only want the order preserved, it is ok if the data write is delayed for a long time.
The fact that "POSIX allows this" is completely bogus. POSIX allows the file to be deleted when you read() it, as long as it is written back when the disk is unmounted. That does not mean that programs should all call unmount all the time just because such stupid behavior is possible!
Who says Linux isn't heading that way? Most existing transactional filesystems that I know of have license issues for inclusion in the Linux kernel, though I think Btrfs is both transactional and currently supported in the kernel, though not yet production-ready and stable.
One might note that exactly this "fixed logical byte maps to fixed physical byte" mapping isn't a bijection. Even normal hard drives have an amount of reserve blocks, afaik 10-20%. As soon as the drive has trouble reading a block it maps it to a reserve block. This is completely invisible to the OS, so a hard drive which has worn out 5% of its blocks would still appear brand new (though you may see a performance decrease reading a file written to seemingly consecutive blocks).
The device electronics are the ideal place for such wear leveling algorithms. Sure, it's theoretically possible to place those algorithms at the driver level. But then every device would require this knowledge for their drivers. Just think of small and embedded devices for a moment to see why that's a bad idea.
This is actually even stupider for flash drives. There is essentially zero seek time on a flash drive, so, in theory, it shouldn't really matter how much you write at any given time (since the only delay should be how long it takes to actually write the cell).
In addition, presuming reasonable wear algorithms (which should be implemented in the device controller, not in any sort of software), every bit of math I've seen says that for any realistic amount of data writes the flash drives will last substantially longer than any current physical drives (last I saw it was about 30 years if you wrote every sector on the disk once a day, scaling down as writes increase). Even writing 6 times the volume of the drive per day, that's 5 years, which is a fairly long time for consumer grade physical drives, and unlike a physical drive, even if you can't read it, you can write it so you can just clone it over to a new drive.
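The arithmetic behind those two figures is simple enough to spell out. This sketch assumes 365-day years and perfect wear leveling, and takes the 30-year figure above at face value rather than from any drive's datasheet; the function names are made up.

```c
#include <assert.h>

/* Total number of full-drive writes implied by lasting `years` at
 * `writes_per_day` full-drive writes per day (365-day years assumed). */
double implied_full_drive_writes(double years, double writes_per_day)
{
    return years * 365.0 * writes_per_day;
}

/* Expected lifetime in years for a drive that survives `total_writes`
 * full-drive writes when written `writes_per_day` times per day. */
double lifetime_years(double total_writes, double writes_per_day)
{
    assert(writes_per_day > 0.0);
    return total_writes / (writes_per_day * 365.0);
}
```

So 30 years at one full write per day implies 30 * 365 = 10950 full-drive writes, and at 6 full writes per day that budget lasts 10950 / (6 * 365) = 5 years, matching the numbers quoted.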
File systems will definitely have to change for flash drives, but delaying writes probably isn't going to be the way to do it, especially since there's no need to do so.
Linux creator Linus Torvalds began the discussion saying, "a 'spec' is close to useless. I have _never_ seen a spec that was both big enough to be useful _and_ accurate. And I have seen _lots_ of total crap work that was based on specs. It's _the_ single worst way to write software, because it by definition means that the software was written to match theory, not reality."
http://kerneltrap.org/node/5725
It's not exactly how it's being described by the GP.
Yes, it's technically accurate, but it's not the point. Ext4 extends the delay between writes out from a maximum of 5 ms to well over a second.
Yes, you read that right: if you write data in an ext4 file system it won't be on disk until an amount of time you can actually count has passed.
The only reason this gives any kind of performance benefits at all is because most applications are not calling fsync(). The resolution to the data loss this is causing is for pretty much every application to call fsync() a whole lot of the time, which will probably end up with data being written to the disk even more inefficiently than it was before.
Just because the POSIX standards say it, doesn't mean it's right. POSIX is very old now, and was based around technological ideas which are out of date now.
There is no such thing as an atomic disk write operation, so your proposed step (3) is working on a bad premise. "One operation immediately followed by the other" is no help either--you have no way of knowing which bytes out of those two writes will and won't be there if there's a crash in the middle of that write.
Let's say you queue writes to two 4096 byte sectors. The power goes out. What made it to disk? Just one sector? Both? The first half of the second one, because the drive reordered the writes based on where the disk heads were at, and it got half the sector written before the capacitors in the drive fully discharged? You have no idea at all what you got, which is why filesystem designers avoid even thinking like this.
The only thing you can do is provide a mechanism to confirm whether a write was successful or not before moving on to a second one. fsync provides such a mechanism, which is why discussions of this issue invariably wander into talking about it.
You're right, we don't need a filesystem which sledges in every byte in case there's a crash, and which handles important data differently.
At the same time we don't need a file system which takes so long to write data to a disk that every program has to treat its data as important so we end up with a system which does hammer every byte onto the disk in the case of a crash.
1.5 seconds is stupid in a PC which can perform fifteen thousand operations in that time span, and when everyone has to use fsync() it won't be any faster.
I had a silent wish Ts'o would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.
Exactly; the code as written right now has an ugly race condition in it. The best thing you can do when you have one of those is make the conditions under which the problem occurs much more common, so that you get the right feedback for fixing it correctly.
Except even that's not enough, and risks data loss. Consider the following from the OS X man page for fsync:
This is almost certainly done in the name of performance, just as at the FS level. Which raises the question of whether, down the road, further "performance improvements" will create a need for F_FULLFSYNC_NO_REALLY_I_MEAN_IT, F_FULLFSYNC_JUST_WRITE_THE_DATA_ALREADY, etc.
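For the curious, the OS X mechanism being referred to is an fcntl() command called F_FULLFSYNC. A portable program might wrap it roughly as below; the function name flush_to_media is invented for this sketch, and the fallback to plain fsync() on other systems is just one reasonable choice.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical wrapper: ask for the strongest flush the platform offers.
 * Returns 0 on success, -1 on failure. */
int flush_to_media(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    /* On OS X, ask the drive itself to flush its buffers. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not every file system supports it; fall back to fsync() below. */
#endif
    return fsync(fd);
}
```

On systems that do not define F_FULLFSYNC this compiles down to a plain fsync() call.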
You do realize that /root is different than /, right?
As far as I know, you can't, with POSIX. What you COULD do, however, is have a small battery on the drive and dollops of RAM (say a couple of gigs) dedicated to queueing diskbound traffic. All transactions are dumped to the queue. If the power fails, the queue is intact and can continue being run to the drive when power is restored. This would be good for as long as the battery can maintain the RAM (no reads, no writes, no mechanical devices, no software or CPUs, just the RAM).
An alternative is to have a processor on the drive and shift the filesystem(s) over to that as a program, same way SETI@Home can offload some maths to the GPU. The advantage there is that the filesystem can then spend as long as it likes sorting things out, as it's not tying up the main CPU doing so. Again, it requires a fair chunk of RAM on the drive, and again you'd want to battery-back that RAM dedicated to data to the drive (you don't need to preserve anything else).
Either way would bypass the POSIX issue - to a degree, at least - because it's no longer FS semantics that handle the communication but the virtual FS layer to logical FS layer, and that's not POSIX-specified and can therefore be whatever the inventor of the hardware wants it to be.
(If it's a Linux programmer, the obvious protocol would be the existing virtual-FS-to-logical-FS API that Linux currently uses. Of course, Linux would be the only OS that could use those hard drives for some time, and Windows users likely never could.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
and unlike a physical drive, even if you can't read it, you can write it so you can just clone it over to a new drive.
Did you just get 'read it' and 'write it' backwards or do I need more coffee? O.o Mah head hurthth.
I agree about delaying writes not making any sense for flash drives, though - as you said, there's no seek time, so there's no advantage to buffering then squirting a bunch of physically sequential writes.
This was one of the first real-world uses that I saw of ultracapacitors. An ultracap can store just enough power to get the buffer written and the disk parked before it shuts down. Of course, your solution works pretty well too, with a lithium button cell probably able to keep the ram refreshed for at least a day or two.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
"Why can't it be done this way instead? "
The file metadata (ie: file length) may not be stored in the same place as the file data. Trying to make this operation atomic interferes with any optimization targeting the reduction of head seeking.
I can't wait for Fedora 11 to come out. EXT4 will be the default file-system. Seems like the KDE 4.0 fiasco all over again.
http://fedoraproject.org/wiki/Features/Ext4DefaultFs
You could probably fill a library with what I don't know about low-level filesystem details, so please correct me if I have misunderstood this.
There ought to be plenty of room in the Bush Library...
Those benchmarks are pretty interesting, but it seems to me that an overwhelming majority of those posted later in the article are CPU-bound operations rather than disk-bound (or GPU, as may have been the case with UT2004).
To be fair both to ext4 and the other file systems, I can't really see the benchmarks you linked to as being representative of real world operations. I'd be hesitant to make my judgments of the merits of a given FS based upon that alone; after all, I personally don't create 4GiB+ files regularly (perhaps someone who does video editing might). I think it would be far more useful to test a file system's capability for reading, writing, creating, and deleting hundreds of smaller files (e-mail and web service load profile) or perhaps the average time taken to load a specific application or series of applications (great for most general usage, such as word processors and the likes). Perhaps I just overlooked that in the article...
I do seem to remember a benchmark some time back that tested things similar to what I mentioned regarding small files between ext4 and reiserfs. It would be wonderful if someone benchmarked both of those file systems in addition to ext3, XFS, and others. Perhaps I'm stricken with excessive skepticism, but the benchmark linked by the OP smells too artificial for my taste. ;)
He who has no
Pretty much every programming book presents a simplified view of the world. Because it teaches C, not systems. Try Rochkind's "Advanced unix programming" one day if you want something close to real world. fsync()'s fairly portable and can be redefined to noop where needed, so I don't see a problem there.
:) You cannot be both fast AND transactional in every operation.
Of course "Adding fsync() all over the place wouldn't fix anything". On the other hand, adding it where it's supposed to go will.
It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.
Patches to work around buggy applications which don't call fsync() have been around long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.
And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc ---- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.
If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.
Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.
You knew about the bug, where it is, how to reproduce it and why it exists, know enough about programming to teach a class, yet are hoping that 'they' fix it??? Why not make the fix yourself, publish it in the proper channels (kde mailing lists/bug tracker) to get it in there, and get eternal fame?
Prove that you know what you're talking about and make many others happy. Please? Pretty Please! With sugar on top. And a cherry.
Practice what you pr^H^Hteach.
A) Never mind the POSIX specs, wouldn't it be better if operations were performed in order? Isn't it a problem that a rename can reach the disk before an operation the app issued earlier? Never mind the delay in the actual disk update.
B) The apps should be calling fsync! OK, here's what I don't get: if the point of this change was to improve performance by reducing disk writes, isn't it a little counterproductive that we are basically asking apps to force a disk write every time they "write" something? Sounds a little self-defeating to me.
Well, it could just as well be that I do not understand the issue correctly.
Copyright infringement is "piracy" in the same way DRM is "consumer rape"
Just because it's stable (i.e., compilable, doesn't spew warnings, has survived poking) doesn't mean it's mature enough for major distros to rely upon.
it is experimental at best.
Whether the kernel folks issued the appropriate disclaimers on it or not, it still lies upon the distros not to include code that is unproven/brand new/reasonably suspect.
This code may be stable, but it is definitely green.
Good god, you're serious!
Here's a quarter, kid -- buy yourself a real operating system.
As far as I know Intel flash drives use NCQ for this. The idea is that you can keep a bunch of requests pending until you either have one erase block's worth or you hit a timeout.
In fact waiting for a write buffer to fill up a bit before flushing it to disk is actually quite similar conceptually to the original justification for NCQ, that the drive can sort the requests using an elevator seek algorithm.
Even better, I think it's done in a way that doesn't lose user data. Presumably, with something like NTFS, the journal keeps track of a pending transaction so it can be rolled back if the system fails before it is completed.
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
That's true at the raw flash level. It's not true at the LBA level that filesystems operate. Since there are loads of filesystems that expect to be able to keep overwriting things, at the LBA level all flash disks do some kind of wear levelling.
Isn't it usually a bad idea to compensate for a lack of a file system feature by adding hardware though?
Or you could implement atomic renames in software, instead of doing it in hardware...
True, but if there's a sizable performance boost to be had by moving the bounds a little, then it's a fair tradeoff. Responsibility has been steadily moving away from the CPU (and OS) anyway, out to the peripherals as computers become more decentralised and peripherals become more autonomous. How is delegating write caching to the drive any worse than delegating video decryption to the video card or network packet buffering to the Ethernet hardware?
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Well, sure.
There is a reason drives have caches; that's behavior introduced with IDE (the I stands for integrated: the controller moved onto the drive).
The main difference here is that a) the drive doesn't bluescreen and b) it should have enough power stored to flush the cache (and park the head) in case of power failure.
I'm not ranting about the idea of using caches. I'm just opposed to doing it someplace it doesn't belong and where it causes more harm than good.
Also, there seems to be a weird assumption here that a large block of data actually ends up in one "large row" on the device. This isn't necessarily the case for a hard drive (the older the drive, the more sectors will have been remapped to reserve sectors) and definitely shouldn't be the case for a flash drive, which can map sectors even more efficiently since it doesn't even have a seek time.
For many (most?) Unix admins, /root is just a nicer way to specify "/ filesystem" or "root filesystem". The path /root for root user's home directory is popular in Linux, but I never saw it in the Unixes I've used (but I don't know if that custom is a Linux invention.)
``So, while I'm all for applications using fsync when it's really needed, the last thing I'd like to see every application on the planet sprinkling their code with fsync "just to be sure".''
In other words, there is no substitute for doing it right.
Please correct me if I got my facts wrong.
``3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.''
This is, indeed, the central issue. And it's an issue not just with filesystems, but with anything that involves concurrency or transactions (multiple things that belong together). And it's an issue that few programmers get right.
The problem is that things often appear to work if you've not done them right. As long as there is only one thread and everything that thread does is performed successfully, there isn't a problem. So the program works, let's ship it. But then, in some Real World situation, there are suddenly multiple threads and/or failing operations. This can lead to spectacular failures.
The question is, of course, who gets to bear the burden for fixing the problems. Personally, I think there is an opportunity here for programming language and library designers to create APIs that make it easier to do the right thing, and harder to do the wrong thing. But given a specific API, it's up to the programmer to use it correctly. That doesn't mean you can't complain about the API and demand better APIs, but it does mean you can't complain when something that implements the API (correctly) does not magically make your broken code do the right thing. (Not that I am claiming that is what happened in this case - I don't know enough about it to be able to judge that.)
``For EXT4, if it crashes, you can get a zero-length file, with both the old and new data lost.''
Since you seem to know what you are talking about, I'm asking you. What is it that causes this data loss? Why does it happen with ext4, but not with other filesystems?
An admin simply calls it '/'.
Such an algorithm should take milliseconds, not minutes.
I think he's right in the first half. Let data integrity and retention be the default. Pleeease! I know I should RTFM, but raise your hand if you read an ext3 manual before installing. Please don't just trash other people's work/life and then laugh 'RTFM'.
If you need extra hardware (drives with ultracapacitors or computers with a UPS) to run Linux, suddenly a Windows OEM license seems a lot cheaper.
Can I be a dick and hold all writes in memory until I get an fsync and call it POSIX compliant?
Hold data for five minutes without an fsync?
One minute?
Maybe POSIX is just not good enough.
I did flip read and write, long day.
Yes, and from your email address, you really ought to know that. Or did they ditch their Unix labs?
Finally! A year of moderation! Ready for 2019?
I just want to say that I really like that idea, apart from the fsync call. fsync should not have any effect on correctly working (non-crashing) systems. An application should be able to call fsync at any time and not change anything (except ruin performance).
Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, he calls fsync().
If you don't, you risk losing data. Plain and simple.
You mean a programmer actually RTFM ? That's crazy talk !
May contain traces of nut.
Made from the freshest electrons.
You make the wrong assumption that writing to flash is as fast as reading. That is the case for hard disks, but on flash a write takes about 10 times longer than a read.
You will not notice this until your write cache is full. Good controllers, like Intel's, will hide the slow writes for a long time.
Also, most benchmark programs are designed for disks and have writing disabled. That does not matter much for disks, since writing is the same speed as reading. It matters a lot for flash.
But any write that is prevented, perhaps because it was delayed, is a good thing for flash drives.
Does, for example, Firefox claim full POSIX conformity? Does POSIX demand that files be fsync()ed? No, it doesn't, but you may experience data loss in case of a system crash.
I didn't read the POSIX spec, and I can be fairly sure that most FLOSS developers didn't either. And if they did, they surely don't remember every paragraph and every sentence. Not enough to be aware of it at every stage of the development process.
And now a question, does fclose() also call fsync()? That is, if I close the file, can I be reasonably sure that it'll be written to the HDD immediately?
I happen to agree with Ted and yourself from a technical standpoint. But where Ted stuffed up is that from ext3 to ext4 he moved the goal posts and didn't communicate effectively to the community what effect this would have. Nor did he explore beforehand the consequences of moving said goal posts on software widely in use. He's since done the right thing, which is to give people a choice; and the public fallout from all this has done the communication for him. But he could have saved himself a public ear bashing if he'd gone about this differently. Hopefully he'll remember this next time.
I usually use: "/ (root)" to cover those in the know, and those who aren't.
POSIX specifies that closing a file does not force it to permanent storage. To get that, you MUST call fsync().
So the required code to write a new file safely is:
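A minimal sketch of that sequence, written in Python on the assumption of a POSIX system (os.open, os.write and os.fsync are thin wrappers over the system calls of the same name). The helper name write_new_file_durably and the example file name are hypothetical, and syncing the parent directory is an extra step commonly recommended so that the new directory entry itself survives a crash.

```python
import os
import tempfile


def write_new_file_durably(path, data):
    """Create `path`, write `data`, and force it to stable storage.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:                        # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                       # close() alone does not force data to disk
    finally:
        os.close(fd)

    # The new name is stored in the parent directory, so sync that as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Usage: write a small file into a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    write_new_file_durably(os.path.join(scratch, "settings.conf"), b"key=value\n")
```

If only the file contents matter and not metadata such as timestamps, os.fdatasync(fd) can stand in for os.fsync(fd) on systems that provide it.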
There is no performance problem because fsync(fd) syncs only the requested file. However, that's in theory... use EXT3 and you'll quickly learn that fsync() is only able to sync the whole filesystem: it doesn't matter which file you ask it to sync, it will always sync the whole filesystem! Obviously that is going to be really slow.
Because of this, way too many software developers have dropped the fsync() call to make the software usable (that is, not too slow) with EXT3. The correct fix is to change all the broken software and in the process that will make EXT3 unusable because of slow performance. After that EXT3 will be fixed or it will be abandoned. An alternative choice is to use fdatasync() instead of fsync() if the features of fdatasync() are enough. If I've understood correctly, EXT3 is able to do fdatasync() with acceptable performance.
If any piece of software is writing to disk without using either fsync() or fdatasync() it's basically telling the system: the file I'm writing is not important, try to store it if you don't have better things to do.
_________________________
Spelling and grammar mistakes left as an exercise for the reader.
Yes, I was thinking the same. You (very simply) need:
* existing file: foo.bar
Transaction One: Write foo_tmp.bar and sync
Transaction Two: Delete foo.bar, rename foo_tmp.bar to foo.bar and sync
For desktop applications you should sacrifice performance in order to ensure that the transactions are committed. I'd go as far as requiring implicit filesystem syncs within 5 seconds of the last call. IIRC the problem with ext4 was that it thought that leaving data unwritten for minutes was okay (presumably because of high read load, but if that's streaming data then the application should have a buffer so it's safe to interrupt to write some data every so often). In addition in low-battery or power failure (via UPS) situations all writes should be synced because power loss is an imminent possibility.
As for applications that write transient temporary files for later use, they could be using a ram drive (even the Amiga had these in 1985).
Actually, there's a deeper issue.
fsync() doesn't really mean what the POSIX spec says it means, and hasn't for a while.
Technically, fsync() means sit and wait until the data has been written to disk, and then return. Since the commit interval on this new filesystem is over a minute, using this view would mean that the application could hang for all of that time.
Because commit windows are now so long (even 5 seconds was a long time), filesystem authors have altered the behavior of fsync() to mean "write to the filesystem NOW." With this new meaning of fsync() and a pedantic view of the POSIX APIs, there is no longer a way to say "I want the old data or the new data after a crash, but don't really care which." (BTW, saying this with POSIX would require spawning a separate thread to do the writes.) Instead, the user is saying "write this to disk, NOW." That's a whole different set of guarantees.
If all of the applications start replacing "I want the old or the new data, but don't care which," with "I want the new data written, NOW," then THAT will REALLY prevent the kind of write optimization that ext4 is trying to do. Delaying writing the rename until after the data is written shouldn't hurt filesystem performance significantly at all. The only cost should be in the in-memory data structures necessary to track this dependency.
In this case, what the application writers are asking for is both good for system integrity AND good for filesystem performance. The alternatives (database, fsync) are all worse, not better.
(Aside: All of this applies to the atomic rename() method. Everyone agrees that using O_TRUNC on an existing file was just dumb.)
Blaming the application developers is a bit rich. In EXT3, fsync was not only unnecessary, it could also cause the system to freeze for 30 seconds, hence userland developers for Linux avoided fsync unless the data was not just important but *really* *really* important.
Also, application developers have to make many assumptions not explicitly spelled out in the POSIX specification. For example, POSIX does not explicitly specify that your machine has more than 16 bytes of RAM, that there is no "rm -rf /" in your initscripts, etc. It is stupid to trumpet a new delayed allocation scheme, and then say "unless you explicitly disable it, your filesystem may enter an inconsistent state", so make sure that you always disable it or it's *your* code that's buggy.
I also salute Mr. Ts'o... for recanting and fixing the damn bug. There is a more thorough discussion here.
For those of us who are not so familiar with the data loss issues surrounding EXT4, can someone please explain this? The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?" I.e. if I ask OpenOffice to save a file, it should do that the exact same way whether I ask it to save that file to an ext2 partition, an ext3 partition, a reiserfs partition, etc. What would make ext4 an exception? Isn't abstraction of lower-level filesystem details a good thing?
If you're old enough to remember back to how RAM above 640k was used in the DOS days, it was usually a RAM disk or disk cache (SmartDrv.exe). If you enabled write caching on SmartDrv.exe performance went way up, but of course you could lose data if you hit the RESET button before it had flushed.
... skip ahead a few years ...
Modern operating systems automatically cache data because it increases performance. Specifics of the size of the write cache and length of time before it's written to disk may vary, and each filesystem will have its own defaults.
EXT3 defaulted to committing data to disk after a maximum of 5 seconds. EXT4 increases that time to 150 seconds. (The exact numbers vary a bit, but you get the idea). Bottom line: When there is a system crash with EXT4 you notice losing data more often because there is a larger window of when data can get lost.
This is a very basic overview, but there are two groups weighing in on this:
Group 1: Things break under EXT4 that worked under EXT3!
Group 2: Look pal, it works fine. If you want your data committed right away so that you don't lose data, maybe you should be calling fsync() so that the OS knows to commit your data? Because you know what, even with EXT3 you have data loss. It becomes more noticeable with EXT4 because of the longer cache times, but the problem always existed!
Group 1: It worked before! And if we commit our data immediately, performance drops!
Group 2: It didn't really work before, in laptop mode the EXT3 write time increases to 30 seconds. The problem has always existed! If you don't like taking the performance hit of committing data immediately, perhaps you shouldn't be writing so many tiny files so often!
Group 1: But it worked before! EXT4 is broken!
Group 2: Okay, look. You're obviously not listening. Why don't we make EXT4 behave more like EXT3 and do some auto-commits. Poorly coded applications will not lose data as often, and properly coded applications will not perform as well as they could.
Group 1: I'm taking this to Slashdot. EXT4 is teh suxx0rz!
Group 2: *sigh*
Just because the POSIX standards say it, doesn't mean it's right. POSIX is very old now, and was based around technological ideas which are out of date now.
POSIX is being very flexible here, and rightly so. Ext3 and ext4 represent different trade-offs between two extremes. On one extreme you have very high data reliability, where data immediately goes to the disk and it is rare that anything is lost. You can accomplish this by mounting with sync. However, this has terrible performance.
On the other extreme we have great performance because the OS never ever writes to the disk. It would be like a live CD (ignoring CD/DVD reads). However, if the system crashes, you lose everything.
Real filesystems go somewhere in the middle, trading some reliability for performance. Ext4 just shifts things more towards performance than ext3. If POSIX was more rigid, we wouldn't have the choice of where to make the trade-off without breaking the standards. It would be poor decision making had POSIX chosen some arbitrary time limit for disk writes. POSIX isn't this way because it is old, but because they were being flexible in preparation for the future. That was good forethought.
If you don't like the ext4 trade-offs, stick with ext3. Linux is rock solid, though, and I have seen a Linux kernel panic only once in my life. The only thing I have to worry about are power outages. And my cat stepping on the damn switch on the power strip. Shifting a bit towards performance sounds nice to me.
And if we ignore POSIX, then what do we have? We end up getting crap like the arbitrary, undocumented mess that is the Windows API.
The only reason this gives any kind of performance benefits at all is because most applications are not calling fsync().
This is good, as they are deferring to the filesystem, letting the user choose the trade-off. Apps need to do fsync() if the data is very important (i.e. system logs) or if the application is destroying important old data (writing over old config files, like KDE). I am sure there are a few other situations too. But if they are always calling fsync() for every write they are doing it wrong.
Or they could document that feature (which it seems everybody badly needs) and fix ext4... People could also commit to a middle ground, adding another system call for atomic moves, or another couple of opening modes.
Rethinking email
Because the standard wasn't fixed, just the code.
If it was that way round (vanilla Windows vs. hardware + Linux), sure. However, Linux wouldn't require the extra hardware to run, so you don't "need" it.
It would also be a fairer comparison to say that Linux + a capacitor + a rechargeable battery for RAM = Windows + a SAN box. And unless you're talking one serious capacitor, the SAN box is going to be more expensive.
Besides, battery-backed RAM and ultracapacitors wouldn't be OS-specific. You'd get a performance gain and a reliability gain on ANY OS that supported queued commands. At that point, any OS that failed to queue correctly (on the assumption the drive must be slow) would suddenly be much less reliable than any of its competitors.
Also, bear in mind that a rechargeable battery capable of preserving the disk's command queue is probably going to add fifty cents to every hundred dollars of disk. Unless you're buying disks in the same sort of numbers as Google, the overheads are going to be lower than the difference in disk price between stores. You'd never come close to reaching even a single Windows license fee.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
It's not really compensating for a defect in the software, because an ultracapacitor or battery-backed RAM will work on all filing systems, and indeed on all Operating Systems.
Even the most extreme solution I offered is nothing more than turning a local hard drive into NAS where the "network" just happens to be the internal bus. It has all the benefits of NAS (such as the queue not being corrupted when you power down the machine, not tying up the main CPU, and so on), but at a fraction of the cost (because you don't need the compute power or a whole new network + NICs).
You could also argue that it's not "adding" hardware, since all IEEE 488 (and many SCSI) hard drives used to be intelligent peripherals. Rather, all more modern hard drives are cut-down. It's like the Winmodem. Nobody argued that "full" modems were Winmodems + added hardware. Besides, a 50 cent capacitor and a torch battery are hardly extensive hardware mods.
Nor is this really a departure from current design methods. SCSI drives are forever increasing the size of the queues they support and they are already nominally intelligent peripherals. The most that could be said is that this suggestion extends the idea to ATA and SATA drives and replaces the typical absurd 16-command queue with a 32768-command queue.
This idea isn't heavyweight, like UPS. The drive would not run without mains power. Rather, the drive would retain all commands and finish off whatever it had been doing when power is restored. No need for whopping big batteries and expensive extras to keep the mechanical bits going. All you need is to stop the RAM decaying, just enough juice to keep the dynamic RAM refreshing, nothing more. One, maybe two, rechargeable camera batteries should be enough to handle most situations.
And if there's corruption after that, well, even a "perfect" FS wouldn't have given you any better retention.
I was thinking of one transaction, where the file is truncated and new contents are written. Either both entirely happen, or neither do.
OTOH, I think you're on to something with a simplified transaction system that allows atomic pairs of operations that can be used to implement things like this, e.g. replace one file with another, ensuring it's written completely on commit.
If you open file A and write some data to it, then close it, it does make sense that if it crashes and you look at the disk, you might see A as being zero length. It was zero length at one time, so this is an expected state to see it in.
However if you then rename A to B, it in effect does the rename before finishing the data. So now B can be a zero-length file.
This is really unexpected, and not good, because most programs doing this actually copied a lot of interesting and useful data from B to A, changing just a little bit (imagine a configuration file where one flag is switched, or a text editor saving the new version). All previous Unix systems when they crashed (if you ignore ones where the disk was left uselessly trashed) would have B either be its old version or have the full data that was written to A, and programs relied on this. It is very, very useful to be able to rely on this.
People saying "fsync!" do not understand the problem, and their solution will make performance dreadful, as bad as EXT2 or worse. The desire is to have *either* the old or new data. If the old data is still on the disk, well then a configuration or some editing was lost, which seems quite understandable considering your machine just crashed. But not the whole file including data that was correctly on the disk for possibly a year before the crash!
I may not have explained it right. What fsync would do is cause the currently-written file to appear at the name, but it would remain open and writable. Doing creat() immediately followed by fsync() would be like creat() is today.
The reason is that otherwise the fsync() is useless. Nobody else can see the file you are writing, and if the system crashes the file you are writing had better be completely gone when it is brought up (the previous file with that name would still appear). So calling fsync() unless it makes your file appear would serve no purpose.
You could also argue that it's not "adding" hardware, since all IEEE 488 (and many SCSI) hard drives used to be intelligent peripherals. Rather, all more modern hard drives are cut-down. It's like the Winmodem. Nobody argued that "full" modems were Winmodems + added hardware.
Actually, Winmodems are a better example than the hard drive one. Microsoft developed software to run on the host CPU to emulate a modem. As far as I can see they turned the modem hardware itself into a soundcard; AMR cards were in fact a Winmodem and a soundcard. Then they had a proprietary standard where the Winmodem manufacturers could write a driver to work in their environment to abstract away the hardware differences.
The laptop I'm writing this on has a Winmodem, a Motorola SM56 on the HDAudio bus, which seems to be a descendant of this sort of technology. What's clever from the Microsoft point of view is that you can make a cheap modem, if you're willing to make it Windows only.
Of course, these days modems are essentially obsolete, but Winmodems are so cheap that they were still putting them into laptops a year ago when I bought this machine. Now imagine if the same situation happened with hard drives, where you could run Windows on a dumb one that was cheap to make but Linux required a more expensive, smarter one. Laptops being cost sensitive and Windows being the common case, you'd quickly find that most laptops came with the dumb drive and Linux would be unable to run on them. With a laptop it's not like you could change the hard drive controller either.
The bug does NOT exist in Ext3 with its default mount options, because it saves data in ordered mode. That means that it will either flush everything to disk up to a certain point in time (meta data and data alike) or nothing at all.
The problem with Ext4 is that they decided it would be a good idea to have 2 time lines: one for meta-data, and one for other data. It's possible those two are NOT in sync (as in, meta-data has been flushed up to point X in time, while data has only been flushed up to point Y in time). Since meta-data is usually tiny in comparison to actual data, I don't even see why they would do this. Just do not flush meta-data until you have to actually flush real data as well. Problem solved.
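To make the two-timelines argument concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the function crash_view, the event tuples and the flush counters are all made up for illustration and do not reflect how ext3 or ext4 are actually implemented. Metadata operations (create, rename) survive a crash if they fall before meta_flushed, and data writes survive if they fall before data_flushed.

```python
def crash_view(initial, events, meta_flushed, data_flushed):
    """Return the {name: contents} a reader would find after a crash.

    Toy model only: events[:meta_flushed] metadata operations and
    events[:data_flushed] data writes are assumed to have reached the disk.
    """
    files = dict(initial)
    for i, event in enumerate(events):
        kind = event[0]
        if kind == "create" and i < meta_flushed:
            files[event[1]] = b""                  # new, empty file
        elif kind == "write" and i < data_flushed and event[1] in files:
            files[event[1]] = event[2]             # data blocks reached the disk
        elif kind == "rename" and i < meta_flushed and event[1] in files:
            files[event[2]] = files.pop(event[1])  # replaces the target name
    return files


INITIAL = {"foo.bar": b"old"}
EVENTS = [
    ("create", "foo_tmp.bar"),
    ("write", "foo_tmp.bar", b"new"),
    ("rename", "foo_tmp.bar", "foo.bar"),
]

# Two timelines: all metadata flushed, no data flushed yet.
print(crash_view(INITIAL, EVENTS, meta_flushed=3, data_flushed=0))
# {'foo.bar': b''}

# One timeline (ordered): foo.bar is always either the old or the new version.
for point in range(len(EVENTS) + 1):
    print(point, crash_view(INITIAL, EVENTS, point, point))
```

With a single flush point, every crash leaves foo.bar holding either the old or the new contents; the empty file only shows up when the metadata timeline runs ahead of the data timeline.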
Furthermore, if you indeed did change all applications to call fsync() when needed, performance of Ext4 would be worse than current Ext3 performance in ordered mode.
As a filesystem author, I can tell you that calling fsync() for whatever reason is ALWAYS a huge performance hit. The only thing applications expect is that things are ==SEEMINGLY== done in order (as in a time line). Ext4 can send stuff over the internet for all I care, but when my application asks it to do A, B, C, then no matter what happens, I can NEVER EVER end up with A & C after a crash. The only acceptable states after a crash are nothing at all, A, A+B or A+B+C.
That still leaves plenty of room for writing things out-of-order and doing delayed block allocation, because, as long as order is guaranteed, I don't care if things happen like this:
A (10 minutes pause) B + C
The order of the actions is important, not when they get flushed (if ever), as long as no FUTURE events are flushed first without flushing all preceding events. I could write a filesystem that only touches the disc every 30 minutes (given sufficient memory) and still be able to preserve this simple basic expectation.
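The ordering rule above boils down to "only prefixes of the issued operations may survive a crash". A small Python sketch of that check follows; the function names are hypothetical and the operations are just labels, not real filesystem calls.

```python
from typing import List, Sequence, Tuple


def acceptable_crash_states(ops: Sequence[str]) -> List[Tuple[str, ...]]:
    """Every prefix of the issued operations, from nothing to everything."""
    return [tuple(ops[:n]) for n in range(len(ops) + 1)]


def is_acceptable(ops: Sequence[str], survived: Sequence[str]) -> bool:
    """True if what survived the crash is a prefix of what was issued."""
    return tuple(survived) in acceptable_crash_states(ops)


issued = ["A", "B", "C"]
print(acceptable_crash_states(issued))
# [(), ('A',), ('A', 'B'), ('A', 'B', 'C')]
print(is_acceptable(issued, ["A", "B"]))   # True
print(is_acceptable(issued, ["A", "C"]))   # False: C survived without B
```

Nothing in this check says when the operations reach the disk, only that a later one never gets there ahead of an earlier one.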
Transaction One: Write foo_tmp.bar and sync
Transaction Two: Delete foo.bar, rename foo_tmp.bar to foo.bar and sync
I think that might be the wrong order for transaction two, unless you can accomplish all of that with a single write to the disk. Might it be better to rename foo_tmp to foo.bar, then delete the original? That way, if the power were to fail between the delete and the rename cycle you would still have a file.
Or, with a journal, you could do it in three steps, just like you said:
1: journal that foo_tmp.bar will become foo.bar if foo.bar is gone.
2: write foo_tmp.bar and sync.
3: delete foo.bar, rename foo_tmp.bar, and sync.
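For comparison, here is a sketch of the write-then-rename pattern discussed throughout this thread, in Python on the assumption of a POSIX system. It skips the separate delete step: rename() onto an existing name replaces it in a single operation, so there is never a moment without a foo.bar. The helper name replace_file_atomically is hypothetical; os.replace, os.fsync and tempfile.mkstemp are standard library calls.

```python
import os
import tempfile


def replace_file_atomically(path, data):
    """Replace `path` so a reader sees either the old or the new contents.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()                 # drain Python's userspace buffer
            os.fsync(tmp.fileno())      # data on disk before the rename
        os.replace(tmp_path, path)      # rename(): swaps the name in one step
    except BaseException:
        try:
            os.unlink(tmp_path)         # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise

    # Sync the directory so the rename itself is on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Usage: overwrite foo.bar in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "foo.bar")
    replace_file_atomically(target, b"first version\n")
    replace_file_atomically(target, b"second version\n")
```

Note that mkstemp creates the temporary file readable and writable by the owner only, so a real application would also need to copy the original file's permissions.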
True if Linux required it. But let's say Linux'll run on any hard drive, but the "super hard drive" will let you run a desktop or server Linux install that is much more robust.
Laptops generally have power regulation, so can flush buffers and do all other I/O in a controlled way if the main battery reaches a critical level, so I don't see any value in adding robustness that would never get utilized on such a system. Although, if it's only going to add $0.5 - $5 to a hard drive, the value of the selling point would likely see it ending up in most laptops anyway.
I could say the same things about windows vista...
At least linux is honest when things foul up unexpectedly.
1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.
POSIX explicitly requires that data be flushed to disk when fsync is called. This wears out SSDs and forces HDDs to spin up, which wears them out too. So while you could write an fs that ignores fsyncs (MacOS X more-or-less does this), that fs would not be POSIX compliant.
So, if your application wants to "delete this file after this other file has been deleted, without triggering a spinup", using fsync is *explicitly* wrong. Under *any* POSIX compliant fs this will force a spinup.
If you do not use fsync, your app will be compatible with at least some POSIX compliant filesystems, including ext3 with data=ordered.
2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.
I understand that when running on battery this is usually increased to 15 seconds, which is probably enough to stop your hard disk dying before its time.
4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.
Good. Because we do not want the filesystem to be synchronized with disk (cache?). This forces a spin up. All we want is for atomicity to be preserved during a crash; this does not require a spin up. Note, for example, that if the drive does not spin up before the crash then neither will the old file be deleted nor will the new file be written.
Real Disadvantages: You risk data loss with any application that stores critical data using either (1) a truncate/write method or (2) a write/rename method without asking the OS to sync its data. I think that far fewer than 95% of applications fall under (2) and every filesystem will have issues with (1).
For (1) there is nothing the OS can do for the application; just about any file system would lose data in this case, depending on how long it caches the writes in memory and whether the application has a chance to finish writing all of the data. (1) is clearly bad application code at fault. Ext4 does increase the write delay for the data, but any way you use (1) is asking for problems if the system crashes, the disk fills up, etc.
For (2) the file system could implement atomic rename operations, but that would come at a slight performance loss when the application didn't need the atomic operation. This is more of a do-what-I-mean-not-what-I-say workaround, as I don't see too many situations where (2) would be used without expecting atomic operation. If the application didn't care about possible data loss in the file, (1) works well. The real fix, however, is to call sync() in the application code in this situation; it makes the code more portable across POSIX filesystems.
Yes, but the point is, the reason they leave it to the file system is that they presume the file system won't be stupid with their data.
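As a rough illustration of method (2) with an explicit flush, here is a minimal sketch in Python, whose os.fsync() and os.replace() wrap the POSIX fsync() and rename() calls. It flushes only the temporary file with fsync() rather than issuing a system-wide sync(). The function name atomic_write and the ".tmp-" prefix are made up for this example, and the directory fsync at the end is an extra precaution, not something the comments above ask for.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) using write / fsync / rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the rename
    # stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()               # push Python's buffer to the kernel
            os.fsync(tmp.fileno())    # ask the kernel to push it to disk
        os.replace(tmp_path, path)    # rename() on POSIX systems
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Extra precaution (an assumption of this sketch): fsync the
        # directory so the rename itself reaches the disk.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling atomic_write("settings.conf", b"key=value\n") leaves either the old or the new contents under that name, at the cost of the forced flush (and possible spin-up) that other posters here object to.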
There are thousands of posts in here talking about the fact that you're supposed to be calling fsync() and if you're not then it's just your own damned fault. I'm just pointing out that by forcing applications that didn't need fsync before to fsync, they're actually going to hurt performance, all in the name of a performance gain.
That's not the only issue. Another is that THERE IS NO FUCKING FSYNC in C.
fopen/fwrite/fclose/rename is the best you can do without diving into platform-specific stuff. From that POV any platform that reorders the renaming to happen before fclose is horribly broken. Fuck this bullshit, 9899:1990 comes first, POSIX next.
From your link: "If you want a platform independent program, avoid fsync". It seems unfortunate that OS implementers have so often interpreted this as "If you want a platform independent program, kiss your data goodbye."
It doesn't matter how long the delay is, so long as the OS doesn't reorder "delete old file" to occur before the "create new file". In ext3 in ordered mode, the OS doesn't reorder writes, so you are fine. (unless your hard-disk reorders writes)
I understand that fsync is useless when done on one of those temporary files, but I think that's preferable to it having a strange side effect.
Finally! A year of moderation! Ready for 2019?
There are mount options to make EXT4 behave the way you want for all applications. This is another good reason not to use one big filesystem for your whole disk.
My mail server has separate LVM mounts for the mail queue, the IMAP storage, the root partition, the logs, /var/lib, local and home so they can be mounted with different options (and independently grown as necessary).
On a home PC, this might mean mounting your /home directory with "just write it now, seriously" while /tmp and /var are more lax.
- Michael T. Babcock (Yes, I blog)
The rename /isn't/ happening before the data is written, unless you replay the journal, and you're not journaling data.
If you think about that, it makes perfect sense.
The rename is NOT happening on disk before the data is written to disk if the system is running normally.
If the system crashes, the log replay may rename the file without data because you're logging metadata (like renames) not data. Just turn on data logging and you'll be fine.
- Michael T. Babcock (Yes, I blog)
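The replay argument above can be illustrated with a toy model. This is a deliberately simplified sketch, not how ext4's journal is actually implemented: it assumes metadata operations (create, rename) always survive a crash, while file data survives only when data journaling is on. The function name replay_after_crash and the file names are invented for the example.

```python
def replay_after_crash(ops, crash_at, journal_data):
    """Return the directory contents seen after a crash and journal replay.

    ops: operations in the order the application issued them.
    crash_at: number of operations issued before the crash.
    journal_data: if False, only metadata (create/rename) survives;
                  if True, data writes issued before the crash survive too.
    """
    files = {"config": "old contents"}    # state on disk before the update
    for kind, *args in ops[:crash_at]:
        if kind == "create":
            (name,) = args
            files[name] = ""              # a new file starts at 0 bytes
        elif kind == "write":
            name, data = args
            if journal_data:
                files[name] = data        # data was logged, so it is replayed
            # otherwise the data was still only in memory and is lost
        elif kind == "rename":
            src, dst = args
            files[dst] = files.pop(src)   # metadata is always replayed
    return files


ops = [
    ("create", "config.tmp"),
    ("write", "config.tmp", "new contents"),
    ("rename", "config.tmp", "config"),
]
metadata_only = replay_after_crash(ops, crash_at=3, journal_data=False)
with_data = replay_after_crash(ops, crash_at=3, journal_data=True)
print(metadata_only)  # {'config': ''}
print(with_data)      # {'config': 'new contents'}
```

In the metadata-only case the rename is replayed onto an empty file, which is the zero-length result described in the article; with data logging the new contents come back along with the rename.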
THANK YOU.
http://outcampaign.org/