Ext4 Data Losses Explained, Worked Around

LOL: Bug Report by Em+Emalb · 2009-03-19 05:50 · Score: 5, Funny

User: My data, it's gone!
EXT4: "Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations."

Solution: WORKS AS DESIGNED

--
Sent from your iPad.

Those who fail to learn the lessons of history.... by morgan_greywolf · 2009-03-19 05:52 · Score: 5, Insightful

FTFA, this is the problem:

Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? Before you get to write any filesystem code, you should have to study how other people have done it, including all the change history. Seriously.

Those who fail to learn the lessons of [change] history are doomed to repeat it.

--
My blog

rename completes before the write by Spazmania · 2009-03-19 05:53 · Score: 5, Insightful

Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

I couldn't disagree more:

When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename(). [...] Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until [up to 60 seconds later].

Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write. It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.
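For reference, the idiom under discussion can be sketched roughly as follows. This is a minimal illustration in Python, not code from any of the affected applications; the function name `replace_without_sync` is made up for the example. Note that nothing in it asks for the data to reach the disk before the rename.

```python
import os
import tempfile


def replace_without_sync(path: str, data: bytes) -> None:
    """Write-to-temp-then-rename with no fsync() (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)  # the data may sit in the page cache for a while
    # The rename is atomic with respect to other processes, but nothing here
    # forces the data blocks to disk before the rename is committed.
    os.replace(tmp_path, path)
```

Whether the data or the rename reaches the disk first is then left entirely to the filesystem.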

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Re:rename completes before the write by Anonymous Coward · 2009-03-19 07:47 · Score: 4, Interesting

behaves precisely as demanded by the POSIX standard

Application developers reasonably expect
Apples and oranges. POSIX != "what app developers reasonably expect".
Of course you have a point insofar as that just pointing to POSIX and saying it's a correct implementation of the spec is not enough, but let's be clear here that one of these things is not like the other.
Re:rename completes before the write by Wodin · 2009-03-19 08:18 · Score: 2, Informative

If power is lost at the right time, the same results would happen.
The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.
No, not "hundredths of a second". Five seconds. Or 30 if you're using laptop mode.
https://bugs.launchpad.net/ubuntu/jaunty/+source/ecryptfs-utils/+bug/317781/comments/54

--
-- Wodin
Re:rename completes before the write by SanityInAnarchy · 2009-03-19 08:23 · Score: 3, Insightful

The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.
And if you fsync'd, the right time would be zero, on either ext3 or ext4. Or XFS, for that matter.

If I fsync after every write, I can get reliability in ext2.
No you can't. Reliability in ext2 would force you to sync not just your file, but whole directory structures -- and even then, you'd only be safe until something else starts writing.

I put up with the performance hit from ext3 and ext4 because I want the reliability in the filesystem instead of having to build it into every part of every application.
Too late.
All the journaling guarantees is that if you lose power, you won't have to fsck -- you'll get a filesystem which is internally consistent. Oh, and it also guarantees that you won't see circular directory entries, or an entire directory falling off the face of the planet, and other nastiness.
Whether it's consistent with respect to your application is completely outside the scope of the FS journaling, and is the responsibility of your application. Put it in a library, use a database, whatever -- but it's not the filesystem's fault that you failed to read the spec, nor is it very smart of you to code to ext3 instead of POSIX.
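A library helper of the kind suggested here might look like the sketch below. It assumes a POSIX system where a directory can be opened and passed to fsync(); `durable_replace` is a hypothetical name, not an existing library function.

```python
import os
import tempfile


def durable_replace(path: str, data: bytes) -> None:
    """Replace path, aiming to leave either the old or the new contents
    after a crash (hypothetical helper, POSIX assumed)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to put the data on disk
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory as well so the rename itself is on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is the one debated elsewhere in this thread: each call blocks until the disk confirms the write.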

--
Don't thank God, thank a doctor!
Re:rename completes before the write by nusuth · 2009-03-19 08:53 · Score: 2, Informative

Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write.
That sounds like a reasonable assumption, but it is certainly not reasonable to write code that depends on it. 60 seconds is an eternity for a computer, but so is a second. Therefore the fact that 60 seconds is much longer than what you would expect has no bearing on the situation. If your applications depend on frequent data writes, they will have exactly the same file zeroing problem regardless of the actual amount of delay. You can't know that a crash will happen at least, say, 0.06 seconds after a write and rename, so you will still be losing files on crashes, only 1000 times less frequently with a 0.06 sec delay instead of 60. Considering how many times the problematic idiom may be used in 0.06 seconds, and how many computers are using Linux, that is still an unacceptable way to write programs.
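The "1000 times" figure is just the ratio of the two windows, under the assumption that a crash is equally likely at any instant. A tiny sketch (the function name is made up for illustration):

```python
def relative_loss_rate(window_ms: int, baseline_window_ms: int) -> float:
    """Ratio of crash exposure for two vulnerable windows.

    Assumes crashes are equally likely at any instant, so the chance of one
    landing inside a window is proportional to the window's length.
    """
    return window_ms / baseline_window_ms


# 60 s versus 0.06 s, expressed in milliseconds to keep the inputs exact.
ratio = relative_loss_rate(60_000, 60)
print(ratio)
```

A shorter window lowers the rate of loss but never brings it to zero, which is the point being made.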
It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.
Ensuring rename happens after write is fundamentally different from not ensuring it but writing data frequently enough that it often happens that way. This is also exactly what has been done with ext3's ordered mode and what is being proposed for fixing ext4.

--
Gentlemen, you can't fight in here, this is the War Room!
Re:rename completes before the write by Yokaze · 2009-03-19 10:42 · Score: 2, Informative

It is not about losing the data of the pending write; it is about losing data already written, because the operations are completed in a different order than they were issued.

--
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"

the workaround is bad design by girlintraining · 2009-03-19 05:54 · Score: 3, Insightful

Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it. The workaround is laughable -- "call fsync(), and then wait(), wait(), wait(), for the Wizard to see you." How about writing a filesystem that actually does journaling in a reliable fashion, instead of finger-pointing after the user loses data due to your snazzy new optimization and saying "The developer did it! It wasn't us, honest." Microsoft does it and we tar and feather them, but the guys making the "latest and greatest" Linux feature we salute?

We let our own off with heinous mistakes while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.

--
#fuckbeta #iamslashdot #dicemustdie

Re:the workaround is bad design by jd · 2009-03-19 06:02 · Score: 5, Funny

But... those of us who learned the Ancient And Most Wise ways always triple-sync. We also sacrifice Peeps and use red food colouring in voodoo ceremonies (hey, it really is blood, so it should work) to keep the hardware from failing.
On next week's Slashdot, there will be a brief tutorial on the right way to burn a Windows CD at the stake, and how to align the standing stones known as RAM Chips to points of astronomical significance.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:the workaround is bad design by morgan_greywolf · 2009-03-19 06:03 · Score: 2, Interesting

No, we don't salute them. If you ask me, no matter what Ted Ts'o says about it complying with the POSIX standard, sorry, but it's a bug if it causes known, popular applications to seriously break, IMHO.
Broken is broken, whether we're talking about Ted Ts'o or Microsoft.

--
My blog
Re:the workaround is bad design by Dan667 · 2009-03-19 06:32 · Score: 2, Insightful

I believe a major difference is that Microsoft would just deny there was a problem at all. If they did acknowledge it, they certainly would not detail what it is.
Re:the workaround is bad design by ManWithIceCream · 2009-03-19 06:47 · Score: 2, Informative

We let our own off with heineous mistakes while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.
Is Ted Ts'o not professional? Does he not get paid? Ts'o's employed by the Linux Foundation, on leave from IBM. Free Software does not mean volunteer-made software!
Re:the workaround is bad design by TheMMaster · 2009-03-19 06:48 · Score: 3, Insightful

Actually, no.
Microsoft runs a proprietary show where they 'set the standard' themselves. Which basically means 'there is no standard except how we do it'.
Linux, however, tries to adhere to standards. When it turns out that something doesn't adhere to standards, it gets fixed.
Another problem is that most users of proprietary software on their proprietary OS don't have the sources to the software they use, so if the OS fixes something that was previously broken, but the software version used is 'no longer supported' the 'fix' in the OS breaks the users' software and the user has no option of fixing his software.
THIS is why a) microsoft can't ever truly fix something and b) why using proprietary software screws over the user.
Or would you rather have OSS software do the same as proprietary software vendors and work around problems forever but never fixing them? Saw that shiny 'run in IE7 mode' button in IE8? that's what you'll get...

--
Fighting for peace is like fucking for virginity
Re:the workaround is bad design by Hatta · 2009-03-19 06:51 · Score: 4, Insightful

If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it.
You must be reading a different slashdot than I am. The popular opinion I see is that this is very bad design. If the spec allows this behavior, it's time to revisit the spec.

--
Give me Classic Slashdot or give me death!
Re:the workaround is bad design by stevied · 2009-03-19 10:10 · Score: 2, Informative

The "workaround" is understanding how the platform you're targeting actually works rather than making guesses. fsync() and even fdatasync() have been around for ages and are documented. *NIX directories have always just been more or less lists of (name,inode_no) tuples, which is why hard links are part of the platform. There isn't really any magical connection between an inode and the directories it happens to be listed in.
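The (name, inode) relationship is easy to observe. The sketch below assumes a POSIX filesystem that supports hard links; the file names and the helper `inode_and_link_count` are made up for the example.

```python
import os
import tempfile


def inode_and_link_count(path: str) -> tuple[int, int]:
    """Return (inode number, number of directory entries pointing at it)."""
    info = os.stat(path)
    return info.st_ino, info.st_nlink


with tempfile.TemporaryDirectory() as workdir:
    first = os.path.join(workdir, "first-name")
    second = os.path.join(workdir, "second-name")
    with open(first, "w") as handle:
        handle.write("same data\n")
    os.link(first, second)  # add a second (name, inode) entry
    first_info = inode_and_link_count(first)
    second_info = inode_and_link_count(second)
    same_inode = first_info[0] == second_info[0]
    link_count = first_info[1]
    print(same_inode, link_count)
```

Both names resolve to the same inode, so a rename only changes a directory entry and says nothing about whether the inode's data blocks are on disk yet.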
Ted knows this stuff inside and out and is almost ridiculously reasonable compared to many people I've met with his level of expertise. The patches to enable the actual workaround were available pretty much at the same time the awareness of this bug hit the mainstream. Given the flak he was taking, the fact that he expressed his opinions about the way some of the userspace software may or may not have been behaving doesn't seem unreasonable.
The answer here is (1) roll out the workaround so nobody is horribly surprised when the latest distros ship with ext4, and (2) for developers to _listen_ to the guy who knows what he's talking about and fix their apps, ideally by providing some standard functions in the GNOME / KDE / etc. libs to handle the common situation, thus allowing the full performance advantages to be extracted from all the hard work that's been put into ext4 (and other file systems.)
There are a relatively small number of people in the world who are worth listening to when they say something. Take a lesson from a guy with a 3 digit UID (sorry to pull rank, but sometimes it has to be done!), and let me tell you that Ted Ts'o is one of them.
Re:the workaround is bad design by ChaosDiscord · 2009-03-19 10:34 · Score: 2, Insightful

"The workaround is laughable -- 'call fsync(), and then wait(), wait(), wait(), for the Wizard to see you.'"
The "workaround" has been the standard for decades! Twenty years ago when I was learning programming I was warned: Until you call fsync(), you have no guarantee that your data has landed on disk. If you want to be sure the data is on the disk, call fsync(). While it's a complication for application developers, the benefit is that it allows filesystem developers to make the filesystem faster. That ext3 in its default configuration happened to work as erroneously expected has always been a happy coincidence, not something to rely on.
You might as well be complaining about the "workaround" that you have to shut down your computer properly instead of yanking the cord out of the wall, since it didn't used to lose data when you did that.

--
Search 2010 Gen Con events

Show some respect! by LotsOfPhil · 2009-03-19 05:56 · Score: 5, Funny

...new solutions have been provided by Ted Ts'o to...

That's General Ts'o to you!

--
This post climbed Mt. Washington.

Re:LOL: Bug Report by Z00L00K · 2009-03-19 05:58 · Score: 4, Insightful

This is the problem with new features - the users have problems using them until they fully understand and appreciate the advantages and disadvantages.

And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2. If you want a really stupid filesystem go FAT and prepare for a patent attack.

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.

I sit just me? by IMarvinTPA · 2009-03-19 06:04 · Score: 2, Insightful

I sit just me, or would you expect that the change would only be committed once the data was written to disk under all circumstances?
To me, it sounds like somebody screwed up a part of the POSIX specification. I should look for the line that says "During a crash, lose the user's recently changed file data and wipe out the old data too."

IMarv

--
Trusting software vendors is no smarter than trus

Re:I sit just me? by Em+Emalb · 2009-03-19 06:07 · Score: 3, Funny

Nope, not just you, I sit also.

--
Sent from your iPad.

Workaround is disaster for laptops by victim · 2009-03-19 06:05 · Score: 5, Insightful

The workaround (flushing everything to disk before the rename) is a disaster for laptops or anything else which might wish to spin down a disk drive.

The write-replace idiom is used when a program is updating a file and can tolerate the update being lost in a crash, but wants either the old or the new to be intact and uncorrupted. The proposed sync solution accomplishes this, but at the cost of spinning up the drive and writing the blocks at each write-replace. How often does your browser update a file while you surf? Every cache entry? Every history entry? What about your music player? Desktop manager? All of these will spin up your disk drive.

Hiding behind POSIX is not the solution. There needs to be a solution that supports write-replace without spinning up the disk drive.

The ext4 people have kindly illuminated the problem. Now it is time to define a solution. Maybe it will be some sort of barrier logic, maybe a new kind of sync syscall. But it needs to be done.

Re:Workaround is disaster for laptops by GMFTatsujin · 2009-03-19 06:20 · Score: 2, Insightful

If the issue is drive spin-up, how have the new generation of flash drives been taken into account? It seems to me that rotational drives are on their way out.
That doesn't do anything for the contemporary generations of laptop, but what would the ramifications be for later ones?
Re:Workaround is disaster for laptops by Kjella · 2009-03-19 06:48 · Score: 5, Informative

Fixed code:
fwrite()
fsync() - sync this file before close
fclose()
rename()
Either you're a troll or an idiot, since you're AC'ing I guess I got trolled. This will sync immediately and kill performance and battery life, since every block must be confirmed written before the process can continue. What you need to fix this is a delayed rename that happens after the delayed write.
Problem:
fwrite()
fclose()
rename()
*ACTUAL RENAME*
*TIME PASSES* <-- crash happens here = lose old file
*ACTUAL WRITE*
Real solution:
fwrite()
fclose()
rename()
*TIME PASSES* <-- crash happens here = keep old file
*ACTUAL WRITE*
*ACTUAL RENAME*
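The two timelines can be played out in a toy model. This is only an in-memory simulation with made-up names ("config", "config.tmp", the step labels), not how ext4 or any real journal works: it applies the first few on-disk commits and reports what "config" contains if the machine dies at that point.

```python
def config_after_crash(commit_order: list[str], crash_after: int) -> str:
    """Contents of 'config' if only the first crash_after commits hit disk."""
    names = {"config": 1}   # directory: name -> inode number
    inodes = {1: "old"}     # inode number -> data
    for step in commit_order[:crash_after]:
        if step == "create":      # temp file exists, no data blocks yet
            names["config.tmp"] = 2
            inodes[2] = ""
        elif step == "data":      # the delayed write finally lands
            inodes[2] = "new"
        elif step == "rename":    # directory entry now points at new inode
            names["config"] = names.pop("config.tmp")
    return inodes[names["config"]]


def possible_outcomes(commit_order: list[str]) -> list[str]:
    """Result for every possible crash point, including no crash at all."""
    return [config_after_crash(commit_order, n)
            for n in range(len(commit_order) + 1)]


problem = possible_outcomes(["create", "rename", "data"])
delayed_rename = possible_outcomes(["create", "data", "rename"])
print(problem)          # ['old', 'old', '', 'new']
print(delayed_rename)   # ['old', 'old', 'old', 'new']
```

Only the first ordering has a crash point that leaves an empty file; with the rename committed after the data, every crash point yields either the old or the new contents.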

--
Live today, because you never know what tomorrow brings
Re:Workaround is disaster for laptops by dshadowwolf · 2009-03-19 07:11 · Score: 3, Informative

And you don't get it... The truth is that Ext4 was writing the journal out before any changes took place. This means that when the crash happens between the metadata write and the actual write a replay of the journal will cause data loss.
Other filesystems with delayed allocation solve this by not writing the journal before the actual data commits happen. The fix that TFA is talking about introduces this to Ext4.
Re:Workaround is disaster for laptops by david_thornley · 2009-03-19 07:15 · Score: 3, Informative

In which case the standard sucks, big time, and finding a loophole that trashes normal expected behavior should not be cause for rejoicing.
There needs to be a way to write a file such that either the old or the new is preserved. Agreed on this?
Now, in a file system that's going to run real well, there needs to be a way to delay writes in order to batch them. Agreed on this?
We have two reasonable demands here. Pick one, because that's all you're going to get.
Currently, in order to keep either the old or new file, it's necessary to write the new file right now. This is the standard behavior, and it trashes performance. Alternatively, the writes can be batched up for later, for good performance, and we run the risk of losing both old and new versions of a file.
In other words, in order to optimize the heck out of the file system, it's necessary to trash the performance.
What we need is a way to do the rewrite-rename thing in a way so it can be safely delayed, so the file system can batch up a lot of writes to do in a really fancy optimized way, but writing the new file fully before renaming it. There's no obvious reason to me why the file system can't keep track of this and guarantee the order. It may not be required by the standard, but that's no excuse for not implementing it.

--
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
Re:Workaround is disaster for laptops by BigBuckHunter · 2009-03-19 07:21 · Score: 2, Informative

There needs to be a solution that supports write-replace without spinning up the disk drive.
How do you intend on writing to the disk drive... without spinning it up? Is this not what you're asking? If this is indeed your question, the answer is already "by using a battery backed cache".

BBH

Dunno by Shivetya · 2009-03-19 06:05 · Score: 4, Insightful

but if you want a write later file system shouldn't it be restricted to hardware that can preserve it?

I understand that doing writes immediately when requested leads to performance degradation but that is why business systems which defer writes to disk only do so when the hardware can guarantee it. In other words, we have a battery backed cache, if the battery is low or nearing end of life the cache is turned off and all writes are made when the data changes.

Trying to make performance gains to overcome limitations of the hardware never wins out.

--
* Winners compare their achievements to their goals, losers compare theirs to that of others.

Re:Dunno by MikeBabcock · 2009-03-19 08:31 · Score: 2, Informative

Without write-back (that's delaying writes until later and keeping them in a cache), you lose elevator sorting. No elevator sorting makes heavy drive usage ridiculously slower than with.
You can't re-sort and organize your disk activity without the ability to delay the data in a pool.
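A toy model of why the pool matters, assuming seek cost is simply the distance between block numbers (real I/O schedulers are far more involved, and the block numbers below are invented):

```python
def seek_distance(start: int, blocks: list[int]) -> int:
    """Total head travel when servicing blocks in the given order."""
    distance, position = 0, start
    for block in blocks:
        distance += abs(block - position)
        position = block
    return distance


def elevator_order(start: int, blocks: list[int]) -> list[int]:
    """One upward sweep from the head position, then one downward sweep."""
    upward = sorted(b for b in blocks if b >= start)
    downward = sorted((b for b in blocks if b < start), reverse=True)
    return upward + downward


pending = [900, 10, 850, 30, 700, 50]   # writes in arrival order
head = 100
fifo_cost = seek_distance(head, pending)
elevator_cost = seek_distance(head, elevator_order(head, pending))
print(fifo_cost, elevator_cost)
```

The reordering is only possible because all six writes are held in memory at once; written one at a time as they arrive, the head has to follow the arrival order.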
The difference between EXT3 and EXT4 is not whether the data gets written immediately -- neither do that. The difference is how long they wait. EXT4 finally gives major power preservation by delaying writes until necessary so my laptop hard drive doesn't spin up for brief moments of unnecessary disk activity all the time.
You want your data written synchronously? Just mount your filesystem with 'sync' and it's all done for you. No problem, no bug.
"mount -o remount,sync /dev/sda1 /" all done.

--
- Michael T. Babcock (Yes, I blog)
Re:Dunno by mr3038 · 2009-03-20 01:04 · Score: 2, Insightful

ext3 is also delaying writes. The bug is that ext4 is not delaying renames to happen after writes. Instead renames happen immediately, and guess what, they spin your hard drive up, then you get to wait 60 seconds until real data starts to be written. Oh and if you lose power or crash during these 60 seconds, you lose all data - new and old. Oh and your common desktop programs do that cycle several times a minute.

Excuse my language, but why the fuck are those "common desktop programs" writing and renaming files several times a minute? I understand that files are written if I change any settings but this is something different. Perhaps there should be some special filesystem that is designed to freeze the whole system for 1 second for every write() any application does. Such filesystem could be used for application testing. That way it would be immediately obvious if any program is writing too much stuff without a good reason.
EXT4 is doing exactly the right thing because it's never actually writing any of those files to the disk. Because those files are constantly replaced with new versions, there's no point trying to save any unless the application asks for it. To do that, the application should call fsync(). Otherwise, the FS has no obligation to write anything in any given order to the disk until the FS is unmounted. A high performance FS with enough cache will not write anything to disk until fsync() unless the CPU and disk have nothing else to do (and even then, only because it probably improves the performance of a possible following fsync() or unmount in the future).

--
_________________________
Spelling and grammar mistakes left as an exercise for the reader.

Re:Those who fail to learn the lessons of history. by Samschnooks · 2009-03-19 06:10 · Score: 2, Insightful

Speaking as someone who has developed OS commercial code (OS/2), I always assumed that the person before me understood what they were doing; because, if you didn't, you were spending all your time researching how the 'wheel' was invented. Also, aside from this very rare occurrence, it is pretty arrogant to think that your predecessors are incompetent or, to be generous, ignorant.

This problem is just something that slipped through the cracks and I'm sure the originator of this bug is kicking himself in the ass for being so "stupid".

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 06:12 · Score: 5, Insightful

Rubbish. Sorry, if the syncs were implicit, app developers would just be demanding a way to turn them off most of the time because they were killing performance.

Re:LOL: Bug Report by von_rick · 2009-03-19 06:15 · Score: 4, Insightful

And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2.

QFT

The filesystem was first released sometime towards the end of December 2008. The Linux distros that incorporated it, gave it as an option, but the default for /root and /home was always EXT3.

In addition, this problem is not a week old like the article states. People have been discussing this problem on forums ever since mid-January, when the benchmarks for EXT4 were published and several people decided to try it out to see how it fares. I have been using EXT4 for my /root partition since January. Fortunately I haven't had any data loss, but if I do end up losing some data, I'd understand that since I have been using a brand new file-system which has not been thoroughly tested by users, nor has it been used on any servers that I know of.

--

Face your daemons!

Re:Those who fail to learn the lessons of history. by dotancohen · 2009-03-19 06:22 · Score: 2, Insightful

Before you get to write any filesystem code, you should have to study how other people have done it...

No. Being innovative means being original, and that means taking new and different paths. Once you have seen somebody else's path, it is difficult to go out on your own original path. That is why there are alpha and beta stages to a project, so that watchful eyes can find the mistakes that you will undoubtedly make, even those that have been made before you.

--
It is dangerous to be right when the government is wrong.

Quick workaround - no patches required by canadiangoose · 2009-03-19 06:32 · Score: 5, Informative

If you mount your ext4 partitions with nodelalloc you should be fine. You will of course no longer benefit from the performance enhancements that delayed allocation brings, but at least you'll have all of your freaking data. I'm running Debian on Linux 2.6.29-rc8-git4, and so far my limited testing has shown this to be very effective.
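To check whether a mount actually carries the option, one could parse /proc/mounts-style text. The sketch below assumes the option shows up there under the name nodelalloc; the sample line and the helper names are made up for illustration.

```python
def mount_options(mounts_text: str, mount_point: str) -> list[str]:
    """Return the option list for mount_point from /proc/mounts-style text."""
    options: list[str] = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry overrides earlier
    return options


def uses_nodelalloc(mounts_text: str, mount_point: str) -> bool:
    """True if the (assumed) nodelalloc option is listed for mount_point."""
    return "nodelalloc" in mount_options(mounts_text, mount_point)


# Hypothetical sample line; on a live system the text would come from
# reading /proc/mounts.
sample = "/dev/sda1 / ext4 rw,relatime,nodelalloc 0 0\n"
print(uses_nodelalloc(sample, "/"))
```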

--
Never eat more than you can lift -- Miss Piggy

POSIX spec is fine, ext4 is flawed by iYk6 · 2009-03-19 06:34 · Score: 2, Informative

Someone above says that the POSIX standard is fine, but that ext4 violates it. Here is his quote:
"When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename("

It seems that ext4 renames the file first, and then writes the file up to 60 seconds later.

No kidding by Sycraft-fu · 2009-03-19 06:36 · Score: 5, Insightful

All the stuff with Ext4 strikes me as amazingly arrogant, and ignorant of the past. The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing. In the case of a file system, that means that it reliably stores data on the drive. So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec" what matters is that it isn't usable.

I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not. Linux/Ext3 doesn't, Windows/NTFS doesn't, OS-X/HFS+ doesn't, Solaris/ZFS doesn't, etc. Well that tells me something. That says that the way they are doing things isn't a good idea. If it is causing problems AND it is something else nobody else does, then probably you ought not do it.

This is just bad design, in my opinion.

Re:No kidding by mr_mischief · 2009-03-19 07:11 · Score: 2, Insightful

It does store data reliably on the drive that has been properly synchronized by the application's author. This data that is lost is what has been sent to a filehandle but not yet synchronized when the system loses power or crashes.
The FS isn't the problem, but it is exposing problems in applications. If you need your FS to be a safety net for such applications, nobody is taking ext3 away just because ext4 is available. If you want the higher performance of ext4, buy a damn UPS already.
Re:No kidding by SIR_Taco · 2009-03-19 07:39 · Score: 2, Insightful

what matters is that the damn thing loses data on a regular basis.
I guess I don't really understand what you mean by regular basis, or maybe you just like feeding quarters into the FUD machine. Maybe you live in a place where power failures are very common and/or you like to randomly hit the reset/power buttons. Or maybe you're just not pedaling hard enough to keep your computer from going into black/brown-out status.
The fact is that you will not lose data on a regular basis unless you have severe power problems. This is a performance boost based on the assumption that power outages and bone-headed users are not the common-place. Take that as you will, and I'm not one to suggest that any distro accept this as their default FS, however, it does have its place and many people welcome it.
Just my two cents.

--
I say don't drink and drive, you might spill your drink. Before you get behind the wheel just stop and think.
Re:No kidding by SanityInAnarchy · 2009-03-19 08:00 · Score: 2, Informative

The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing.
Part of usability is performance. This is a significant performance improvement.

So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec" what matters is that it isn't usable.
The real problem here is that application developers were relying on a hack that happened to work on ext3, but not everywhere else.
Let me ask you this -- should XFS change the way it behaves because of this? EXT4 is doing exactly what XFS has done for decades.

I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss.
No, that's actually precisely what the spec says, with one exception: You can guarantee it to be written to disk by calling fsync.

I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not.
Only because you haven't looked.
In fact, there's a mount option to turn this behavior on in ext3.
The "bad design" goes deeper than that.

--
Don't thank God, thank a doctor!
Re:No kidding by Tacvek · 2009-03-19 08:40 · Score: 4, Informative

I don't think you have it right.
On Ext3 with "data=ordered" (a default mount option), if one writes the file to disk, and then renames the file, ext3 will not allow the rename to take place until after the file has been written to disk.
Therefore if an application that wants to change a file uses the common pattern of writing to a temporary file and then renaming (the renaming is atomic on journaling file systems), if the system crashes at any point, when it reboots the file is guaranteed to be either the old version or the new version.
With Ext4, if you write a file and then rename it, the rename can happen before the write. Thus if the computer crashes between the rename and the write, on reboot the result will be a zero byte file.
The fact that the new version of the file may be lost is not the issue. The issue is that both versions of the file may be lost.
The end result is the write and rename method of ensuring atomic updates to files does not work under Ext4.
A new mount option that forces the rename to come after the data is written to disk is being added. Once that is available, the problem will be gone if you use that mount option. Hopefully it will be made a default mount option.
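
For anyone who has not seen the idiom in code, here is a minimal C sketch of write-to-temporary-then-rename. The function name replace_file, the ".tmp" suffix and the sync_before_rename flag are made up for illustration. With the flag set to 0 this is the version that depends on the filesystem ordering the data ahead of the rename; with it set to 1 the application pays for an fsync() to force that order itself.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` via a temporary file.
 * Returns 0 on success, -1 on error. */
int replace_file(const char *path, const char *data, int sync_before_rename)
{
    char tmp[4096];
    size_t len = strlen(data), off = 0;

    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be partial, so loop until everything is handed over. */
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) { close(fd); unlink(tmp); return -1; }
        off += (size_t)w;
    }

    /* Without this step the data may still be only in the page cache
     * when the rename reaches the journal. */
    if (sync_before_rename && fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }

    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* rename() atomically points the old name at the new file. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}
```

Calling replace_file("settings.conf", text, 0) is roughly what the affected applications were doing; passing 1 instead is the fsync-before-rename variant.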

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
Re:No kidding by JumboMessiah · 2009-03-19 09:00 · Score: 2, Insightful

I just posted in the wrong thread. Synopsis:
I made a lot of money back in the 90's repairing NTFS installs. The similarity between it, back then, and EXT4 is that they are/were young file systems.
Give Ted and company a break. Let him get the damn thing fixed up (I have plenty of faith in Ted). Hell, I even remember losing an EXT3 file system back when it was fresh out of the gate. And I'm sure there's plenty who could say the same for all those you listed, including ZFS.
And your comment about extended data caching. Is your memory short? Remember "laptop mode", specifically setup this way to keep the hard drive from having to spin up...
Re:No kidding by AvitarX · 2009-03-19 15:48 · Score: 3, Informative

But if the application syncs the file, the new data is written to disk.
This wastes time and performance, and for most files is un-needed.
There are not only "important" and "unimportant" files, there are also "typical" files.
We don't want to lose them, but who cares if recent changes are lost.
Take for example a KDE config file. I am willing to risk all changes made to it since boot (I generally leave my computer off at night, so this is 12 or so hours). I do not want to lose all of my changes since install (this is 10,000 hours).
The method of writing a temporary file and then renaming prevents the second from happening (in EXT3, XFS now, ReiserFS now, and soon EXT4) while still allowing for very aggressive write caching.
EXT4 currently allows the second to happen unless a disk write is forced, and forcing the write rules out both scenarios.
The loss of the file already synced to disk potentially years ago is the issue, not the loss of the relatively recent data.
EXT4 has essentially removed the option for having "typical" files, and forces them to be treated as "important".
So every file becomes either one where every change forces a write, or one we don't care about at all (a cache, for example). The typical stuff, where no single change is critical (in the rare event of a crash) but it sure is nice to have something, gets elevated to an "important" file that does all of those bad things you describe and eliminates the ability to cache writes.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg

Re:Shoulders of Giants by Evanisincontrol · 2009-03-19 06:39 · Score: 2, Insightful

Standing on the shoulders of giants is usually the best way to make progress.

Sure, if the only direction you want to go is the direction that the giant is already moving. Doesn't help you get anywhere else, though.

Re:LOL: Bug Report by try_anything · 2009-03-19 06:48 · Score: 5, Insightful

This is the problem with new features - the users have problems using them until they fully understand and appreciate the advantages and disadvantages.

Advantages: Filesystem benchmarks improve. Real performance... I guess that improves, too. Does anybody know?

Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

Ext4 might be great for servers (where crucial data is stored in databases, which are presumably written by storage experts who read the Posix spec), but what is the rationale for using it on the desktop? Ext4 has been coming for years, and everyone assumed it was the natural successor to ext3 for *all* contexts where ext3 is used, including desktops. I hope distros don't start using or recommending ext4 by default until they figure out how to configure it for safe usage on the desktop. (That will happen long before the apps are rewritten.) Filesystem benchmarks be damned.

The odd thing is... by DragonWriter · 2009-03-19 07:00 · Score: 2, Insightful

I'm a hobbyist, and I don't program system level stuff, essentially, at all anymore, but way back when I did do C programming on Linux (~10 years ago), ISTR that this (from Ts'o in TFA) was advice you couldn't go anywhere without getting hit repeatedly over the head with:

if an application wants to ensure that data have actually been written to disk, it must call the function fsync() before closing the file.

Is this really something that is often missed in serious applications?

Re:The odd thing is... by Cassini2 · 2009-03-19 08:17 · Score: 2, Informative

Calling fsync() excessively completely trashes system performance and usability. Essentially, operating systems have write back caches to speed code execution. fsync() disables the write back cache by writing data out immediately, and making your program wait while the flush happens. Modern computers can do activities that involve rapidly touching hundreds of files per second. Forcing each write to use an fsync() slows things down dramatically, and makes for a poor user experience.
To make matters worse, from a technical point of view, it is necessary for strict POSIX compliance to fsync() the file and then fsync() the containing directory. I have never seen a piece of normal application code that fsync()s the containing directory. Even common Linux utilities like rsync and gzip don't use fsync anymore. tar uses fsync in one special case: for file verification before calling ioctl(FDFLUSH). The comment in the tar source is instructive:
    /* Verifying an archive is meant to check if the physical media got it
       correctly, so try to defeat clever in-memory buffering pertaining to
       this particular media.  On Linux, for example, the floppy drive would
       not even be accessed for the whole verification.

       The code was using fsync only when the ioctl is unavailable, but
       Marty Leisner says that the ioctl does not work when not preceded by
       fsync.  So, until we know better, or maybe to please Marty, let's do
       it the unbelievable way :-).  */

    #if HAVE_FSYNC
      fsync (archive);
    #endif
    #ifdef FDFLUSH
      ioctl (archive, FDFLUSH);
    #endif
In general, application writers are interested in making sure the file is readable. Unless you are really determined, and willing to go through the file verification like in the tar command, fsync() does little to guarantee a file will be readable at a later date. Under modern file systems, there are so many reasons why a file may become unreadable, and so few of them are fixed with fsync(), that one has to ask: Why bother with fsync()?
In fact, there are so few good reasons to use fsync(), that many applications have completely given up on fsync(). On Apple Macs running OS X, fsync() does not force the drive to flush its own cache; that takes fcntl() with F_FULLFSYNC. If you run NFS, fsync() will probably flush your data to the network, but not to the hard disk. If you are running a PC with a modern hard drive, the hard drive probably has a write back cache. As such, fsync() doesn't guarantee your data is physically on the disk. fsync() is disabled in laptop mode.
For most applications, using fsync() will only slow down your C code. It is useful for certain applications, like databases. Many other programming languages have no equivalent to fsync(). For most programs, fsync() is an infrequently used call, and is primarily used in special purpose libraries like databases.
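
Since fsync() on the containing directory comes up above and almost never shows up in application code, here is a rough C sketch of what it looks like. fsync_parent_dir is a hypothetical name, and the sketch assumes a Linux-like system where a directory can be opened read-only and handed to fsync().

```c
#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to flush the directory that
 * contains `path`, so the directory entry itself reaches the disk.
 * Returns 0 on success, -1 on error. */
int fsync_parent_dir(const char *path)
{
    char copy[4096];

    if (strlen(path) >= sizeof copy)
        return -1;
    strcpy(copy, path);              /* dirname() may modify its argument */

    int dfd = open(dirname(copy), O_RDONLY);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd);             /* flush the directory, not the file */
    close(dfd);
    return rc;
}
```

An application would call this after fsync()ing and closing the file itself; the two calls together are the "file, then directory" sequence described above.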

Bad POSIX by Skapare · 2009-03-19 07:01 · Score: 4, Interesting

Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

If that is true, then to the extent that is true, POSIX is "broken". Related changes to a file system really need to take place in an orderly way. Creating a file, writing its data, and renaming it, are related. Letting the latter change persist while the former change is lost, is just wrong. Does POSIX really require this behavior, or just allow it? If it requires it, then IMHO, POSIX is indeed broken. And if POSIX is broken, then companies like Microsoft are vindicated in their non-conformance.

--
now we need to go OSS in diesel cars

Re:LOL: Bug Report by causality · 2009-03-19 07:04 · Score: 3, Interesting

Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

For those of us who are not so familiar with the data loss issues surrounding EXT4, can someone please explain this? The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?" I.e. if I ask OpenOffice to save a file, it should do that the exact same way whether I ask it to save that file to an ext2 partition, an ext3 partition, a reiserfs partition, etc. What would make ext4 an exception? Isn't abstraction of lower-level filesystem details a good thing?

--
It is a miracle that curiosity survives formal education. - Einstein

voting by Skapare · 2009-03-19 07:06 · Score: 3, Funny

So is this why we can't have voting (where correctness is paramount over performance) systems developed on Linux?

--
now we need to go OSS in diesel cars

Re:LOL: Bug Report by larry+bagina · 2009-03-19 07:08 · Score: 2, Informative

My one experience with XFS involved the partition being corrupted beyond recoverability within 15 minutes. Too bad, in theory XFS is great.

Anyhow, ZFS is raid, lvm, and fs rolled up into one, so keeping the patch up to date with linux changes could be a bit of work.

--
Do you even lift?

These aren't the 'roids you're looking for.

Re:LOL: Bug Report by shentino · 2009-03-19 07:08 · Score: 2, Insightful

Ext4 is still alpha-ish, and declared as such.

Any *user* who trusts production data to an experimental filesystem is already too stupid to have the right to gripe about losing said data.

Easier Fix by maz2331 · 2009-03-19 07:11 · Score: 3, Insightful

Why not just make the actual "flushing" process work primarily on memory cache data - including any "renames", "deletes", etc.?

If any "writes" are pending, then the other operations should be done in the strict order in which they were requested. There should be no pattern possible where cache and file metadata can be out of sync with one another.

Re:LOL: Bug Report by swillden · 2009-03-19 07:24 · Score: 5, Interesting

The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.

Re:LOL: Bug Report by PitaBred · 2009-03-19 07:25 · Score: 2, Informative

Basically, the spec was written one way, but the actual behavior was slightly different. Even though the standard didn't guarantee something to be written, most filesystems did it anyway. When EXT4 didn't write things immediately to improve performance, the applications that depended on filesystems writing data ASAP (even though it wasn't required behavior) started risking data loss in case of a crash and data not being written explicitly.
The mechanism (fsync) has been around for ages, it's just that most apps didn't use it when they should because there wasn't a "need" to until EXT4, and other systems like XFS which are less popular and tend to be run by people who know what behavior to expect.

--
My blog. Good stuff (when I remember to update it). Read it.

Re:Those who fail to learn the lessons of history. by ChienAndalu · 2009-03-19 07:33 · Score: 2, Interesting

Ext4 *is* better, and probably because it benefits from the wiggle room provided by the specifications. The question is if you accept the tradeoff between performance and security. I choose performance, because my system doesn't crash that often.

POSIX by 200_success · 2009-03-19 07:41 · Score: 3, Insightful

If I had wanted POSIX-compliant behavior, I could have gotten Windows NT! (Windows was just POSIX-compliant enough to be certified, but the POSIX implementation was so half-assed that it was unusable in practice.) Just because Ext4 complies with the minimum requirements of the spec doesn't make it right, especially if it trashes your data.

A bad design that it is used everywhere by diegocgteleline.es · 2009-03-19 07:46 · Score: 5, Informative

"No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

It turns out that all the modern operating systems work exactly like that. In ALL of them you need to use explicit synchronization (fsync and friends) to get a notification that your data has really been written to disk (and that's all you get, a notification, because the system could oops before fsync finishes). You can also mount your filesystem as "sync", which sucks.

Journaling and COW/transaction-based filesystems like ZFS only guarantee the integrity, not that your data is safe. It turns out that Ext3 has the same problem, it's just that the window is smaller (5 seconds). And I wouldn't bet that HFS and ZFS don't have the same problem (btrfs is COW and transaction based, like ZFS, and has the same problem).

Welcome to the real world...

Re:A bad design that it is used everywhere by Tacvek · 2009-03-19 08:52 · Score: 5, Informative

The Ext3 5 seconds thing is true, but that is not the important difference.
On Ext3, with the default mount options, if one writes a file to disk, and then renames the file, the write is guaranteed to come before the rename. This can be used to ensure atomic updates to files, by writing a temporary copy of the file with the desired changes, and then renaming the file.
On Ext4, if one writes a file to the disk, and then renames the file, the rename can happen first. The result of this is that it is not possible to ensure atomic updates to files unless one uses fsync between the writing and the renaming. However, that would hurt performance, since fsync will force the file to be committed to disk right now, when all that is really important is that it is committed to disk before the rename is.
Thankfully the Ext4 module will be gaining a new mount option that will ensure that a file is written to disk before the renaming occurs. This mount option should have no real impact on performance, but will ensure the atomic update idiom that works on Ext3 will also work on Ext4.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
Re:A bad design that it is used everywhere by mbessey · 2009-03-19 10:49 · Score: 2, Insightful

There's a ton of software out there that uses the "write to new file with temporary name, then rename it to the final name" pattern, much of it written before Ext4 (or Ext3, or Ext) was designed, and rather a lot of it written before most of the folks on the Linux Kernel mailing list were even out of elementary school. This is a well-established method for reliably updating files, and it works, or fails gracefully, on almost every filesystem implementation from 1976 to the present day - except for Ext4.
Claiming that otherwise-portable software ought to include Linux-specific (not to mention Ext4-specific!) code to avoid massive data loss seems a bit backward.

Re:LOL: Bug Report by causality · 2009-03-19 07:46 · Score: 4, Insightful

The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

Thanks for explaining that. In that case, I salute Mr. Ts'o and others for telling the truth and not caving in to pressure when they are in fact correctly following the specification. Too often people who are correct don't have the fortitude to value that more than immediate convenience, so this is a refreshing thing to see. Perhaps this will become the sort of history with which developers are expected to be familiar.

I imagine it will take a lot of work but at least with Free Software this can be fixed. That's definitely what should happen, anyway. There are times when things just go wrong no matter how correct your effort was; in those cases, it makes sense to just deal with the problem in the most hassle-free manner possible. This, however, is not one of those times. Thinking that you can selectively adhere to a standard and then claim that you are compliant with that standard is just the sort of thing that really should cause problems. Correcting the applications that made faulty assumptions is therefore the right way to deal with this, daunting and inconvenient though that may be.

Removing this delayed-allocation feature from ext4 or placing limits on it that are not required by the POSIX standard is definitely the wrong way to deal with this. To do so would surely invite more of the same. It would only encourage developers to believe that the standards aren't really important, that they'll just be "bailed out" if they fail to implement them. You don't need any sort of programming or system design expertise to understand that, just an understanding of how human beings operate and what they do with precedents that are set.

--
It is a miracle that curiosity survives formal education. - Einstein

Re:LOL: Bug Report by ijakings · 2009-03-19 07:46 · Score: 4, Funny

Microsoft Patent

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 07:48 · Score: 2, Insightful

You have a separate partition for /root ? How large can the home folder of the root user be?

Re:Those who fail to learn the lessons of history. by noidentity · 2009-03-19 07:49 · Score: 2, Funny

Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. [...] And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? [...] Those who fail to learn the lessons of [change] history are doomed to repeat it.

They tried to, but history was just a 0-byte file.

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 07:53 · Score: 4, Insightful

1) Modern filesystems are expected to behave better than POSIX demands.

2) POSIX does not cover what should happen in a system crash at all.

3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

4) fsync is not a solution, because it forces the operation to complete *now*, which hurts write performance, cache coherence, laptop battery life and SSD wear, among other things.

We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

Re:Those who fail to learn the lessons of history. by AigariusDebian · 2009-03-19 07:58 · Score: 2, Insightful

A few percent performance difference will be easily wiped away when the filesystem erases an important file that one time a year when a snowstorm knocks your power out.

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 08:13 · Score: 2, Insightful

Absolutely correct.
And that's the way it should be done.
Stability by default, increased performance by request.
Let's be realistic: how many applications benefit from this delayed write? Not many, I guess. Now, on the other hand, if you have an extremely I/O heavy app, disable the auto syncs and do it manually.

Re:LOL: Bug Report by MikeBabcock · 2009-03-19 08:15 · Score: 4, Interesting

The POSIX standard is just fine. The problem is application assumptions that aren't up to snuff.

Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, he calls fsync.

If you don't, you risk losing data. Plain and simple.

--
- Michael T. Babcock (Yes, I blog)

Re:LOL: Bug Report by causality · 2009-03-19 08:16 · Score: 4, Interesting

So, in principle, the filesystem could just throw away the data unless the application explicitly calls a fsync ?
This seems to be a slightly bit of...hmmm....stupid ?

From the explanations I received and some reading I've done, I don't think the data is just getting "thrown away" so that isn't really a valid question. The issue seems to be that unless fsync is called, the changes requested by the application may happen in a sequence that is other than what the application programmer expected. The example I saw in this discussion involved first writing data to a file and then renaming it soon afterwards. If I understand this correctly, the application is assuming that the rename cannot possibly happen before the writing of the data is done even though the specification has no such requirement. If the application needs this to happen in the order in which it was requested, it needs to write the data, then call fsync, then rename the file. You could probably fill a library with what I don't know about low-level filesystem details, so please correct me if I have misunderstood this.

The example I found in the Wikipedia entry on ext4 was different. That one involved data loss because the application updates/overwrites an existing file and does not call fsync and then the system crashes. The Wiki article states that this leads to undefined behavior (which, afaik, is correct per the spec). The article also states that a typical result is that the file was set to zero-length in preparation for being overwritten but because of the crash, the new data was never written so it remains zero-length, causing the loss of the old version of the file. Under ext3 you would usually find either the old version of the file or the new version.

What I don't understand and hope that a more knowledgeable person could explain is why this can't be done a slightly different way. This is where I can apply reason to come up with something that sounds preferable to me but I simply don't have the background knowledge of filesystems to understand the "why". If the overwrite of the file is delayed, why isn't the truncation of the file to zero-length also delayed? That is, instead of doing it this way:

Step 1: Truncate file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data for performance reasons.
Step 3: After the delay has elapsed, actually write the data to the disk.

Why can't it be done this way instead?

Step 1: Delay the truncation of the file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data.
Step 3: After the delay has elapsed, set the file length to zero and immediately write the new data, as a single operation if that is possible, or as one operation immediately followed by the other.

That way if there is a crash, you'd still get either the old version or the new one and not a zero-length file where data used to be. The only disadvantage I can see is that this might continue to enable developers to make assumptions that are not found in the standard because the buggy behavior ext4 is now exposing may continue to work. If there's no technical reason why it cannot be done that way, perhaps the bad precedent alone is a good reason to either not handle it this way or to change the spec.
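
For reference, the overwrite-in-place pattern from the Wikipedia example looks roughly like this in C. overwrite_in_place is a hypothetical name, and the size_after_truncate out-parameter exists only to make the zero-length window visible. Nothing here reproduces a crash; it just shows that the old contents are already gone before the first write() is issued.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite `path` in place, the risky way.
 * Reports the file size seen right after the truncating open().
 * Returns 0 on success, -1 on error. */
int overwrite_in_place(const char *path, const char *data,
                       off_t *size_after_truncate)
{
    /* O_TRUNC discards the old contents as part of open(). */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    *size_after_truncate = st.st_size;   /* 0: the old data is gone */

    size_t len = strlen(data), off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) { close(fd); return -1; }
        off += (size_t)w;
    }

    /* No fsync() here: the new data may sit in cache for a while,
     * which is the window the discussion above is about. */
    return close(fd);
}
```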

--
It is a miracle that curiosity survives formal education. - Einstein

Re:LOL: Bug Report by MikeBabcock · 2009-03-19 08:20 · Score: 2, Informative

You don't risk any data loss, ever, if you shut down your system properly. The system will sync the data to disk as expected and everything will be peachy. You risk data loss if you lose power or otherwise shut down at an inopportune time and the data hasn't been sync'd to disk yet.

That is to say, 99% of people who use their computers properly won't have a problem.

Also note, the software you use should be doing something like:

loop: write some data, write some more data, finish writing data, fsync the data.

The problem here is that the program is doing the "writing" part and because of how caching and delayed writes work (without which, your computer would crawl), the data isn't written to disk _yet_ but will be, eventually.

Old software assumed the data would be written soon. With Ext4 it's possible it won't be written until much much later for performance and power benefits.

PS you can just open a terminal window and type "sync" at any time to flush the data to disk on your system. I'm sure someone could write a tray icon that does the same in 30 seconds.

--
- Michael T. Babcock (Yes, I blog)

Re:LOL: Bug Report by zenyu · 2009-03-19 08:47 · Score: 4, Informative

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

Yup, and the problem has existed with KDE startup for years. I remember the startup files getting trashed when Mandrake first came out and I tried KDE for long enough to get hooked, and it's happened to me a few times a year ever since with every filesystem I've used. I just make my own backups of the .kde directory and fix this manually when it happens. I'm pretty good at this restore by now. Hopefully this bug in KDE will get fixed now that it is causing the KDE project such great embarrassment. I had a silent wish Ts'o would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.

PS I use a lot of experimental graphics drivers for work, hence lockups during startup are common enough that I probably see this KDE bug more than most KDE users. But they really violate every rule of using config files: 1st. open with minimum permission needed, in this case read only, unless a write is absolutely necessary. 2nd. only update a file when it needs updating. 3rd. when updating a config file make a copy, commit it to disk, and then replace the original, making sure file permissions and ownership are unchanged, then commit the rename if necessary.
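
A rough C sketch of the third rule, for the curious. update_config and the ".new" suffix are made-up names, and the final "commit the rename" step (an fsync() on the containing directory) is left out to keep it short.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: replace an existing config file with `data`,
 * keeping its permissions and ownership. Returns 0 or -1. */
int update_config(const char *path, const char *data)
{
    char tmp[4096];
    struct stat st;

    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    /* Remember the original's permissions and ownership. */
    if (stat(path, &st) != 0)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(data), off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0)
            goto fail;
        off += (size_t)w;
    }

    if (fchmod(fd, st.st_mode & 07777) != 0) goto fail;
    if (fchown(fd, st.st_uid, st.st_gid) != 0) goto fail;
    if (fsync(fd) != 0) goto fail;       /* commit the copy to disk */
    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* Only now replace the original. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```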

PS2 Those computer users saying an fsync will kill performance need to get a cluebat applied to them by the nearest programmer. 1st. There will be no fsyncs of config files at startup once the KDE startup is fixed. 2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be unnoticeable when you apply a settings change. 3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.

Re:LOL: Bug Report by DragonWriter · 2009-03-19 08:47 · Score: 3, Informative

Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

It's a fairly typical way of trying to achieve something loosely approximating transactional behavior with respect to updates to the file in question without relying on transactional file system semantics.

Sounds like they need to talk to Kirk McKusick by argent · 2009-03-19 08:58 · Score: 4, Informative

Kirk McKusick spent a lot of time working out the right order to write metadata and file data in FFS and the resulting file system, FFS with Soft Updates, gets high performance and high reliability... even after a crash.

Re:LOL: Bug Report by ultranova · 2009-03-19 09:15 · Score: 4, Insightful

Solution: an update to the code to behave as idiot application programmers require with a simple mount option.

The application programmers aren't at fault here, the POSIX spec is. A filesystem is essentially a hierarchical database, yet POSIX doesn't include a way to make atomic updates to it. The only tool provided is fsync, which kills performance if used. And even with fsync some things - such as rewriting a configuration file - are either outright impossible or complex and fragile.

The real solution is to come up with a transactional API for filesystem. Until that's done, problems like this will persist. Calling fsync - which forces a disk write - or playing around with temporary files isn't reasonable when all you want to do is make sure that the file will be updated properly or left alone.

The alternative is to have every program call fsync constantly, which not only kills performance, but ironically enough also negates some of Ext4's advantages, such as delayed block allocation, since it essentially disables write caching. And it doesn't work if you are doing more complex things, such as, say, mass renaming files in a directory; you have no way of ensuring that either they are all renamed, or none are.

--

Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

Bollocks by Colin+Smith · 2009-03-19 09:30 · Score: 2, Interesting

A filesystem is not a Database Management System. Its purpose is to store files. If you want transactions, use a DBMS. There are plenty out there which use fsync correctly. Try SQLite.

--
Deleted

Re:LOL: Bug Report by blazerw · 2009-03-19 09:31 · Score: 5, Insightful

1) Modern filesystems are expected to behave better than POSIX demands.

2) POSIX does not cover what should happen in a system crash at all.

3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

4) fsync is not a solution, because it forces the operation to complete *now*, which hurts write performance, cache coherence, laptop battery life and SSD wear, among other things.

We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.
2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.
3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.
4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.

We don't need a filesystem that sledgehammers each and every byte of data to the hard drive just in case there is a crash. What we DO need is a filesystem that can flexibly handle important data when told it is important, and less important data very efficiently.

What you are asking is that the filesystem be some kind of sentient all knowing being that can tell when data is important or not and then can write important data immediately and non-important data efficiently. I think that it is a little better to have the application be the one that knows when it's dealing with important data or not.

Re:Those who fail to learn the lessons of history. by tkinnun0 · 2009-03-19 09:34 · Score: 2, Interesting

If the filesystem is a few percent faster but then your disk sits idle half of the time and then you have a crash and lose a file that takes two hours to recreate, have you actually gained any performance?

Re:LOL: Bug Report by Foolhardy · 2009-03-19 09:42 · Score: 2, Insightful

It sounds like the correct solution is for the file system to implement transactional semantics. That is what the applications need and were incidentally getting, despite it not being in the spec.

Why isn't this being considered as the solution? Other major OSes have implemented basic atomic transactions in their filesystems successfully, so why not Linux?

Re:LOL: Bug Report by grumbel · 2009-03-19 10:08 · Score: 3, Informative

3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.

You completely missed the point. The new data isn't important; it could be lost and nobody would care. The troublesome part is that you lose the old data too. If you lost the last 5 minutes of changes in your KDE config, that would be a non-issue. What happens instead is that you lose not just the last few changes but your complete config: it ends up as 0-byte files, which is a state the filesystem never had.

Re:LOL: Bug Report by somenickname · 2009-03-19 10:17 · Score: 4, Insightful

fsyncs have nasty side effects other than performance. For example, in Firefox 3, places.sqlite is fsynced after every page is loaded. For a laptop user, this behavior is unacceptable as it prevents the disks from staying spun down (not to mention the infuriating whine it creates to spin the disk up after every or nearly every page load). The use of fsync in Firefox 3 has actually caused some people (myself included) to mount ~/.mozilla as tmpfs and just write a cron job to write changed files back to disk once every 10 minutes.

So, while I'm all for applications using fsync when it's really needed, the last thing I'd like to see is every application on the planet sprinkling its code with fsync "just to be sure".

Re:LOL: Bug Report by Cassini2 · 2009-03-19 10:22 · Score: 4, Interesting

PS2 Those computer users saying an fsync will kill performance need to get a cluebat applied to them by the nearest programmer.
1st. There will be no fsyncs of config files at startup once the KDE startup is fixed.

KDE isn't fixed right now. Additionally, KDE is not the only application that generates lots of write activity. I work with real-time systems, and write performance on data collection systems is important.

2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be unnoticeable when you apply a settings change.

I did some benchmarks on the ext3 file system, the ext4 system without the patch, and the ext4 system with the patch. Code that followed the open(), write(), close() sequence was 76% faster than the code with fsync(). Code that followed the open(), write(), close(), rename() sequence was 28% faster than code that followed the open(), write(), fsync(), close(), rename() sequence. Additionally, the benchmarks were not significantly affected by which file system was used (ext3, ext4, or ext4 patched). You can find the spreadsheet and the discussion on Launchpad.

3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.

Major Linux file backup utilities like tar, gzip, and rsync don't use fsync as part of normal operations. Of the three, only tar uses fsync at all, and only when verifying that data is physically written to disk. In that situation, it writes the data, calls fsync, calls ioctl(FDFLUSH), and then reads the data back. Strictly speaking, that is the only way to make sure the file is written to disk and is readable.

Finally, as Theodore Ts'o has pointed out, if you really want to make sure the file is saved to disk, you have to fsync() the directory too. I have never seen anyone do that as part of a normal file save. Most C programming textbooks simply have fopen, fwrite, fclose as the recommended way to save files. Calling fsync this often is unusual for most C programmers.

I would hate to be in your programming class. You're enforcing programming standards that aren't followed by key Linux utilities, aren't in most textbooks, and aren't portable to non-Linux file systems.

If you require your students to fsync() the file and the directory as part of a normal assignment, you are requiring them to do things that aren't done by any Linux utility out there. Further, if you are that paranoid, you'd better follow the example from the tar utility and, after the fsync completes, read all the data back to verify it was successfully written.
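For reference, the directory fsync mentioned above is short to write. This is only a sketch: fsync_dir is a made-up helper name, and it assumes a filesystem that accepts fsync() on a directory descriptor (most Linux filesystems do, but that is not guaranteed everywhere).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush pending directory-entry changes in `dirpath`
 * (a newly created file, a rename) to storage.
 * Returns 0 on success, -1 on error. */
int fsync_dir(const char *dirpath)
{
    /* A directory can be opened read-only and handed to fsync(). */
    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would fsync() the file itself first, then call fsync_dir() on the directory that contains it.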

Re:LOL: Bug Report by ChaosDiscord · 2009-03-19 10:23 · Score: 2, Insightful

Glossing over some details, what is happening is closer to this:

The goal is to replace config with a new version. The programmer is essentially doing this:

1. Create config.new. (Should be empty, because it's new)
2. Write the new contents into config.new
3. Move config.new onto config

The goal is that when you replace config, you're replacing it with a guaranteed complete version, config.new. Assuming it happens in this order (and that step 3 is atomic; it happens or doesn't, never partially) if you crash midway through, you'll either end up with the old config or the new config, but never a partial config. Unfortunately the operating system tries to speed things up, and for a variety of good reasons delaying step 2 makes sense. Doing so is allowed by the standards specifically for these good reasons. So what actually happens is this:

1. Create config.new. (Should be empty, because it's new)
3. Move config.new onto config
2. Write the new contents into config.new (which is actually config now, so it works)

This works fine... unless something happens between steps 3 and 2. If we stop there, we have a new, empty file in place of "config." With ext4, the window between 3 and 2 could be as long as a minute, a window during which you can lose data.

The correct solution is for the program, not the operating system, to take care with files it cares about:

1. Create config.new. (Should be empty, because it's new)
2a. Write the new contents into config.new
2b. Wait until the contents are on disk. ("fsync")
3. Move config.new onto config

Now it's not possible to move 2a after 3, so you're guaranteed safe behavior. But you lose the speed benefits of reordering. For data you care about, this is a good idea. For data you don't care about (Your web browser cache leaps to mind), it's overkill and makes you slower.

ext3 (and the new ext4 option) essentially adds 2b automatically. It's good in that it's safer for everyone involved, but it's bad in that everyone takes a speed hit, even in cases where speed is more important than safety.
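In C, the 1 / 2a / 2b / 3 sequence above might look roughly like this. It is a minimal sketch, not anyone's actual implementation: replace_file and the "<path>.new" temp name are assumptions made for illustration, and it leaves out the directory fsync.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `data`.
 * Returns 0 on success, -1 on error (the old file is left untouched). */
int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    /* Assumed temp name: "<path>.new", in the same directory so that
     * rename() stays within one filesystem. */
    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    /* 1. Create config.new (empty). */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* 2a. Write the new contents, retrying on short writes and EINTR. */
    size_t off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        off += (size_t)w;
    }

    /* 2b. Wait until the contents are on disk. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 3. Move config.new onto config. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Dropping the fsync() block gives the version most applications ship, which is the one that can leave an empty file behind after a crash.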

--
Search 2010 Gen Con events

Re:LOL: Bug Report by spitzak · 2009-03-19 10:23 · Score: 2, Informative

Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

YES!!!!!!

Re:LOL: Bug Report by spitzak · 2009-03-19 10:30 · Score: 4, Insightful

You don't understand the problem.

You are wrong when you say EXT3 has this problem. It does not have it. If the EXT3 system crashes during those 5 seconds, you either get the old file or the new one. For EXT4, if it crashes, you can get a zero-length file, with both the old and new data lost.

The long delay is irrelevant and is confusing people about this bug. In fact the long delay is very nice in EXT4 as it means it is much more efficient and will use less power. I don't really mind if a crash during this time means I lose the new version of a file. But PLEASE don't lose the old one as well!!! That is inexcusable, and I don't care if the delay is .1 second.

Re:LOL: Bug Report by spitzak · 2009-03-19 10:41 · Score: 2, Informative

ARRGH! This has nothing to do with the data being written "soon".

The problem with EXT4 is that people expect the data to be written before the rename!

Fsync() is not the solution. We don't want it written now. It is ok if the data and rename are delayed until next week, as long as the rename happens after the data is in the file!

Re:LOL: Bug Report by Sparr0 · 2009-03-19 11:33 · Score: 2, Informative

No, both of those are, implicitly, expected to be world readable, and at least usually for software that any user can run (to some degree of success). /root is the only place for root to put a local application (or any other files) that he doesn't want a user to be able to see at all.

Re:LOL: Bug Report by spitzak · 2009-03-19 11:49 · Score: 3, Interesting

Yes I would like that as well. It would remove the annoying need to figure out a temp filename and to do the rename.

One suggestion was to add a new flag to open. I think it might also work to change O_CREAT|O_TRUNC|O_WRONLY to work this way, as I believe this behavior is exactly what any program using that is assuming.

f = creat(filename) would result in an open file that is completely hidden to any process. Anybody else attempting to open filename will either get the old file or no file. This should be easy to implement as the result should be similar to unlinking an already-opened file.

close(f) would then atomically rename the hidden file to filename. Anything that already has filename open would keep seeing the old file, anything that opens it afterwards will see the new file.

If the program crashes without closing the file then the hidden file goes away with no side effects. It might also be useful to have a call that does this, so a program could abandon a write. Not sure what call to use for that.

Calling fsync(f) would act like close() and force the rename, so after fsync it is exactly like current creat().
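Until something like that exists in the kernel, the idea can only be approximated in user space. A rough sketch, with every name here (struct hidden_file, hidden_create, hidden_commit, hidden_abandon) invented for illustration: the temp file is not really hidden, it is not cleaned up if the program crashes, and without an fsync() before the commit nothing here forces the data to reach disk before the rename.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical handle for a file that is kept aside until committed. */
struct hidden_file {
    int  fd;          /* descriptor for the temporary file */
    char tmp[4096];   /* its temporary name */
    char dest[4096];  /* the name it takes on commit */
};

/* Stand-in for the proposed creat(): start a replacement for `path`. */
int hidden_create(struct hidden_file *h, const char *path)
{
    int n = snprintf(h->dest, sizeof h->dest, "%s", path);
    if (n < 0 || (size_t)n >= sizeof h->dest)
        return -1;
    n = snprintf(h->tmp, sizeof h->tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof h->tmp)
        return -1;
    /* mkstemp() fills in the XXXXXX and creates the file with mode 0600,
     * so the committed file ends up with those permissions. */
    h->fd = mkstemp(h->tmp);
    return h->fd < 0 ? -1 : 0;
}

/* Stand-in for the proposed close(): atomically take over the name. */
int hidden_commit(struct hidden_file *h)
{
    if (close(h->fd) != 0) {
        unlink(h->tmp);
        return -1;
    }
    if (rename(h->tmp, h->dest) != 0) {
        unlink(h->tmp);
        return -1;
    }
    return 0;
}

/* The "abandon a write" call: drop the new contents, keep the old file. */
int hidden_abandon(struct hidden_file *h)
{
    close(h->fd);
    return unlink(h->tmp);
}
```

Between hidden_create() and hidden_commit() the program writes to h.fd as usual; anyone opening the destination name in the meantime still gets the old file.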

Re:LOL: Bug Report by Eskarel · 2009-03-19 12:34 · Score: 3, Insightful

This is actually even stupider for flash drives. There is essentially zero seek time on a flash drive, so, in theory, it shouldn't really matter how much you write at any given time (since the only delay should be how long it takes to actually write the cell).

In addition, presuming reasonable wear algorithms (which should be implemented in the device controller, not in any sort of software), every bit of math I've seen says that for any realistic amount of data writes the flash drives will last substantially longer than any current physical drives (last I saw it was about 30 years if you wrote every sector on the disk once a day, scaling down as writes increase). Even writing 6 times the volume of the drive per day, that's 5 years, which is a fairly long time for consumer grade physical drives, and unlike a physical drive, even if you can't read it, you can write it so you can just clone it over to a new drive.
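The arithmetic behind those figures is simple enough to write down. A sketch only: drive_lifetime_years is a made-up name, it assumes perfect wear leveling, and the write budget of 30 * 365 full-drive writes is just what the "30 years at once a day" figure implies, not a number from any datasheet.

```c
/* Hypothetical helper: years until a drive's write budget is used up.
 * `full_writes` is how many times the whole drive can be rewritten;
 * `per_day` is how many full-drive volumes are written each day. */
double drive_lifetime_years(double full_writes, double per_day)
{
    return full_writes / per_day / 365.0;
}
```

With a budget of 30 * 365 = 10950 full writes, one drive volume per day gives 30 years and six per day gives 10950 / 6 / 365 = 5 years.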

File systems will definitely have to change for flash drives, but delaying writes probably isn't going to be the way to do it, especially since there's no need to do so.

Workaround patches already in Fedora and Ubuntu by tytso · 2009-03-19 15:04 · Score: 4, Informative

It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.

Patches to work around buggy applications which don't call fsync() have been around since long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.

And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc ---- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.

If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.

Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.

Re:LOL: Bug Report by DiegoBravo · 2009-03-19 17:52 · Score: 2, Insightful

For many (most?) Unix admins, /root is just a nicer way to specify "/ filesystem" or "root filesystem". The path /root for root user's home directory is popular in Linux, but I never saw it in the Unixes I've used (but I don't know if that custom is a Linux invention.)

Re:LOL: Bug Report by Eskarel · 2009-03-19 20:53 · Score: 3, Informative

I did flip read and write, long day.

The problem is/was in the EXT3 in the first place! by mr3038 · 2009-03-20 00:08 · Score: 2, Informative

POSIX specifies that closing a file does not force it to permanent storage. To get that, you MUST call fsync().

So the required code to write a new file safely is:

fp = fopen(...)
fwrite(..., fp)
fflush(fp)
fsync(fileno(fp))
fclose(fp)

There is no performance problem because fsync() syncs only the requested file. However, that's in theory... use EXT3 and you'll quickly learn that fsync() is only able to sync the whole filesystem - it doesn't matter which file you ask it to sync, it will always sync the whole filesystem! Obviously that is going to be really slow.

Because of this, way too many software developers have dropped the fsync() call to make the software usable (that is, not too slow) with EXT3. The correct fix is to change all the broken software and in the process that will make EXT3 unusable because of slow performance. After that EXT3 will be fixed or it will be abandoned. An alternative choice is to use fdatasync() instead of fsync() if the features of fdatasync() are enough. If I've understood correctly, EXT3 is able to do fdatasync() with acceptable performance.

If any piece of software is writing to disk without using either fsync() or fdatasync() it's basically telling the system: the file I'm writing is not important, try to store it if you don't have better things to do.
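Spelled out as compilable C, the sequence above needs two extra details: stdio keeps its own buffer, so fflush() has to come first, and fsync()/fdatasync() take a file descriptor, so the FILE pointer goes through fileno(). A minimal sketch, with save_text as a made-up helper name; fdatasync() is assumed to be available (it is on Linux), and swapping in fsync() works the same way.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to `path` through stdio and force it
 * to storage before returning. Returns 0 on success, -1 on error. */
int save_text(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    int ok = 1;
    if (fputs(text, fp) == EOF)
        ok = 0;
    /* Move stdio's user-space buffer into the kernel. */
    if (ok && fflush(fp) != 0)
        ok = 0;
    /* Ask the kernel to push the file data to the device. */
    if (ok && fdatasync(fileno(fp)) != 0)
        ok = 0;
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Note that this overwrites the file in place, so it makes the new contents durable but does nothing for the old-or-new guarantee discussed elsewhere in the thread.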

--
_________________________
Spelling and grammar mistakes left as an exercise for the reader.
