Ext4 Data Losses Explained, Worked Around

LOL: Bug Report by Em+Emalb · 2009-03-19 05:50 · Score: 5, Funny

User: My data, it's gone!
EXT4: "Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations."

Solution: WORKS AS DESIGNED

--
Sent from your iPad.

Those who fail to learn the lessons of history.... by morgan_greywolf · 2009-03-19 05:52 · Score: 5, Insightful

FTFA, this is the problem:

Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? Before you get to write any filesystem code, you should have to study how other people have done it, including all the change history. Seriously.

Those who fail to learn the lessons of [change] history are doomed to repeat it.

--
My blog

rename completes before the write by Spazmania · 2009-03-19 05:53 · Score: 5, Insightful

Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

I couldn't disagree more:

When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename(). [...] Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until [up to 60 seconds later].

Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write. It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Re:rename completes before the write by Anonymous Coward · 2009-03-19 07:05 · Score: 1, Insightful

You disagree with his interpretation of the spec?
Well then, show us the relevant part of the spec that says things should happen in order.
It doesn't say that? It says instead to use fsync()?
Blame the FS all you people want, but the fact remains that the application writers screwed up big time; their code is not robust and will probably fail again in the future. Even with Ext3, the code was a ticking time bomb. If power is lost at the right time, the same results would happen.
Sure, it would be nice to have a FS that fixed the poorly made code people write, but that does not remove the blame from the application writers; it simply adds some to the FS writers for taking what was a good desktop FS and trying to turn it into a server FS. Desktop FSs need to deal with poor application code, and with frequent power losses, but poor code is still poor code.
Re:rename completes before the write by Spazmania · 2009-03-19 07:27 · Score: 1

If power is lost at the right time, the same results would happen.
The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.
If I fsync after every write, I can get reliability in ext2. I put up with the performance hit from ext3 and ext4 because I want the reliability in the filesystem instead of having to build it into every part of every application.

Re:rename completes before the write by Anonymous Coward · 2009-03-19 07:40 · Score: 0

As far as the filesystem is concerned, that's what happens. It just loses some (but not all) of that information when the system crashes. The rename is meta-data, so it's written to the journal. The actual data is not journaled, so it gets lost in the crash. The option to omit data from the journal is a performance optimization which has always had the consequence that some data can get lost in a crash while other (meta) data persists even though it was written later. The "window of opportunity" of that happening has just been small enough that most people never ran into that problem. I agree that the filesystem should try harder to maintain order of operations, but ultimately it can't decide if you prefer to rollback all metadata updates after the first lost data update or if you prefer to have metadata updates without the corresponding data. The filesystem would need a transaction API and the filesystem authors' argument is that database management systems exist, so there's no need to duplicate the effort in filesystems.
Re:rename completes before the write by Anonymous Coward · 2009-03-19 07:47 · Score: 4, Interesting

behaves precisely as demanded by the POSIX standard

Application developers reasonably expect
Apples and oranges. POSIX != "what app developers reasonably expect".
Of course you have a point insofar as that just pointing to POSIX and saying it's a correct implementation of the spec is not enough, but let's be clear here that one of these things is not like the other.
Re:rename completes before the write by noidentity · 2009-03-19 08:00 · Score: 1

Too bad it doesn't have something like Apple has had for ages, even in Mac OS Classic: FSExchangeObjects(). This call atomically exchanges two files; either the old file is intact on disk, or the new one's data is in its place. It seems that something is needed to signal to the filesystem that the rename is part of such an exchange operation, as opposed to a plain rename. But I don't know the details; perhaps the current interface already provides the filesystem enough information to determine when an exchange is occurring.
Re:rename completes before the write by Wodin · 2009-03-19 08:18 · Score: 2, Informative

If power is lost at the right time, the same results would happen.
The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.
No, not "hundredths of a second". Five seconds. Or 30 if you're using laptop mode.
https://bugs.launchpad.net/ubuntu/jaunty/+source/ecryptfs-utils/+bug/317781/comments/54

--
-- Wodin
Re:rename completes before the write by SanityInAnarchy · 2009-03-19 08:18 · Score: 1

Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order.
First: Where is this in the spec? At all?
Second: It's not "far apart in time". It's within a few fractions of a second.

It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.
Most uses of the filesystem, this really doesn't matter.
Let me put it this way -- why are you assuming it's file write, then a directory entry change? That makes sense in this case, but suppose I do this:
    cd /
    tar -xjpf foo.tar.bz2      # creates foo/
    touch /var/lock/whatever   # (and acquire the lock, somehow)
    ln foo/etc/foo /etc
    ln foo/usr/bin/* /usr/bin
    ...                        # or better, crawl it deliberately
    rm -rf foo
    rm /var/lock/whatever
Now, since most of those are directory operations, not actual file writes, suppose those hit disk out of order. That means I might have foo/usr/bin/* installed, but not foo/etc/foo. But the 'rm' might've claimed etc/foo, so I can't exactly roll the transaction forward by crawling foo, I have to start over.
Never mind the fact that while I'm doing this, the rest of the filesystem is in an inconsistent state, with respect to my contrived package manager.
So what this means is, with the way the POSIX spec is now, if you want to rely on stuff like that, you have to assume every directory entry hits disk in order -- or you have to flush to disk yourself at a few critical points.
Now, if you can call 'sync' immediately after installing everything, but before blowing away the installation folder ('foo'), I believe the above scheme actually works. In fact, it's more efficient that way, because the filesystem gets to reorder the directory structure before flushing. It's not as efficient as it could be, but it's better.
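
A minimal Python sketch of the scheme described above, assuming a POSIX-like system and that the staging and target directories live on the same filesystem (hard links require that). The function name `install_from_staging` and the directory layout are hypothetical, and the lock file from the shell version is left out. Note that POSIX only requires sync() to schedule the writes; Linux is documented as waiting for them to complete.

```python
import os
import shutil


def install_from_staging(staging_dir, target_dir):
    """Hard-link every file under staging_dir into target_dir, sync, then clean up."""
    for root, _dirs, files in os.walk(staging_dir):
        rel = os.path.relpath(root, staging_dir)
        dest_root = os.path.normpath(os.path.join(target_dir, rel))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            # Like `ln foo/etc/foo /etc`: a second name for the same inode.
            os.link(os.path.join(root, name), os.path.join(dest_root, name))

    # Ask the kernel to flush pending data and metadata before the
    # staging copy (the only other set of names) is removed.
    os.sync()

    shutil.rmtree(staging_dir)
```

The single `os.sync()` sits at the one critical point: everything before it may be reordered freely by the filesystem, and nothing destructive happens until after it.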

--
Don't thank God, thank a doctor!
Re:rename completes before the write by SanityInAnarchy · 2009-03-19 08:23 · Score: 3, Insightful

The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.
And if you fsync'd, the right time would be zero, on either ext3 or ext4. Or XFS, for that matter.

If I fsync after every write, I can get reliability in ext2.
No you can't. Reliability in ext2 would force you to sync not just your file, but whole directory structures -- and even then, you'd only be safe until something else starts writing.

I put up with the performance hit from ext3 and ext4 because I want the reliability in the filesystem instead of having to build it into every part of every application.
Too late.
All the journaling guarantees is that if you lose power, you won't have to fsck -- you'll get a filesystem which is internally consistent. Oh, and it also guarantees that you won't see circular directory entries, or an entire directory falling off the face of the planet, and other nastiness.
Whether it's consistent with respect to your application is completely outside the scope of the FS journaling, and is the responsibility of your application. Put it in a library, use a database, whatever -- but it's not the filesystem's fault that you failed to read the spec, nor is it very smart of you to code to ext3 instead of POSIX.

Re:rename completes before the write by MikeBabcock · 2009-03-19 08:26 · Score: 1

That's essentially how Reiser4's wandering logs work at a lower level. Not that anyone cares.

--
- Michael T. Babcock (Yes, I blog)
Re:rename completes before the write by nusuth · 2009-03-19 08:53 · Score: 2, Informative

Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write.
That sounds like a reasonable assumption, but it is certainly not reasonable to write code that depends on it. 60 seconds is an eternity for a computer, but so is a second. Therefore the fact that 60 seconds is much longer than what you would expect has no bearing on the situation. If your applications depend on frequent data writes, they will have exactly the same file zeroing problem regardless of the actual amount of delay. You can't know that a crash will happen at least, say, 0.06 seconds after a write and rename, so you will still be losing files on crashes, only 1000 times less frequently with a 0.06 sec delay instead of 60. Considering how many times the problematic idiom may be used in 0.06 seconds, and how many computers are using Linux, that is still an unacceptable way to write programs.
It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.
Ensuring rename happens after write is fundamentally different from not ensuring it but writing data frequently enough that it often happens that way. This is also exactly what has been done with ext3's ordered mode and what is being proposed for fixing ext4.

--
Gentlemen, you can't fight in here, this is the War Room!
Re:rename completes before the write by ObsessiveMathsFreak · 2009-03-19 08:53 · Score: 0

If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write. It seems dead obvious, at least to me....
It seems dead obvious to me that 'hyperbole' should be pronounced 'hi-per-bowl'. But the powers that be have deigned that it be pronounced 'hi-per-bowl-ee'. It's clear in both cases here, that the powers that be are talking out of their asses.
The problem here is not application developers, and it's not (primarily) ext4. The problem here is the POSIX standard. Following the POSIX standard, to the letter, has led to permanent data loss. I imagine that a write followed by a rename is a reasonably common operation. However, the POSIX standard all but ensures that this scenario will inherently lead to data loss in the event of power failure or crashes, the very situations the standards should have in mind to avoid.
The POSIX standards have a bug. It's time to revise them.

--
May the Maths Be with you!
Re:rename completes before the write by Tacvek · 2009-03-19 09:42 · Score: 1

I agree that the filesystem should try harder to maintain order of operations, but ultimately it can't decide if you prefer to rollback all metadata updates after the first lost data update or if you prefer to have metadata updates without the corresponding data.
Why should this be the case? Why should a metadata change ever be written to disk before a data change that preceded it in time? If the ordering were strictly enforced, the write-replace idiom would always work to ensure atomic updates of files. The metadata journaling would still ensure the integrity of the filesystem data structures, and all would be good.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
Re:rename completes before the write by Tacvek · 2009-03-19 09:45 · Score: 1

He is talking about a file write followed by the rename because he is discussing the real problem here, that Ext4 allows those to have the order reversed, making the write-replace idiom for atomic updates of files fail, which is discussed in TFA.

Re:rename completes before the write by Ornedan · 2009-03-19 09:53 · Score: 1

write. fsync. rename. fsync.
Oh, hey, look at all that data guaranteed not lost.
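
The four steps above can be sketched in Python roughly as follows, assuming a POSIX-like system where a directory can be opened read-only and passed to fsync(). The function name `atomic_replace` and the `.tmp` suffix are hypothetical choices for illustration, not anything taken from the bug report.

```python
import os


def atomic_replace(path, data):
    """Replace the file at path with data (bytes): write, fsync, rename, fsync."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-file naming convention

    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()             # push Python's userspace buffer to the kernel
        os.fsync(tmp.fileno())  # ask the kernel to put the data blocks on disk

    os.rename(tmp_path, path)   # swap the directory entry to the new file

    # fsync the containing directory so the rename itself is made durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The first fsync is the one that matters for the zero-length-file problem: it forces the new data out before the rename can be committed. The second only affects whether the rename survives a crash, and either outcome leaves a complete file behind.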
Re:rename completes before the write by Yokaze · 2009-03-19 10:42 · Score: 2, Informative

It is not about losing the data of the pending write; it is about losing data already written, because the operations complete in a different order than they were issued.

--
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
Re:rename completes before the write by spitzak · 2009-03-19 10:57 · Score: 1

And now EXT4 performs WORSE than EXT3. You have successfully removed all the advantages of it!
Re:rename completes before the write by Ornedan · 2009-03-19 11:13 · Score: 1

Writes may be re-ordered as the system sees fit, as long as end result is identical - but all guarantees are void in undefined situations like crashes. If your write order is critical, you have to enforce it by fsyncing.
The "transactional" write & rename behaviour is quite sensible, though. So there should probably be some easier mechanic to invoke it than calling fsync a bunch of times.
Re:rename completes before the write by Anonymous Coward · 2009-03-19 12:06 · Score: 0

Does it say in POSIX that it is *GUARANTEED* that the write is committed before the change to metadata?
Does it? No?
Well, there's your answer then. You are relying on something that is not stated anywhere. Try coding to the spec rather than what you imagine the spec to be.
Re:rename completes before the write by Spazmania · 2009-03-19 16:15 · Score: 1

No, hundredths of a second. The rename and file write are synced to the disk *at the same time*, which comes no more than 5 seconds after the app finishes writing.

Re:rename completes before the write by Spazmania · 2009-03-19 16:19 · Score: 1

Try coding to the spec rather than what you imagine the spec to be.
If I coded to spec instead of coding to what my customers wanted, I wouldn't have my very well paying job.

Re:rename completes before the write by Spazmania · 2009-03-19 16:28 · Score: 1

all guarantees are void in undefined situations like crashes
If I wanted a 100% guarantee that what I'd just written makes it to the disk, I'd fsync.
That isn't what I want.
What I want is to get it done more quickly than stop-and-sync would allow with a minimum of fuss inside my program while offering a high probability that if any of it made it to the disk then all of it made it to the disk. That's why I wrote/renamed instead of truncate/wrote.
Ext3 did it reasonably. Ext2 did too. Ufs on my old Sparcstation worked fine. Ext4 failed to meet expectations.
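
For readers unfamiliar with the two idioms being contrasted, here is a rough Python sketch. Neither function calls fsync(), which is exactly the point under dispute. The function names and the `.new` suffix are hypothetical.

```python
import os


def truncate_then_write(path, data):
    # Opening with "wb" truncates the existing file to zero bytes at once,
    # so the old contents are gone before any new data has been written.
    with open(path, "wb") as f:
        f.write(data)


def write_then_rename(path, data):
    # The old file stays untouched until the rename swaps the name over.
    tmp_path = path + ".new"  # hypothetical temp-file name
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.rename(tmp_path, path)
```

With the first form, the file is visibly empty for a moment even when nothing crashes. With the second, the expectation described in this thread is that a crash leaves either the old or the new contents; per the article, ext4's delayed allocation could instead leave a zero-length file because the rename reached the journal before the data blocks were allocated.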

Re:rename completes before the write by AvitarX · 2009-03-19 16:33 · Score: 1

I read that Launchpad bug, and then read up on laptop mode, but I can't find any evidence that laptop mode actually alters the order in which metadata and file contents are updated in Ext3, or when.
Can you please point me to a link that has evidence that this is the case? Or is your assumption that most people journal with the option data=writeback?
Because the default is data=ordered, which makes the interval of losing data that has been already written far smaller, while preventing the need to write everything twice.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:rename completes before the write by marcosdumay · 2009-03-20 03:59 · Score: 1

In fact, that time is smaller than a millisecond for modern disks. It is only the time needed to rewrite the entire inode pointer at the directory descriptor, which is, well, 8 (consecutive) bytes long.
If the power goes off in the middle of those 8 bytes, you get a corrupt file; if it goes off before that, you get the old file, and after that you get the new one. The temporary file may be corrupted if the power goes off while writing its data, but nobody cares about that.

--
Rethinking email

the workaround is bad design by girlintraining · 2009-03-19 05:54 · Score: 3, Insightful

Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it. The workaround is laughable -- "call fsync(), and then wait(), wait(), wait(), for the Wizard to see you." How about writing a filesystem that actually does journaling in a reliable fashion, instead of finger-pointing after the user loses data due to your snazzy new optimization and saying "The developer did it! It wasn't us, honest." Microsoft does it and we tar and feather them, but the guys making the "latest and greatest" Linux feature, we salute them?

We let our own off with heinous mistakes while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.

--
#fuckbeta #iamslashdot #dicemustdie

Re:the workaround is bad design by jd · 2009-03-19 06:02 · Score: 5, Funny

But... those of us who learned the Ancient And Most Wise ways always triple-sync. We also sacrifice Peeps and use red food colouring in voodoo ceremonies (hey, it really is blood, so it should work) to keep the hardware from failing.
On next week's Slashdot, there will be a brief tutorial on the right way to burn a Windows CD at the stake, and how to align the standing stones known as RAM Chips to points of astronomical significance.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:the workaround is bad design by morgan_greywolf · 2009-03-19 06:03 · Score: 2, Interesting

No, we don't salute them. If you ask me, no matter what Ted Ts'o says about it complying with the POSIX standard, sorry, but it's a bug if it causes known, popular applications to seriously break, IMHO.
Broken is broken, whether we're talking about Ted Ts'o or Microsoft.

Re:the workaround is bad design by Anonymous Coward · 2009-03-19 06:32 · Score: 0

You may let our own off with heinous mistakes while professionals who do the same thing get hung.
I do exactly the same thing in both cases... Not use it till it is fixed.
Re:the workaround is bad design by Dan667 · 2009-03-19 06:32 · Score: 2, Insightful

I believe a major difference is that Microsoft would just deny there was a problem at all. If they did acknowledge it, they certainly would not detail what it is.
Re:the workaround is bad design by sakdoctor · 2009-03-19 06:34 · Score: 1

wait(), wait(), wait(), for the Wizard to see you
There's no place like /home.
There's no place like /home.
There's no place like /home.
Re:the workaround is bad design by ManWithIceCream · 2009-03-19 06:47 · Score: 2, Informative

We let our own off with heinous mistakes while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.
Is Ted Ts'o not professional? Does he not get paid? Ts'o's employed by the Linux Foundation, on leave from IBM. Free Software does not mean volunteer-made software!
Re:the workaround is bad design by TheMMaster · 2009-03-19 06:48 · Score: 3, Insightful

Actually, no.
Microsoft runs a proprietary show where they 'set the standard' themselves. Which basically means 'there is no standard except how we do it'.
Linux, however, tries to adhere to standards. When it turns out that something doesn't adhere to standards, it gets fixed.
Another problem is that most users of proprietary software on their proprietary OS don't have the sources to the software they use, so if the OS fixes something that was previously broken, but the software version used is 'no longer supported' the 'fix' in the OS breaks the users' software and the user has no option of fixing his software.
THIS is why a) microsoft can't ever truly fix something and b) why using proprietary software screws over the user.
Or would you rather have OSS software do the same as proprietary software vendors and work around problems forever but never fixing them? Saw that shiny 'run in IE7 mode' button in IE8? that's what you'll get...

--
Fighting for peace is like fucking for virginity
Re:the workaround is bad design by Hatta · 2009-03-19 06:51 · Score: 4, Insightful

If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it.
You must be reading a different slashdot than I am. The popular opinion I see is that this is very bad design. If the spec allows this behavior, it's time to revisit the spec.

--
Give me Classic Slashdot or give me death!
Re:the workaround is bad design by try_anything · 2009-03-19 07:00 · Score: 0

Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it.
I think what we've learned is that there's a bug in the POSIX standard, and Ext4 exploits the bug to deliver high measured performance in a way that is actually bad for users. So it's a benchmark hack on top of a flawed spec -- all in all, a shit sandwich for users.
That's not to say that Ext4 is bad technology. It sounds like it will deliver on its performance promises on systems that run well-written, failure-resistant software. It just won't work with the software that desktop users currently use. It will take a while for this to get sorted out, and we have to moderate our expectations from "everyone switches to ext4 and gets an automatic speed boost" to "wait and see; desktop users might not benefit from it anytime soon."
Re:the workaround is bad design by DragonWriter · 2009-03-19 07:02 · Score: 1

Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie
If Microsoft simultaneously sacrificed backwards compatibility and correctly implemented a standard, we'd probably be left completely speechless.
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 07:02 · Score: 0

Yeah, if I had mod points (and was logged in!) I'd give them to you.
I'm an old Unix administrator who worked on Unix systems back in the early 1980s and always always always did a triple sync especially before shutdown.
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 07:04 · Score: 0

While we're talking about Microsoft, I've had several instances of NTFS files being zero filled on system crash.
Re:the workaround is bad design by Xtravar · 2009-03-19 07:14 · Score: 1

What? When Microsoft made IE more standards-compliant, everyone was happy even if it broke legacy applications/sites.
You, sir, are making no sense.
If Microsoft broke stuff to make their OS POSIX compliant, we'd all be really happy!

--
Buckle your ROFL belt, we're in for some LOLs.
Re:the workaround is bad design by gnasher719 · 2009-03-19 07:18 · Score: 1

I think what we've learned is that there's a bug in the POSIX standard, ...

It is not exactly a bug in the standard. There is a standard, and there is QOI (Quality of Implementation). When you write data, POSIX says that the data is vulnerable for a time interval of unknown length. A good implementation will replace "unknown length" with "length zero", or "length almost zero". ext4 decided that "unknown length" can mean "two minutes". QOI = zero.
Re:the workaround is bad design by CannonballHead · 2009-03-19 07:22 · Score: 1

ln -s /home /away
or...
mkdir /away; cp -Rf /home/* /away;
... yes there is!
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 07:28 · Score: 0

"This is what the spec says, you're all doing it wrong" sounds like denying to me.
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 07:47 · Score: 1, Informative

This would never happen with Microsoft, they're all for crippling their OS just so it can be backwards compatible with broken applications
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 07:55 · Score: 0

You are a freetard liar.
Re:the workaround is bad design by shutdown+-p+now · 2009-03-19 07:59 · Score: 1

I believe a major difference is that Microsoft is very, very keen on backwards compatibility, and would not likely knowingly break things that bad.
Re:the workaround is bad design by wastedlife · 2009-03-19 09:06 · Score: 1

From what I understand, the only data lost is when the application calls sync() instead of fsync(). The POSIX spec in question has long been that a sync() does not guarantee that the data is written until the next scheduled write. The problem exists in other file systems, but the write-ahead time is so short as to be less likely to cause a problem. EXT4 has such a long write-ahead time that now it can cause a problem.

--
Said, "It's just like dice but it's got more sides And it tells me who lives and who dies"
Re:the workaround is bad design by wastedlife · 2009-03-19 09:19 · Score: 1

Well, I looked into this a bit more and understand now. The problem is not that data that hasn't been fsync()'ed isn't guaranteed. It is that the rename() operation replaces the old file before the new file is written to disk. My bad, that is bad behavior.

Re:the workaround is bad design by adiposity · 2009-03-19 10:07 · Score: 1

Or would you rather have OSS software do the same as proprietary software vendors and work around problems forever but never fixing them? Saw that shiny 'run in IE7 mode' button in IE8? that's what you'll get...
Poor example. This is Microsoft fixing a problem by adhering to standards. Unfortunately, their previous software was buggy and didn't follow standards, and people designed for that flawed system. Now they are trying to force everyone into standards compliance, but offering a backwards compatibility mode the only way they can without harming future standards compliance. I fail to see how this represents working around problems forever and not fixing them.
Not that microsoft doesn't do this...but I don't see this as being a great example.
And standards are great, as long as they are documented, but standards aren't always as good as something that "just works." You can design something to spec, but if that spec hasn't considered all usability cases, the spec does not save the app from being bad.
In my opinion, although IE5 was horrible at following standards, it was a superior browser at the time to other alternatives because of its speed and basic compatibility with all sites. Now-a-days I am using Firefox for other reasons. A spec used by 20% of the world vs. a de facto spec used by 80% is not necessarily better (to take w3c vs IE as an example). That's why Firefox has "quirks mode," and it's basically the same idea as "run in ie7" but really worse, because it does even less to encourage writing to standards.
-Dan
Re:the workaround is bad design by stevied · 2009-03-19 10:10 · Score: 2, Informative

The "workaround" is understanding how the platform you're targeting actually works rather than making guesses. fsync() and even fdatasync() have been around for ages and are documented. *NIX directories have always just been more or less lists of (name,inode_no) tuples, which is why hard links are part of the platform. There isn't really any magical connection between an inode and the directories it happens to be listed in.
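
The "(name, inode_no) tuples" point can be checked with a small Python sketch, assuming a POSIX filesystem that supports hard links. The function name `link_demo` and the file names are hypothetical.

```python
import os


def link_demo(directory):
    """Create a file, hard-link it, and show both names refer to one inode."""
    original = os.path.join(directory, "original")
    alias = os.path.join(directory, "alias")

    with open(original, "w") as f:
        f.write("payload")

    os.link(original, alias)  # add a second directory entry for the same inode

    st_original, st_alias = os.stat(original), os.stat(alias)
    same_inode = st_original.st_ino == st_alias.st_ino
    link_count = st_original.st_nlink  # number of names pointing at the inode

    os.unlink(original)       # drop one name; the inode and its data remain
    with open(alias) as f:
        survived = f.read()

    return same_inode, link_count, survived
```

Removing one name does not touch the data, which is why rename() is purely a directory operation and carries no implicit promise about the contents of the inode it points at.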
Ted knows this stuff inside and out and is almost ridiculously reasonable compared to many people I've met with his level of expertise. The patches to enable the actual workaround were available pretty much at the same time the awareness of this bug hit the mainstream. Given the flak he was taking, the fact that he expressed his opinions about the way some of the userspace software may or may not have been behaving doesn't seem unreasonable.
The answer here is (1) roll out the workaround so nobody is horribly surprised when the latest distros ship with ext4, and (2) for developers to _listen_ to the guy who knows what he's talking about and fix their apps, ideally by providing some standard functions in the GNOME / KDE / etc. libs to handle the common situation, thus allowing the full performance advantages to be extracted from all the hard work that's been put into ext4 (and other file systems.)
There are a relatively small number of people in the world who are worth listening to when they say something. Take a lesson from a guy with a 3 digit UID (sorry to pull rank, but sometimes it has to be done!), and let me tell you that Ted Ts'o is one of them.
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 10:19 · Score: 0

There's no place like /home.
I need this on a T-shirt. :)
Re:the workaround is bad design by 21mhz · 2009-03-19 10:32 · Score: 1

That "time interval of unknown length" was left in the spec for a reason. One possible reason I see is that untarring a large tree into a networked filesystem does not result in a massive amount of synchronous operations. And in ext3 the data vulnerability interval is still 5 seconds by default, not exactly zero.
Now, the real issue was people assuming that the order of operations is preserved between changing the file data and changing the filesystem metadata. This, too, is an assumption that goes beyond what the specification actually guarantees. It restricts how the filesystem can be implemented, with severe implications for performance, and it actually fails on a few modern filesystems besides ext4.

--
My exception safety is -fno-exceptions.
Re:the workaround is bad design by ChaosDiscord · 2009-03-19 10:34 · Score: 2, Insightful

"The workaround is laughable -- 'call fsync(), and then wait(), wait(), wait(), for the Wizard to see you.'"
The "workaround" has been the standard for decades! Twenty years ago when I was learning programming I was warned: Until you call fsync(), you have no guarantee that your data has landed on disk. If you want to be sure the data is on the disk, call fsync(). While it's a complication for application developers, the benefit is that it allows filesystem developers to make the filesystem faster. That ext3 in its default configuration happened to work as erroneously expected has always been a happy coincidence, not something to rely on.
You might as well be complaining about the "workaround" that you have to shut down your computer properly instead of yanking the cord out of the wall, since it didn't use to lose data when you did that.

--
Search 2010 Gen Con events
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 10:39 · Score: 0

You misunderstand the problem. The problem isn't that the data isn't on the disk, but that the events have been recorded to the disk out of order. It's the ordering that's the problem, not the delay. Actually that's part of what is wrong with fsync here, because it's too strong for the situation.
Re:the workaround is bad design by spitzak · 2009-03-19 11:05 · Score: 1

No, you are misunderstanding the problem. It has nothing to do with the time; it has to do with the order in which things happen.
A VAST amount of software writes data to a and then does rename(a,b), assuming that after a crash, b will either have its old contents or have what was written to a before the rename. This is perfectly logical, was true on virtually every Unix filesystem, and is also a useful operation that has no equivalent in POSIX (fsync does not count because it forces far more to happen than the programmer wanted).
It is perfectly ok if both the data and rename are delayed until next week. All that is needed is that atomically the file b is either the old or new version. It can be the old one for a long time.
Re:the workaround is bad design by spitzak · 2009-03-19 11:11 · Score: 1

You misunderstand the problem. The behavior of EXT3 is POSIX compliant.
If Microsoft broke back-compatibility and it made no difference to how POSIX compliant they were, there would not be anybody defending them here!
The other annoying thing to me is people acting like POSIX is some sort of flawless commandments from god. It has mistakes and oversights, and saying "it conforms to the spec" is a total whitewashing of this situation. Microsoft could claim their changes obey all the MSDN documentation and that is equally bogus.
Re:the workaround is bad design by girlintraining · 2009-03-19 13:34 · Score: 1, Flamebait

THIS is why a) microsoft can't ever truly fix something and b) why using proprietary software screws over the user.
As opposed to Linux, where developers can't get something working and then leave it alone? I fail to see any difference in end result as a user.

Or would you rather have OSS software do the same as proprietary software vendors and work around problems forever but never fix them?
Right, because Linux has never had a problem that was just too costly to fix and so a workaround was devised.

Saw that shiny 'run in IE7 mode' button in IE8? that's what you'll get...
As opposed to the shiny 'All your data are belong to us' button on EXT4?
Your post is nothing more than FUD, it's just pointing the other way now.

--
#fuckbeta #iamslashdot #dicemustdie
Re:the workaround is bad design by gzipped_tar · 2009-03-19 13:53 · Score: 1

This has nothing to do with journalling. Ext3/4 journals have been stable enough so you don't have FS structural corruption (believe me, it would be nasty to have a directory failing to reference its parent). Data safety is another matter, which is equally important, but it's on a different level -- both the FS and the applications should be held responsible for data safety ("data" by itself doesn't make sense outside certain applications). Maybe I could improvise with the quote on War and put it in this way: "data safety is too important to be left up to FS developers alone."
As for the finger-pointing, there are no adverse side effects if you just ignore it ;)

--
Colorless green Cthulhu waits dreaming furiously.
Re:the workaround is bad design by jpmorgan · 2009-03-19 13:55 · Score: 1

The problem here is the standard is broken, and has been broken for a very long time. But, there are decade-old best practices which worked in 99.99% of situations in the past, so it was never critical to fix the standard. Ext4 may be 100% compliant with the published standard, but it's 0% compliant with best practices, the de facto standard. What used to be a perfectly good and cheap way to accomplish a simple task (doing an atomic update to a file) now has to be replaced with a very expensive mechanism (fsync) to achieve the same results. The reason for this change was to improve performance, but now every program is going to have to be changed to fsync whenever it does a write+rename, completely negating any benefit that this supposed performance enhancement brought. It's idiotic.
And you can bitch about Microsoft, but they've gone the opposite route to Ext4. NTFS now supports transactional updates, so you can make updates and never have to worry about this kind of situation.
Re:the workaround is bad design by Vexorian · 2009-03-19 15:39 · Score: 1

As opposed to the shiny 'All your data are belong to us' button on EXT4?
It seems you don't understand what the problem with ext4 actually is. I recommend that you actually try reading the rest of the posts before talking. First of all, ext3 is live and working, there's no reason whatsoever to move to ext4 as of now, unless you want a speed boost. Second "all your data is belong to us" seems far from what is going on, did you hear this only happens when the power is gone? Since when is it that obvious to expect all data to reliably survive a power shutdown? Even with journaling FS it is still a problem. Third, the ext4 developers are not saying "all your data belongs to us" to users. Not at all, this is more of developers vs. developers. The users will just have to wait until one side changes what was going on.
POSIX compliance is probably important if you are coding a POSIX app and want it to work correctly. Most devs are using a layer or three above the POSIX API, think of firefox which uses sqlite which correctly calls fsync. Or think of a Mono app which uses its .net imitation which calls the API. Or even the C++ guys using their fstream object... I just mean this doesn't even affect most developers... I guess you were just reading the words "data losses" in a topic that relates to Linux and wanted to prove your theory of how free software is like Microsoft, eh?

--

Copyright infringement is "piracy" in the same way DRM is "consumer rape"
Re:the workaround is bad design by Anonymous Coward · 2009-03-19 17:03 · Score: 0

Every time something negative about Linux (or whatever free project you want to substitute) is posted, there is a comment like this, that likes to point out some perceived hypocrisy. The problem is that these posters never seem to actually read any comments about said free projects.
From what I've seen, most people disagree with ext4's design. If you took a second to actually read the comments, you might see that.
What it really is, I figure, is a ploy to get moderated up. The problem is that moderators don't seem to care about what a post says. Rather, if it appears to make an "unpopular" statement, but without the juvenile delivery that is so common on the internet, moderators can't praise it fast enough. What if that unpopular statement is utterly wrong? Well, it doesn't matter, because the poster is "brave", or some other suitable adjective.
My plea to Slashdot readers is this: don't fall for it. Just because someone says something unpopular doesn't make him right, or worth listening to. This goes double, triple, or even quadruple for those who say "I know I'll be modded down for this", or even worse, "mod me down if you must". These posters are setting themselves up as martyrs, but by using that phrasing they inevitably get naive moderators to reward them.
The OP here did a better job than the "mod me down" people, but is still exploiting the moderators' naivete. You can't really fault the posters who do this, I suppose; so long as moderators are giving out free points, why not take advantage? But the dialogue here would be much better if people didn't post with a desire to receive "5, Insightful".
I'm not against taking an unpopular stand. There are certainly times it's warranted. I'm against taking an unpopular stand when your foil is imaginary. The OP here is lambasting a fictional group of people.
Re:the workaround is bad design by bconway · 2009-03-20 01:32 · Score: 1

Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie
Really? I thought we praised them for following standards when they did exactly that with IE8's default rendering mode.

--
Interested in open source engine management for your Subaru?
Re:the workaround is bad design by girlintraining · 2009-03-20 09:23 · Score: 1

First of all, ext3 is live and working, there's no reason whatsoever to move to ext4 as of now, unless you want a speed boost.

Or use NTFS, which has the same features and works now.

Second "all your data is belong to us" seems far from what is going on, did you hear this only happens when the power is gone?
If the system crashes, or someone frobs the reset button, it's the same deal.

Since when is it that obvious to expect all data to reliably survive a power shutdown?

The idea behind journaling is not to save all the data, but to leave the data in a consistent state. ext4 fails this test.

Not at all, this is more of developers vs. developers. The users will just have to wait until one side changes what was going on.

Uhhmm, and what about the users who use products made by developers on the wrong side of this "debate"? Too bad for them?

I guess you were just reading the words "data losses" in a topic that relates to Linux and wanted to prove your theory of how free software is like Microsoft, eh?
Software is software, I don't care to get into religious debates about which one is the One True Software. I care about two things as a professional: Reliability, and performance, and in that order. I won't sacrifice reliability to gain performance. Microsoft understands this. Linux is still learning.

--
#fuckbeta #iamslashdot #dicemustdie
Re:the workaround is bad design by Anonymous Coward · 2009-03-22 06:31 · Score: 0

"Microsoft runs a proprietary show where they 'set the standard' themselves" - by TheMMaster (527904) on Thursday March 19, @02:48PM (#27259629) Homepage
They sure do, and that standard runs on a good 95% of the world's PC's, from home user systems, up thru departmental LANs, & right up into enterprise-wide WANs + Back Office server type applications, on the most used hardware platform there is for personal computers & servers, in x86...
A standard that acts as the official disseminator of trade data @ NASDAQ using Windows Server + SQLServer 2005, & it has done so via failover clustering, for years now in a stable & consistent manner, running into the fabled "5-9's" of 99.999% uptime, since 2006 to present day:
----
NASDAQ Migrates to SQL Server 2005:
http://windowsfs.com/enews/nasdaq-migrates-to-sql-server-2005 [windowsfs.com]
----
AND, that stability also has been seen by end-users, once they FULLY 'security-harden' their Windows NT-based OS of modern variety, as shown here via quoted testimonial:
----
HOW TO SECURE Windows 2000/XP/Server 2003 & even VISTA, + make it "fun-to-do", via CIS Tool Guidance (& beyond):
http://www.tcmagazine.com/forums/index.php?s=9783f30ecf36d1be841544233b95fdf8&showtopic=2662&st=0&start=0
----
USER FEEDBACK/TESTIMONIAL:
http://www.xtremepccentral.com/forums/showthread.php?s=c96cb88da236d4122a8aef2235caec6b&t=28430&page=3
(Using a verbatim quote/User Testimonial, of 1++ yr. virus/spyware/trojan/rootkit/worm/malware-in-general trouble-free stable, fast, & secure operation as the result while using Microsoft Windows once security-hardened)
----
"Its 2009 - still trouble free!
I was told last week by a co worker who does active directory administration, and he said I was doing overkill. I told him yes, but I just eliminated the half life in windows that you usually get. He said good point.
So from 2008 till 2009. No speed decreases, its been to a lan party, moved around in a move, and it still NEVER has had the OS reinstalled besides the fact I imaged the drive over in 2008.
Great stuff!
My client STILL Hasn't called me back in regards to that one machine to get it locked down for the kid. I am glad it worked and I am sure her wallet is appreciated too now that it works. Speaking of which, I need to call her to see if I can get some leads.
APK - I will say it again, the guide is FANTASTIC! Its made my PC experience much easier. Sandboxing was great. Getting my host file updated, setting services to system service, rather than system local. (except AVG updater, needed system local)"
THRONKA @ xtremepccentral.com
----
As the saying goes?
"Nuff said"
----

"THIS is why a) microsoft can't ever truly fix something and b) why using proprietary software screws over the user." - by TheMMaster (527904) on Thursday March 19, @02:48PM (#27259629) Homepage
That's funny: I don't see NTFS hosing people over like this ext4 fsync data-loss related problem...
APK
P.S.=> You can say what you wish in reply, I won't be there to see it. What I do know, is the facts noted above (& you are welcome to dispute them (to no avail, because facts ARE FACTS, & this article's topic even BACKS my 2nd statement in & of itself))... apk

Show some respect! by LotsOfPhil · 2009-03-19 05:56 · Score: 5, Funny

...new solutions have been provided by Ted Ts'o to...

That's General Ts'o to you!

--
This post climbed Mt. Washington.

Re:Show some respect! by Anonymous Coward · 2009-03-19 06:17 · Score: 0

I don't think his file system will ever top his chicken.
Re:Show some respect! by TheGratefulNet · 2009-03-19 06:57 · Score: 1, Funny

"what's a matter, colonel? CHICKEN?"
sorry.

--
"It is now safe to switch off your computer."

Re:LOL: Bug Report by jd · 2009-03-19 05:57 · Score: 0

I wish to suggest that this is the immediate solution. The complete solution involves a truckload of pissed-off users storming a POSIX committee meeting and bashing the committee members over the head with clue sticks.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:LOL: Bug Report by Z00L00K · 2009-03-19 05:58 · Score: 4, Insightful

This is the problem with new features - the users have problems using them until they fully understand and appreciate the advantages and disadvantages.

And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2. If you want a really stupid filesystem go FAT and prepare for a patent attack.

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 05:59 · Score: 0

If an application decides to check the name of the file system and if the name is "ext4" it erases everything in your home directory, should that be considered a file system bug too?

More like ext3? by Anonymous Coward · 2009-03-19 05:59 · Score: 0

...does that make it ext4-, ext3.99, ext4less?

I sit just me? by IMarvinTPA · 2009-03-19 06:04 · Score: 2, Insightful

I sit just me, or would you expect that the change would only be committed once the data was written to disk under all circumstances?
To me, it sounds like somebody screwed up a part of the POSIX specification. I should look for the line that says "During a crash, lose the user's recently changed file data and wipe out the old data too."

IMarv

--
Trusting software vendors is no smarter than trus

Re:I sit just me? by Anonymous Coward · 2009-03-19 06:07 · Score: 1, Insightful

Aye, standards aren't perfect. If a part doesn't make sense, it should be avoided and an updated standard created to address the issue. What somebody decided years back isn't always the best solution.
Re:I sit just me? by Em+Emalb · 2009-03-19 06:07 · Score: 3, Funny

Nope, not just you, I sit also.

--
Sent from your iPad.
Re:I sit just me? by IMarvinTPA · 2009-03-19 06:17 · Score: 1

I sit is it? Hmm.
IMarv

--
Trusting software vendors is no smarter than trus
Re:I sit just me? by Anonymous Coward · 2009-03-19 06:37 · Score: 0

There are proper ways to do what both GNOME and KDE are doing - they just choose to do it wrong, and dependent on a specific implementation's behaviour. They then discover their implementation is complete shit, wrong, and broken according to the POSIX specifications. They then decide to bitch and moan about the FS rather than fix their horribly broken code.
Next, a mass of ignorant users who don't understand what the hell they are talking about in the first place then complain loudly because they are clueless parrots - who are seemingly lucky to locate their keyboard.
Simply put, GNOME/KDE both need to simply fix their shit instead of passing the buck on their horribly buggy code based on notions which are well known to be false by any competent POSIX coder. In short, anyone that is wagging a finger at ext4 is ignorant. Anyone that isn't wagging a finger at GNOME/KDE and DEMANDING they fix their broken behaviour is an idiot.
Re:I sit just me? by Anonymous Coward · 2009-03-19 08:06 · Score: 0

Yes, we've heard lots of great suggestions for fixing it - like running a zillion fsyncs every second, or having your mundane file system tasks run on a full-fledged database system on top of the normal filesystem.
Remember, file system developers don't assign blame, they take responsibility. Just suck it up and guarantee the damn write-rename order already.

Workaround is disaster for laptops by victim · 2009-03-19 06:05 · Score: 5, Insightful

The workaround (flushing everything to disk before the rename) is a disaster for laptops or anything else which might wish to spin down a disk drive.

The write-replace idiom is used when a program is updating a file and can tolerate the update being lost in a crash, but wants either the old or the new to be intact and uncorrupted. The proposed sync solution accomplishes this, but at the cost of spinning up the drive and writing the blocks at each write-replace. How often does your browser update a file while you surf? Every cache entry? Every history entry? What about your music player? Desktop manager? All of these will spin up your disk drive.

Hiding behind POSIX is not the solution. There needs to be a solution that supports write-replace without spinning up the disk drive.

The ext4 people have kindly illuminated the problem. Now it is time to define a solution. Maybe it will be some sort of barrier logic, maybe a new kind of sync syscall. But it needs to be done.

Re:Workaround is disaster for laptops by Anonymous Coward · 2009-03-19 06:20 · Score: 0

"There needs to be a solution that supports write-replace without spinning up the disk drive."
You mean to write on the disk without spinning it up?
How does ext3 do that?
The fix for the ext4-problem is so easy:
Good code:
fwrite()
fclose() - no extra spinning of disc
Bad code:
fwrite()
fclose()
rename() - rename may replace old file without new file on disk
Fixed code:
fwrite()
fsync() - sync this file before close
fclose()
rename()
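For anyone who wants to try the "fixed code" sequence above, here is a minimal sketch in Python rather than C, since Python's os module exposes the same calls. The function name and the ".tmp" suffix are made up for illustration. One detail the listing glosses over: fsync() works on a file descriptor, so buffered stdio data has to be flushed to the kernel first (fflush() in C, f.flush() here).

```python
import os


def replace_file_durably(path, data):
    """Write data to a temporary file, force it to disk, then rename it over path."""
    tmp_path = path + ".tmp"           # hypothetical naming scheme, not from the thread
    with open(tmp_path, "wb") as f:
        f.write(data)                  # fwrite()
        f.flush()                      # hand Python's userspace buffer to the kernel
        os.fsync(f.fileno())           # fsync(): block until the kernel reports the data written
    # fclose() happens when the with-block exits
    os.replace(tmp_path, path)         # rename(): swap the new file in under the old name
```

The os.fsync() call is what makes the process wait for the disk, so this sketch buys the ordering guarantee at the price of a synchronous flush per update.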
Re:Workaround is disaster for laptops by GMFTatsujin · 2009-03-19 06:20 · Score: 2, Insightful

If the issue is drive spin-up, how have the new generation of flash drives been taken into account? It seems to me that rotational drives are on their way out.
That doesn't do anything for the contemporary generations of laptop, but what would the ramifications be for later ones?
Re:Workaround is disaster for laptops by Anonymous Coward · 2009-03-19 06:24 · Score: 0

All that temporary file usage should reside in /tmp, which anyone with a modicum of knowledge will mount to RAM, especially on desktops and laptops.
Re:Workaround is disaster for laptops by Anonymous Coward · 2009-03-19 06:34 · Score: 0

All that temporary file usage should reside in /tmp, which anyone with a modicum of knowledge will mount to RAM, especially on desktops and laptops.
Are you nuts? My 4 gig laptop has a 10 gig /tmp partition. When I replace it next year, it becomes a "california server", so that's why I set the fs up that way.
It already serves as a linux development platform, so a big /tmp is needed.
Re:Workaround is disaster for laptops by Anonymous Coward · 2009-03-19 06:43 · Score: 0

No. He wants to write to the disk eventually (but is quite happy for the write to not happen for a long time) without the file being lost.
As long as the rename is not permitted to happen until the data has actually been written, there's no problem with nothing being written. The file on disk still has the entire old contents until the data is written out and then the rename is done, after which it has the entire new contents. If the file is overwritten 20 times, then whenever the OS decides to spin up the disk and clear the cache, the last updated copy of the file is written and the others could potentially (if the file system is smart enough) never have to be written.
Re:Workaround is disaster for laptops by Kjella · 2009-03-19 06:48 · Score: 5, Informative

Fixed code:
fwrite()
fsync() - sync this file before close
fclose()
rename()
Either you're a troll or an idiot, since you're AC'ing I guess I got trolled. This will sync immediately and kill performance and battery life, since every block must be confirmed written before the process can continue. What you need to fix this is a delayed rename that happens after the delayed write.
Problem:
fwrite()
fclose()
rename()
*ACTUAL RENAME*
*TIME PASSES* <-- crash happens here = lose old file
*ACTUAL WRITE*
Real solution:
fwrite()
fclose()
rename()
*TIME PASSES* <-- crash happens here = keep old file
*ACTUAL WRITE*
*ACTUAL RENAME*
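The two timelines above can be checked with a toy model. This is only a sketch under simplifying assumptions: the "disk" is two Python dicts, the file names and inode numbers are invented, and a crash is modelled as "only the first N committed steps survive". It is not how ext4 is implemented.

```python
def contents_of_b_after_crash(commit_order, steps_committed):
    """Return what the name 'b' holds if only the first
    steps_committed entries of commit_order reached the disk."""
    # A directory is modelled as a mapping from names to inode numbers.
    inodes = {1: "old data", 2: ""}       # inode 2 is the new, still-empty temp file
    directory = {"b": 1, "b.tmp": 2}
    for step in commit_order[:steps_committed]:
        if step == "write":
            inodes[2] = "new data"        # data blocks finally allocated and written
        elif step == "rename":
            directory["b"] = directory.pop("b.tmp")   # 'b' now names inode 2
    return inodes[directory["b"]]


# Crash after 0, 1 or 2 committed steps, for both commit orders.
for order in (["rename", "write"], ["write", "rename"]):
    outcomes = [contents_of_b_after_crash(order, n) for n in range(3)]
    print(order, outcomes)
```

With the rename committed first, the middle crash point leaves b as an empty file. With the write committed first, b is always either the complete old data or the complete new data.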

--
Live today, because you never know what tomorrow brings
Re:Workaround is disaster for laptops by RiotingPacifist · 2009-03-19 06:58 · Score: 1

good code - unless there is a crash during the writing of the file, in which case your software is screwed next time you try and read the config file.
bad code - safe as long as the filesystem isn't ext4 or really old versions of XFS/reiserfs.
fixed code - yeah lets abuse fsync and slow the users system down.

--
IranAir Flight 655 never forget!
Re:Workaround is disaster for laptops by Anonymous Coward · 2009-03-19 07:02 · Score: 0

Why do you want to rename before the new file is synced?
Writing data and renaming are independent actions and don't have to be executed in order. This is what the standard says. I assume that you, too, did not read the standard before you made up your opinion about what it should say about the issue, but doesn't.
Re:Workaround is disaster for laptops by dshadowwolf · 2009-03-19 07:11 · Score: 3, Informative

And you don't get it... The truth is that Ext4 was writing the journal out before any changes took place. This means that when the crash happens between the metadata write and the actual write a replay of the journal will cause data loss.
Other filesystems with delayed allocation solve this by not writing the journal before the actual data commits happen. The fix that TFA is talking about introduces this to Ext4.
Re:Workaround is disaster for laptops by david_thornley · 2009-03-19 07:15 · Score: 3, Informative

In which case the standard sucks, big time, and finding a loophole that trashes normal expected behavior should not be cause for rejoicing.
There needs to be a way to write a file such that either the old or the new is preserved. Agreed on this?
Now, in a file system that's going to run real well, there needs to be a way to delay writes in order to batch them. Agreed on this?
We have two reasonable demands here. Pick one, because that's all you're going to get.
Currently, in order to keep either the old or new file, it's necessary to write the new file right now. This is the standard behavior, and it trashes performance. Alternatively, the writes can be batched up for later, for good performance, and we run the risk of losing both old and new versions of a file.
In other words, in order to optimize the heck out of the file system, it's necessary to trash the performance.
What we need is a way to do the rewrite-rename thing in a way so it can be safely delayed, so the file system can batch up a lot of writes to do in a really fancy optimized way, but writing the new file fully before renaming it. There's no obvious reason to me why the file system can't keep track of this and guarantee the order. It may not be required by the standard, but that's no excuse for not implementing it.

--
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
Re:Workaround is disaster for laptops by Anonymous Coward · 2009-03-19 07:17 · Score: 0

(1) good code - unless there is a crash during the writing of the file, in which case your software is screwed next time you try and read the config file.
(2) bad code - safe as long as the filesystem isn't ext4 or really old versions of XFS/reiserfs.
(3) fixed code - yeah lets abuse fsync and slow the users system down.
1: The write is either committed or not, as in ext3, thanks to the journal. The software doesn't care whether the write is committed at a particular point in time.
2: This uses a feature that is not in the POSIX standard. So don't rely on it, or change the standard.
3: In what way is the system slowed down if the write needs to be committed anyways and you only sync that particular file?
Sure, there should be a function fsync_then_rename() this would create the old behaviour.
Re:Workaround is disaster for laptops by BigBuckHunter · 2009-03-19 07:21 · Score: 2, Informative

There needs to be a solution that supports write-replace without spinning up the disk drive.
How do you intend on writing to the disk drive... without spinning it up? Is this not what you're asking? If this is indeed your question, the answer is already "by using a battery backed cache".

BBH
Re:Workaround is disaster for laptops by shutdown+-p+now · 2009-03-19 08:02 · Score: 1

Drive spin-up is just the symptom, the problem is more fundamental. Having to do explicit fsync() in this case still slows things down significantly for "lots of small files" case (such as browser cache).
Re:Workaround is disaster for laptops by MikeBabcock · 2009-03-19 08:38 · Score: 1

Real solution is more like VIM:
fwrite("oldfilename.dat~"); fclose();
/* time passes */
/* Crash -- recoverable new data and old data available */
/* User exits normally */
rename("oldfilename.dat~", "oldfilename.dat"); fsync(); exit();
Making the OS decide which data needs to be fsync'd when is just silly. When to sync the data I'm copying off my memory card vs. the blocks coming in from BitTorrent vs. the temp files my Java game uses is stochastic and arbitrary, and if one of those apps knows its data needs to be preserved because it's irreplaceable, it ought to fsync() it.

--
- Michael T. Babcock (Yes, I blog)
Re:Workaround is disaster for laptops by lacoronus · 2009-03-19 08:45 · Score: 1

I agree with parent.
In all other caching schemes I'm aware of, there are rules as to how one may reorder the operations.
One option would be to separate writing the data from delineating transactions. For example, we could have an "io barrier" (analogous with a memory barrier), that says that no operation may be moved across io-barrier calls, but this gives no guarantee of the data having been written to disk.
So, you could have:
fwrite();
fclose();
io_barrier(); // 1
rename();
io_barrier(); // 2
This would prevent the rename from being moved up above the first io-barrier, or the write from being moved below the first. (The barrier calls themselves may not be moved around relative to any io-operation or each other.) This means that if the rename() has really executed and been written to disk, then we know for sure that the fwrite() and fclose() have run. However, we only guarantee the order of operations, not that we've actually written stuff to disk, so there is no disk-thrashing.
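To make the barrier idea concrete, here is a small sketch that enumerates which commit orders a filesystem would be allowed to pick. Everything in it is hypothetical: io_barrier() is the call proposed above, not an existing syscall, and the model simply assumes operations may be reordered freely between barriers but never across one.

```python
from itertools import permutations

BARRIER = "io_barrier"   # stands in for the hypothetical io_barrier() call


def allowed_commit_orders(ops):
    """List every order in which the queued operations may reach the disk."""
    # Split the queue into segments separated by barriers.
    segments = [[]]
    for op in ops:
        if op == BARRIER:
            segments.append([])
        else:
            segments[-1].append(op)
    # Operations may be shuffled inside a segment, but segments stay in sequence.
    orders = [[]]
    for segment in segments:
        orders = [done + list(p) for done in orders for p in permutations(segment)]
    return orders


print(allowed_commit_orders(["write", "rename"]))
print(allowed_commit_orders(["write", BARRIER, "rename", BARRIER]))
```

Without a barrier the model allows the rename to be committed before the write. With the barriers in place, the only order left is write then rename, and nothing in the model says when either has to hit the disk.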
Re:Workaround is disaster for laptops by lacoronus · 2009-03-19 08:58 · Score: 1

Won't work. The problem is that you may get the rename before the write, so if your code crashes between the rename and the fsync, you end up with lost data - the rename goes through, but no data has been written yet.
The solution is to stuff a fsync() after the first fclose().
Re:Workaround is disaster for laptops by wastedlife · 2009-03-19 09:27 · Score: 1

(such as browser cache)
Is there a better example? Browser cache should be volatile.

--
Said, "It's just like dice but it's got more sides And it tells me who lives and who dies"
Re:Workaround is disaster for laptops by ultranova · 2009-03-19 09:56 · Score: 1

Real solution is more like VIM:

Doesn't work. You need another fsync() call between fclose() and rename(). Otherwise you're vulnerable to this exact bug; your system could crash after the rename() has been committed but before the file contents have been written, leaving you with an empty file.
Your code snippet is exactly how the apps do it now (except no fsync(), since retaining the old file is acceptable), and it doesn't work in Ext4.

Making the OS decide which data needs to be fsync'd when is just silly.

This issue has to do with how writes to the disk are ordered relative to each other, not when they are done. The guarantee Ext3 makes (and Ext4 currently doesn't) is that file contents get written before metadata.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Workaround is disaster for laptops by shutdown+-p+now · 2009-03-19 10:06 · Score: 1

Did you miss the original /. story about this? The thing that broke was KDE and Gnome config systems (which both store data in lots of small files).
Note also that "volatile" is irrelevant here. It's okay if the filesystem update doesn't get saved for the browser cache. It's not good if the update quietly corrupts the cache, and I start getting blank squares instead of images.
Re:Workaround is disaster for laptops by RiotingPacifist · 2009-03-19 12:52 · Score: 1

1. you still lose your settings under ext3 which "bad code" does not
2. this relies on sane behavior of the OS, no standards required.
3. http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/

...I think you can see where this is going: if there's a lot of data waiting to be written to disk, and Firefox's (sqlite's) request to flush the data for one file actually sends all that data out, we could be waiting for a while. Worse, all the other applications that are writing data may end up waiting for it to complete as well. In artificial, but not entirely impossible, test conditions, those delays can be 30 seconds or more. That experience, to coin a phrase, kinda sucks...
It's safe to assume your software will be running on systems that run ext3 or Reiser4, so slowing down most people's systems for the benefit of ext4 users will NOT be appreciated!

fsync_then_rename()

This is not needed and a terrible idea. Operations on a file should take place in the order they were committed (I don't care if you commit files out of order, but each file should be dealt with correctly (and preferably minimally)). Or do you want rename_then_fsync, fsync_then_delete ... commands for every single in/plausible situation?

--
IranAir Flight 655 never forget!
Re:Workaround is disaster for laptops by slamb · 2009-03-19 13:39 · Score: 1

The ext4 people have kindly illuminated the problem. Now it is time to define a solution. Maybe it will be some sort of barrier logic, maybe a new kind of sync syscall. But it needs to be done.
+1 for a new fbarrier(2) syscall.
Re:Workaround is disaster for laptops by Zancarius · 2009-03-19 15:27 · Score: 1

I think the point the OP was making was in relation to laptops where, to improve battery performance, the idea is to collate all writes to disk so it's only necessary to spin up the drive once every half hour or so. While battery-backed cache is a good thing, it's out of scope from the original point made.
Yes, I know--if the battery suddenly comes loose or the OS crashes, that's a fair bit of data lost. However, some people actually like squeezing out as much battery time as they can at the cost of potential data loss. Granted, most of the solutions I've seen have been intended instead to allow the laptop to sit with its disk in a powered-down state while syslog is still happily generating events without any clue that no data is actually being written (yet). Beyond that, I've never seen a situation where someone is actually doing real work where the disks are powered down, so I'd imagine the original poster had the syslog issue in mind when he wrote. :)
It may be arguing semantics here, but I felt it was necessary to point this out since you may not have been aware of the power-down issue, syslog, and laptop use case scenarios where writing--but not really writing--is happening in memory with periodic commits to flush the cache. Personally? I just leave the damn thing on. ;)

--
He who has no .plan has small finger. ~ Confucius on UNIX
Re:Workaround is disaster for laptops by RAMMS+EIN · 2009-03-19 21:26 · Score: 1

``the standard sucks, big time''
Yes, I agree. I know it isn't a popular point of view, but filesystems are really databases. All the issues that apply to databases also apply to filesystems. And, as with databases, you can go a long way pretending the issues don't exist. And then get everything messed up when one of the issues does hit you.
The major difference between filesystems and databases is that, for databases, a lot of research has been done about making them efficient _and_ correct. For filesystems, a lot of work has been done on making them efficient. But, as this drama shows, not a lot of work has gone into making them efficient and _correct_. And I don't mean ext4 specifically, I mean the filesystem API itself.
What we are seeing here is basically a transaction. A file is truncated and some data are written to it. We want either both of these things to happen (commit), or neither (roll back). Now, we have two options. We can do this asynchronously, which gives good performance but no guarantee the result will be a desirable one. Or we can do it synchronously, giving good results but bad performance. And this is a simple case. Try to come up with transactions involving multiple files and you will see why the filesystem makes a very _poor_ database.

--
Please correct me if I got my facts wrong.
Re:Workaround is disaster for laptops by wastedlife · 2009-03-20 02:57 · Score: 1

I had misunderstood the previous story due to seeing an overwhelming number of posts stating that you shouldn't expect data to be written successfully unless you explicitly fsync(). Now I understand that a crash between rename() and whenever the filesystem forces a write will leave 2 zero sized files. As far as browser cache goes, I'm surprised the browser would not ignore cache files with a size of zero, but this is still bad filesystem behavior and not application behavior.
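
On the application side, treating a zero-length cache file as a miss is cheap to do. A minimal sketch, assuming a cache that stores one entry per file; read_cache_entry is a hypothetical helper, not any real browser's code:

```python
import os


def read_cache_entry(path):
    """Return the cached bytes, or None if the entry is missing or empty."""
    try:
        if os.path.getsize(path) == 0:
            return None  # zero-length file: treat it as a cache miss
        with open(path, "rb") as entry:
            return entry.read()
    except FileNotFoundError:
        return None  # no entry at all is also a cache miss
```

This only hides the symptom for data that can be fetched again; it does nothing for config files, where an empty file means the settings are gone.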

--
Said, "It's just like dice but it's got more sides And it tells me who lives and who dies"

Dunno by Shivetya · 2009-03-19 06:05 · Score: 4, Insightful

but if you want a write later file system shouldn't it be restricted to hardware that can preserve it?

I understand that doing writes immediately when requested leads to performance degradation but that is why business systems which defer writes to disk only do so when the hardware can guarantee it. In other words, we have a battery backed cache, if the battery is low or nearing end of life the cache is turned off and all writes are made when the data changes.

Trying to make performance gains to overcome limitations of the hardware never wins out.

--
* Winners compare their achievements to their goals, losers compare theirs to that of others.

Re:Dunno by gnasher719 · 2009-03-19 07:15 · Score: 1

I understand that doing writes immediately when requested leads to performance degradation but that is why business systems which defer writes to disk only do so when the hardware can guarantee it. In other words, we have a battery backed cache, if the battery is low or nearing end of life the cache is turned off and all writes are made when the data changes.
You don't even need to do this. The reported problem happened (I think) during some installation of five hundred files. The computer crashed just after the installation was finished, at a time when half the changes were written to disk. If the computer had crashed _before_ the installation started, everything would have been fine. If the computer had delayed _all_ writes by two minutes, and the computer crashed a minute after it said "installation finished", but before anything was actually written to disk, everything would have been fine (Ok, you would have to repeat the installation process, but that is no problem).

What the file system must do is do a bunch of changes together that belong together, and minimize the time interval where a crash would have bad results, preferably to zero.
Re:Dunno by mewsenews · 2009-03-19 07:21 · Score: 1

In other words, we have a battery backed cache, if the battery is low or nearing end of life the cache is turned off and all writes are made when the data changes.
A capacitor would probably have enough juice to do an emergency flush of the cache without wearing out like a battery. I am not an electrical engineer.
Re:Dunno by Anonymous Coward · 2009-03-19 07:33 · Score: 0

this is the problem running enterprise software methods on consumer class PCs. With our AS400 (iSeries) it had dedicated SCSI controllers with battery backed up write cache. The OS "told" the SCSI card what to write and then put the data out there for the controller to decide when the optimal time to do that was.
The problem is "cheap" hardware doesn't do that properly (and your built in "raid" don't count) expecting the OS or file system driver to jump in and make the decision for the card because it doesn't have a controller to do that. The real solution all the enterprise vendors do, is to limit the use of software to only hardware guaranteed to have all the features (i.e. their specific model numbers) For the rest of us, this is an "here be dragons" moment where they could add in all the extra logic to make sure your hardware supports the feature.. but then people would complain they didn't get better performance.
Re:Dunno by MikeBabcock · 2009-03-19 08:31 · Score: 2, Informative

Without write-back (that's delaying writes until later and keeping them in a cache), you lose elevator sorting. No elevator sorting makes heavy drive usage ridiculously slower than with.
You can't re-sort and organize your disk activity without the ability to delay the data in a pool.
The difference between EXT3 and EXT4 is not whether the data gets written immediately -- neither do that. The difference is how long they wait. EXT4 finally gives major power preservation by delaying writes until necessary so my laptop hard drive doesn't spin up for brief moments of unnecessary disk activity all the time.
You want your data written synchronously? Just mount your filesystem with 'sync' and it's all done for you. No problem, no bug.
"mount -o remount,sync /dev/sda1 /" all done.

--
- Michael T. Babcock (Yes, I blog)
Re:Dunno by mmontour · 2009-03-19 08:44 · Score: 1

A capacitor would probably have enough juice to do an emergency flush of the cache without wearing out like a battery. I am not an electrical engineer.
The battery is not there to do an emergency flush. It's to preserve the data in RAM for a couple of days until main power is restored. Once that happens and the disks spin up again the cached data is written out. There has been a huge improvement in capacitors over the years but they still do not have the same energy density as a good battery.
One approach that I think would be viable would be to have a capacitor and some flash memory on the controller card. In the event of a power failure the capacitor would only have to supply power long enough for the controller to copy all of the RAM into flash. I don't know if anybody is producing this yet but it seems like an obvious step now that flash is dirt-cheap.
Re:Dunno by Yokaze · 2009-03-19 10:49 · Score: 1

> The difference is how long they wait.
Actually the problem lies in another difference: Contrary to the default ext3 mode (data=ordered), the ext4 default behaviour allows reordering of the operation, resulting in a rename to be completed before a write, despite being issued in a different order. That leaves you with unwritten files after a crash, instead of just losing the last write.

--
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
Re:Dunno by AigariusDebian · 2009-03-19 10:52 · Score: 1

ext3 is also delaying writes. The bug is that ext4 is not delaying renames to happen after writes. Instead renames happen immediately, and guess what, they spin your hard drive up, then you get to wait 60 seconds until real data starts to be written.
Oh and if you lose power or crash during these 60 seconds, you lose all data - new and old. Oh and your common desktop programs do that cycle several times a minute.
Have fun.
Re:Dunno by mr3038 · 2009-03-20 01:04 · Score: 2, Insightful

ext3 is also delaying writes. The bug is that ext4 is not delaying renames to happen after writes. Instead renames happen immediately, and guess what, they spin your hard drive up, then you get to wait 60 seconds until real data starts to be written. Oh and if you lose power or crash during these 60 seconds, you lose all data - new and old. Oh and your common desktop programs do that cycle several times a minute.

Excuse my language, but why the fuck are those "common desktop programs" writing and renaming files several times a minute? I understand that files are written if I change any settings but this is something different. Perhaps there should be some special filesystem that is designed to freeze the whole system for 1 second for every write() any application does. Such filesystem could be used for application testing. That way it would be immediately obvious if any program is writing too much stuff without a good reason.
EXT4 is doing exactly the right thing because it's never actually writing any of those files to the disk. Because those files are constantly replaced with new versions, there's no point trying to save any unless the application asks for it. To do that, the application should call fsync(). Otherwise, the FS has no obligation to write anything in any given order to the disk until the FS is unmounted. A high performance FS with enough cache will not write anything to disk until fsync() unless the CPU and disk have nothing else to do (and even then, only because it probably improves the performance of a possible following fsync() or unmount in the future).

--
_________________________
Spelling and grammar mistakes left as an exercise for the reader.
Re:Dunno by MikeBabcock · 2009-03-24 02:52 · Score: 1

Kopete for example is moronic and manages to lose ALL my settings on a regular basis if I don't close it out nicely or if I run out of space on /home for it to write new settings.
The old settings were on disk, why are the new ones blank? Yeah. That's without a crash.
Do I blame application authors? You bet I do.

--
- Michael T. Babcock (Yes, I blog)

yeah old data in a crash cool no data not so cool by Anonymous Coward · 2009-03-19 06:10 · Score: 0

That is the issue. Ext3 generally gives me a consistent previous point in time in power failure or crash. I would expect ext4 to too. I used XFS and had a power cable get yanked accidentally in the middle of a project. Everything was gone. I immediately dumped XFS over this.

This is unacceptable behavior. Open files should not be zeroed by design. They should be at the last point in time. I understand HW issues of a power failure, but that is different than it doing it on purpose. Any system dev. that thinks it's acceptable is a fool.

Re:Those who fail to learn the lessons of history. by Samschnooks · 2009-03-19 06:10 · Score: 2, Insightful

Speaking as someone who has developed OS commercial code (OS/2), I always assumed that the person before me understood what they were doing; because, if you didn't, you were spending all your time researching how the 'wheel' was invented. Also, aside from this very rare occurrence, it is pretty arrogant to think that your predecessors are incompetent or, to be generous, ignorant.

This problem is just something that slipped through the cracks and I'm sure the originator of this bug is kicking himself in the ass for being so "stupid".

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 06:12 · Score: 5, Insightful

Rubbish. Sorry, if the syncs were implicit, app developers would just be demanding a way to turn them off most of the time because they were killing performance.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 06:14 · Score: 0

Yes. All new kernel features should do anything it takes to ensure they work with popular applications. If a new kernel feature breaks an application, even if it is because the developers made incorrect assumptions about how things work, then the new kernel feature should be discarded. This is simple common sense, and something that even Microsoft gets right.

Re:LOL: Bug Report by von_rick · 2009-03-19 06:15 · Score: 4, Insightful

And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2.

QFT

The filesystem was first released sometime towards the end of December 2008. The Linux distros that incorporated it, gave it as an option, but the default for /root and /home was always EXT3.

In addition, this problem is not a week old like the article states. People have been discussing this problem on forums ever since mid-January, when the benchmarks for EXT4 were published and several people decided to try it out to see how it fares. I have been using EXT4 for my /root partition since January. Fortunately I haven't had any data loss, but if I do end up losing some data, I'd understand that since I have been using a brand new file-system which has not been thoroughly tested by users, nor has it been used on any servers that I know of.

--

Face your daemons!

Re:LOL: Bug Report by berend+botje · 2009-03-19 06:15 · Score: 1, Interesting

Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

Mr Ts'o is mistaken about this. When he introduces optimisation features that other filesystems (Reiser, for example) have already tried and undone because they don't work, he is not fit to write filing systems. First learn how others did it, then do it better.

With Ext4 now proven unstable, the only viable new filesystem is ZFS. Or just stick with ext3 or UFS.

Re:Those who fail to learn the lessons of history. by dotancohen · 2009-03-19 06:22 · Score: 2, Insightful

Before you get to write any filesystem code, you should have to study how other people have done it...

No. Being innovative means being original, and that means taking new and different paths. Once you have seen somebody else's path, it is difficult to go out on your own original path. That is why there are alpha and beta stages to a project, so that watchful eyes can find the mistakes that you will undoubtedly make, even those that have been made before you.

--
It is dangerous to be right when the government is wrong.

CanSecWest security conference by rs232 · 2009-03-19 06:31 · Score: 0, Offtopic

Pwn2Own 2009 Day 1 - Safari, Internet Explorer, and Firefox Taken Down by Four Zero-Day Exploits

Charlie Miller got the luck of the draw, and had the first time slot for the browser competition. His target- Safari on Mac OS X. Before I could even pull my camera out, it was over within 2 minutes- and Charlie (coincidentally also last year's first winner of the day) is now the proud owner of yet another MacBook, and $5,000 from the Zero Day Initiative.

Next up, Nils. Just Nils- you know, like "Prince" or "Madonna". With a little tweaking, he ran a sleek exploit against IE8, defying Microsoft's latest built in protection technologies- DEP (Data Execution Prevention) as well as ASLR (Address Space Layout Randomization) to take home the Sony Vaio and $5,000 from ZDI.

--
davecb5620@gmail.com

Quick workaround - no patches required by canadiangoose · 2009-03-19 06:32 · Score: 5, Informative

If you mount your ext4 partitions with nodelalloc you should be fine. You will of course no longer benefit from the performance enhancements that delayed allocation bring, but at least you'll have all of your freaking data. I'm running Debian on Linux 2.6.29-rc8-git4, and so far my limited testing has shown this to be very effective.
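
To see which ext4 filesystems are still mounted without that option, one could scan the mount table. A rough sketch, assuming the /proc/mounts field layout (device, mount point, type, options) and assuming the kernel lists nodelalloc among the options when it is set; the function name and the sample text are invented for illustration:

```python
def ext4_mounts_with_delalloc(mounts_text):
    """Return mount points of ext4 filesystems lacking the nodelalloc option."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        device, mount_point, fs_type, options = fields[:4]
        if fs_type == "ext4" and "nodelalloc" not in options.split(","):
            result.append(mount_point)
    return result


# Made-up sample in /proc/mounts format.
sample = (
    "/dev/sda1 / ext4 rw,relatime,nodelalloc 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime 0 0\n"
    "tmpfs /tmp tmpfs rw 0 0\n"
)
print(ext4_mounts_with_delalloc(sample))  # ['/home']
```

On a live Linux system the input would be open("/proc/mounts").read() instead of the sample string.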

--
Never eat more than you can lift -- Miss Piggy

Re:Quick workaround - no patches required by Anonymous Coward · 2009-03-19 06:39 · Score: 0

nodelalloc fantastic. I hope most distros consider this a DEFAULT.
Re:Quick workaround - no patches required by pavon · 2009-03-19 07:27 · Score: 1

Do you know, when you mount with the nodelalloc option, is it really the same as ext3 with the data=ordered option, or is the behavior closer to ext3 with the data=writeback option?

Shoulders of Giants by turgid · 2009-03-19 06:32 · Score: 1

Standing on the shoulders of giants is usually the best way to make progress.

--
Stick Men

Re:Shoulders of Giants by Evanisincontrol · 2009-03-19 06:39 · Score: 2, Insightful

Standing on the shoulders of giants is usually the best way to make progress.
Sure, if the only direction you want to go is the direction that the giant is already moving. Doesn't help you get anywhere else, though.
Re:Shoulders of Giants by turgid · 2009-03-19 06:42 · Score: 1

Learning from the mistakes of others is a good practice no matter what direction you're going in.

--
Stick Men
Re:Shoulders of Giants by TemporalBeing · 2009-03-19 07:12 · Score: 1

Learning from the mistakes of others is a good practice no matter what direction you're going in.
only so long as they apply to the direction you are going. Not all mistakes by others apply to every direction - in fact, most probably don't.

That doesn't mean that "Lessons Learned" aren't useful - just that they're not always applicable.

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
Re:Shoulders of Giants by AigariusDebian · 2009-03-19 07:55 · Score: 1

In this case, all of XFS experience on the topic applies to Ext4 perfectly.
Re:Shoulders of Giants by TemporalBeing · 2009-03-19 08:47 · Score: 1

In this case, all of XFS experience on the topic applies to Ext4 perfectly.
That would only be true if Ext4 was XFS, which it is not. While I have not looked at the code, I would guess that there are a number of design differences between them that may lead to different results even down the same general path.

As I said in my original reply (20760005) - that doesn't mean it's not useful, just that it might not necessarily be applicable. Only those intimately familiar with the project and its source and the decisions (or those who have intimately reviewed all such information, as is possible with FOSS but not proprietary projects) would be able to truly answer the (i) useful and (ii) applicable questions.

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)

POSIX spec is fine, ext4 is flawed by iYk6 · 2009-03-19 06:34 · Score: 2, Informative

Someone above says that the POSIX standard is fine, but that ext4 violates it. Here is his quote:
"When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename()."

It seems that ext4 renames the file first, and then writes the file up to 60 seconds later.

Re:POSIX spec is fine, ext4 is flawed by renoX · 2009-03-19 07:12 · Score: 1

No, POSIX doesn't guarantee that a write reaches disk before you do an fsync; an added rename doesn't change this.
This situation is identical to read & write memory ordering: due to caches, different CPUs may see different values of a variable.
Different architectures have different limitations on how reads and writes may be reordered; with x86 it's not too bad, but with the Alpha, which can truly reorder things a lot, it becomes very difficult to place all the needed memory barriers.
IMHO, there is a performance / usability tradeoff here, and Ext4 shouldn't reorder operations too much: it's too difficult for application programmers to use. If you have 'write then rename' then the write should always be done *before* the rename.

No kidding by Sycraft-fu · 2009-03-19 06:36 · Score: 5, Insightful

All the stuff with Ext4 strikes me as amazingly arrogant, and ignorant of the past. The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing. In the case of a file system, that means that it reliably stores data on the drive. So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec" what matters is that it isn't usable.

I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not. Linux/Ext3 doesn't, Windows/NTFS doesn't, OS-X/HFS+ doesn't, Solaris/ZFS doesn't, etc. Well that tells me something. That says that the way they are doing things isn't a good idea. If it is causing problems AND it is something else nobody else does, then probably you ought not do it.

This is just bad design, in my opinion.

Re:No kidding by mr_mischief · 2009-03-19 07:11 · Score: 2, Insightful

It does store data reliably on the drive that has been properly synchronized by the application's author. This data that is lost is what has been sent to a filehandle but not yet synchronized when the system loses power or crashes.
The FS isn't the problem, but it is exposing problems in applications. If you need your FS to be a safety net for such applications, nobody is taking ext3 away just because ext4 is available. IF you want the higher performance of ext4, buy a damn UPS already.
Re:No kidding by SIR_Taco · 2009-03-19 07:39 · Score: 2, Insightful

what matters is that the damn thing loses data on a regular basis.
I guess I don't really understand what you mean by regular basis, or maybe you just like feeding quarters into the FUD machine. Maybe you live in a place where power failures are very common and/or you like to randomly hit the reset/power buttons. Or maybe you're just not pedaling hard enough to keep your computer from going into black/brown-out status.
The fact is that you will not lose data on a regular basis unless you have severe power problems. This is a performance boost based on the assumption that power outages and bone-headed users are not the common-place. Take that as you will, and I'm not one to suggest that any distro accept this as their default FS, however, it does have its place and many people welcome it.
Just my two cents.

--
I say don't drink and drive, you might spill your drink. Before you get behind the wheel just stop and think.
Re:No kidding by noidentity · 2009-03-19 07:55 · Score: 1

I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.
Well, said design wouldn't be flaky and guarantee data loss in all environments, it'd just be fragile (and probably not give much of a performance benefit over one that flushed more often). Still using your hypothetical example, what if its behavior DID allow drastic performance improvements over anything less fragile? Having it available would offer users another option to choose when the benefits outweighed the fragility. If the developer documented the fragility clearly, what would be the problem? There would be no obligation to use it, just the option. Some filesystems will require different programming strategies, though will make it seem that normal strategies will work even though they will fail subtly.
Re:No kidding by SanityInAnarchy · 2009-03-19 08:00 · Score: 2, Informative

The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing.
Part of usability is performance. This is a significant performance improvement.

So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec" what matters is that it isn't usable.
The real problem here is that application developers were relying on a hack that happened to work on ext3, but not everywhere else.
Let me ask you this -- should XFS change the way it behaves because of this? EXT4 is doing exactly what XFS has done for decades.

I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss.
No, that's actually precisely what the spec says, with one exception: You can guarantee it to be written to disk by calling fsync.

I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not.
Only because you haven't looked.
In fact, there's a mount option to turn this behavior on in ext3.
The "bad design" goes deeper than that.

--
Don't thank God, thank a doctor!
Re:No kidding by JesseMcDonald · 2009-03-19 08:05 · Score: 1

I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless.
Not quite useless; every modern Linux system makes use of a filesystem designed to work in exactly that way. It's known as tmpfs.
More seriously, this is more about users' expectations for their filesystem than it is about applications' expectations regarding kernel APIs. There's no reason that any application should expect data to reach the disk without an fsync(), so from that point-of-view ext4 is compliant with all the relevant APIs. However, given that this is a common use-case and that users expect modern filesystems to magically "do the right thing" and avoid data loss, it makes sense to patch this particular issue by implicitly inserting fsync()s just before rename()s over existing non-empty files are committed to disk. That would eliminate the data loss without seriously impacting performance.

--
"The state is that great fiction by which everyone tries to live at the expense of everyone else." - Bastiat
Re:No kidding by MikeBabcock · 2009-03-19 08:24 · Score: 1

First off, what we're talking about here is dealing with computer _crashes_, not standard operations.
This is not an issue of the file system doing anything wrong, this is a case of the computer hardware or user doing something wrong (like unplugging it instead of shutting it down).
Next, this is an issue that can be entirely addressed properly with a cron job that runs 'sync' every so often to commit data to disk.
Also note, this is moronic -- performance and battery life are what they're trying to address, whether the data needs to be committed immediately or not is something the application is supposed to tell the system.
Do you have any idea how many files my system writes out and subsequently deletes on a regular basis? None of them needed to be committed to disk, ever.

--
- Michael T. Babcock (Yes, I blog)
Re:No kidding by Sique · 2009-03-19 08:30 · Score: 1

All the stuff with Ext4 strikes me as amazingly arrogant, and ignorant of the past. The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing.
That sounds amazingly arrogant to me. There might be people who don't want a production ready product, but rather an experimental prototype for research. It's not purely the fault of ext4 if applications don't synchronise their data.

--
.sig: Sique *sigh*
Re:No kidding by Tacvek · 2009-03-19 08:40 · Score: 4, Informative

I don't think you have it right.
On Ext3 with "data=ordered" (a default mount option), if one writes the file to disk, and then renames the file, ext3 will not allow the rename to take place until after the file has been written to disk.
Therefore if an application that wants to change a file uses the common pattern of writing to a temporary file and then renaming (the renaming is atomic on journaling file systems), if the system crashes at any point, when it reboots the file is guaranteed to be either the old version or the new version.
With Ext4, if you write a file and then rename it, the rename can happen before the write. Thus if the computer crashes between the rename and the write, on reboot the result will be a zero byte file.
The fact that the new version of the file may be lost is not the issue. The issue is that both versions of the file may be lost.
The end result is the write and rename method of ensuring atomic updates to files does not work under Ext4.
A new mount option that forces the rename to come after the data is written to disk is being added. Once that is available, the problem will be gone if you use that mount option. Hopefully it will be made a default mount option.
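For reference, the portable form of the write-then-rename pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from KDE or any other application; the function name `atomic_replace` and the `.tmp-` prefix are made up for the example. It pays for an fsync() on every save, which is exactly the cost several posters in this thread object to.

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the disk
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Best effort: also flush the directory entry (POSIX systems only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    except OSError:
        pass  # some filesystems refuse fsync on a directory
    finally:
        os.close(dir_fd)
```

Dropping the `os.fsync(tmp.fileno())` line gives the faster variant that relied on ext3's data=ordered behaviour.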

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
Re:No kidding by mhall119 · 2009-03-19 08:44 · Score: 1

Just about every filesystem has the same problem as ext4. The difference (as I understand it) isn't that ext4 loses data on a system crash, it's that it writes the metadata before it writes the actual data, and so it might not lose the metadata. On ext3, you wouldn't even have the metadata if your system crashed before it wrote to disk, so instead of empty files you'd have no files. I gather that in the case of KDE, no files would have been better than empty ones.

--
http://www.mhall119.com
Re:No kidding by JumboMessiah · 2009-03-19 09:00 · Score: 2, Insightful

I just posted in the wrong thread. Synopsis:
I made a lot of money back in the 90's repairing NTFS installs. The similarity with it, back then, and EXT4 is they are/were young file systems.
Give Ted and company a break. Let him get the damn thing fixed up (I have plenty of faith in Ted). Hell, I even remember losing an EXT3 file system back when it was fresh out of the gate. And I'm sure there's plenty who could say the same for all those you listed, including ZFS.
And your comment about extended data caching. Is your memory short? Remember "laptop mode", specifically set up this way to keep the hard drive from having to spin up...
Re:No kidding by Tacvek · 2009-03-19 09:03 · Score: 1

Close.
The write-replace idiom is a common way to ensure an atomic update to a file. One writes a new copy of the file to the disk, and then renames it. If the rename is atomic (and since it is metadata it is atomic on a journaling filesystem) then this will ensure the file will have either the old contents or the new contents. However, this does require that the metadata be written after the data. If the metadata is written before the real data, the write-replace idiom fails, and can result in an old file being replaced before the new file is on disk.
So we end up with a zero byte file, when the old file contents would be far preferred.
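The ordering argument can be made concrete with a toy model. The sketch below is purely illustrative and has nothing to do with real ext3 or ext4 internals: it assumes a "disk" made of an inode table and one directory, and enumerates what `config` contains if the machine dies after each operation reaches the disk. The `metadata_first` flag is an exaggerated stand-in for delayed allocation.

```python
def crash_states(ops, metadata_first):
    """Return the on-disk contents of 'config' for every possible crash point."""
    if metadata_first:
        # Toy model of delayed allocation: metadata reaches disk before data.
        order = ([op for op in ops if op[0] != "data"]
                 + [op for op in ops if op[0] == "data"])
    else:
        # Operations reach the disk in the order the application issued them.
        order = list(ops)
    outcomes = []
    for crash_after in range(len(order) + 1):
        inodes = {1: "old"}      # inode number -> contents
        names = {"config": 1}    # directory: name -> inode number
        for op in order[:crash_after]:
            kind = op[0]
            if kind == "create":
                _, name, ino = op
                inodes[ino] = ""
                names[name] = ino
            elif kind == "data":
                _, ino, content = op
                inodes[ino] = content
            elif kind == "rename":
                _, src, dst = op
                names[dst] = names.pop(src)
        outcomes.append(inodes[names["config"]])
    return outcomes

# The write-replace idiom: create a temp file, write it, rename it over the original.
OPS = [
    ("create", "config.tmp", 2),
    ("data", 2, "new"),
    ("rename", "config.tmp", "config"),
]
```

In this model `crash_states(OPS, False)` only ever yields "old" or "new", while `crash_states(OPS, True)` also contains the empty string: the zero byte file.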

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
Re:No kidding by ultranova · 2009-03-19 09:34 · Score: 1

The real problem here is that application developers were relying on a hack that happened to work on ext3, but not everywhere else.

The problem is that there is no way which will work everywhere, except fsync - and guess what fsync does to performance?
POSIX API needs to be expanded to include support for transactions. Until it's done, expect people to rely on hacks, because they have nothing else to rely on.

No, that's actually precisely what the spec says, with one exception: You can guarantee it to be written to disk by calling fsync.

But the application doesn't need that guarantee. It simply needs the guarantee that either the file is updated or left alone. It doesn't need to ensure that the file is indeed written; it simply needs to ensure that after a crash it either gets the intact old version or intact new version.
In fact this is such a useful guarantee that PostgreSQL recently added a similar connection setting, "synchronous_commit", which can be set to false to retain most of the advantages of transactions - visibility and atomicity guarantees - but give up the guarantee that the transaction has been logged into the permanent storage when the "commit" command returns.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:No kidding by grumbel · 2009-03-19 09:40 · Score: 1

First off, what we're talking about here is dealing with computer _crashes_, not standard operations.
Unclean shutdowns are a normal operation of a desktop computer, even Wii, PS3 and Xbox360 crash on a regular basis these days. You can be pretty free from crashes on a server that runs a very limited set of simple applications, but not on a desktop computer that runs the latest 3D drivers and games. The computer doesn't even need to really crash, it's enough when a game in fullscreen becomes unresponsive or the battery runs out of power. If a filesystem can't deal with that in a sane manner, it's broken and has no place on a desktop system.
Re:No kidding by blazerw · 2009-03-19 09:53 · Score: 1, Informative

This is an excellent description of the issue. However, if the application writers, in this instance KDE, had synched after messing with extremely important files then the issue wouldn't occur.
The real issue is this: should the filesystem itself have to figure out whether it's dealing with important files or not? Or should the application tell the filesystem the files are important by forcing the updates to be written? Since the former is impossible, the filesystem would have to treat ALL files as important and thus never be able to do the cool things that Ext4 can do that decrease wear on SSDs, save battery power, save disk space and speed things up.
Re:No kidding by Anonymous Coward · 2009-03-19 12:52 · Score: 0

Sounds like it's a problem with the POSIX spec rather than with the applications.
Re:No kidding by causality · 2009-03-19 14:58 · Score: 1

There's no reason that any application should expect data to reach the disk without an fsync(), so from that point-of-view ext4 is compliant with all the relevant APIs. However, given that this is a common use-case and that users expect modern filesystems to magically "do the right thing" and avoid data loss [ ... ]
To me, the real question is: do users have that expectation because it's reasonable and proper that this is the filesystem's job, or, did the failure of multiple application developers give those users a false impression of what is and is not the correct role of a filesystem? The answer to that question determines what action should be taken to remedy the problem. Reference to the agreed-upon specification is the most unbiased way to answer this question because you are dealing with two groups who know (or certainly should know!) that they need to adhere to it in order to avoid these problems.

It should be answered without looking for excuses or fabricating justifications, especially those based on nebulous and mutable things like "user expectations" (which themselves are often based on convenience, not sound system design). Instead, the most simple reasoning can answer that question. One group, the ext4 developers, did adhere to the standard. Another group, consisting of certain application developers, thought that some parts of it weren't important. It's quite clear that the application developers, despite their almost certainly good intentions, have given users a false impression of what is and is not the proper role of a filesystem.

A lot of people seem to want to avoid that conclusion because it doesn't have a "quick fix." That is, in terms of effort alone, it would be much easier to make one change to one filesystem than it would be to correct each application that violates the specification. That's not a good enough justification for rejecting the simple reasoning that quite clearly determines what should happen next. For matters like this that are within its scope, reason is a precious thing -- you should not throw it away so carelessly. If you are willing to abandon things like principles and reason because the conclusions to which they lead you are not the path of least resistance, then some of your most noble qualities are effectively for sale. That's not a solid foundation on which to build much of anything and operating system components (while a mundane example) are no exception.

You cannot live that way and be free of inner conflict, I guarantee it, because for most people there is always a part of you which knows that this is a mistake. That's what inner conflict is, by the way: it's when part of you is "for" something and another part of you is "against" that same thing. It's a terrible condition, widespread though it may be. You may or may not know yourself well enough to understand the causes of inner conflict, but what I said is true whether or not you can fully see the cause-and-effect.

I've already made other comments in this thread about what happens when people refuse to correct a mistake. It becomes entrenched and it sets an undesirable precedent, one which in this case suggests that such things as standards don't really matter. Let's not go down that path. We've seen entities like Microsoft use non-compliance as a weapon; we call it embrace-and-extend. It's an instrument of discord and they use it when discord and disharmony suits their purposes. Isn't Free Software supposed to be better than that? Isn't it supposed to be free of the profit-based control motives that manifest this behavior? Yes? Then why should we replace those profit-based control motives with can't-admit-being-wrong control motives and believe that this is any sort of improvement? If you agree that there is something wrong with that, and that it's no longer tempting when you call things what they are, then let's count this as a lesson learned, support the community's developers who fix their applications, endure a bit of inconvenience while this happens, and get this over with. The right way.

--
It is a miracle that curiosity survives formal education. - Einstein
Re:No kidding by russotto · 2009-03-19 15:30 · Score: 1

Also note, this is moronic -- performance and battery life are what they're trying to address, whether the data needs to be committed immediately or not is something the application is supposed to tell the system.
The paradigm of writing out a new file and renaming it to the old name has been around for longer than the oh-so-holy POSIX standard. The data doesn't need to be committed immediately; it just needs to be committed in order, before the old file is destroyed. It's just as acceptable, in this scenario, for the file system to delay writing out the rename as it is for it to immediately write out the data. There's no way for the application to tell the system that.
Re:No kidding by AvitarX · 2009-03-19 15:48 · Score: 3, Informative

But if the application syncs the file, the new data is written to disk.
This wastes time and performance, and for most files is un-needed.
There are not only "important" and "unimportant" files, there are also "typical" files.
We don't want to lose them, but who cares if recent changes are lost.
Take for example a KDE config file. I am willing to risk all changes made to it since boot (I generally leave my computer off at night, so this is 12 or so hours). I do not want to lose all of my changes since install (this is 10,000 hours).
The method of writing a temporary file and then renaming prevents the second from happening (in EXT3, XFS now, ReiserFS now, and soon EXT4) while still allowing for very aggressive write caching.
EXT4 currently allows for the second to happen unless a disk write is forced, which prevents both of the scenarios.
The loss of the file already synced to disk potentially years ago is the issue, not the loss of the relatively recent data.
EXT4 has essentially removed the option for having "typical" files, and forces them to be treated as "important".
So everything becomes either "every change forces a write" or "we don't care about this" (a cache, for example). The typical stuff, where each individual change is not so critical (in the rare event of a crash) but it sure is nice to have something, gets elevated to an "important" file that does all of those bad things you describe and eliminates the ability to cache writes.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:No kidding by AvitarX · 2009-03-19 16:09 · Score: 1

But it is the fault of Ext4 that it chooses to write the meta data before the real data.
The issue is not losing the seconds to minutes of data in the cache, it is that with Ext4 you can now all of a sudden lose the months old data in the cache.
The only solution offered (until now) was to work in such a way as to guarantee your data stayed out of the cache.
Ext4 suddenly allows for a crash to lose old data, not just new. Even if all file systems allow for this, the option for more integrity of already written data without having to increase the number of writes is something Linux users have been able to take advantage of for quite some time. And it is not something I expect we would want to give up.
I don't want to have to fsync all over for the sake of not losing every configuration change I made over the last 18 months, and I won't be too upset if I lose all the changes I made over the last 18 minutes.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:No kidding by akozakie · 2009-03-20 02:42 · Score: 1

What problems in applications? As far as I can see, most pro-ext4 comments bash applications for not doing fsync(), calling this a bug in the application. The problem is that if we fix this "bug" in the applications, ext4 becomes useless! If you fsync() everything, write caching doesn't have a chance of speeding things up - and you'd have to fsync() most files...
The point is that there are three types of files, not two as is usually implied in this discussion. Some files are not important, data loss is acceptable, so they can be cached. Some files are important and should be written to disk ASAP - fsync() gives you that. But many files in the real world (I'd risk saying that "most" would be the better word) do not fit in either category. They're not just created, read and deleted - they're mostly modified. The point is that the changes are in the "unimportant" category, the file itself is not. So, if you lose your changes, too bad, you probably lost some time, that's it. But if you lose the file, you lost a LOT of work, or potentially end up with a system that will not even boot.
The proper, but costly solution is a transactional filesystem. Using a database as a workaround is sometimes suggested (KDE should use SQLite for config, etc.), but how would you use it for your normal files? A transactional FS would be great, but unnecessarily slow. Most files don't need something like that.
The workaround is writing to a temporary file then renaming. This works if order of these two operations is enforced. POSIX does not specify clearly a good enough consistency model for an FS. A causal or even FIFO model would suffice in this case. It's so simple to solve - if two operations on the same file by the same process (or by any process) are done in a given sequence, do not reorder them. If this limits write reordering too much for your taste, another approach might be to sync metadata and data changes - don't rename a file until pending writes to it are synced and don't write anything to a file until pending create/rename/whatever is completed (well, this is a bit more complicated than it seems - the directory is a separate file after all). Sure, this isn't in POSIX, but it is a basic usability requirement.
Another way is to extend the API, adding something like a rename_on_sync() or fclose_with_rename() call - that would be the simplest solution to the problem of low priority updates to high priority files, extremely easy to implement in most filesystems (that's how they do the normal renames, so it's just a #define macro), a bit more work in Ext4. If this is implemented, then applications can actually be fixed and everybody's happy. Don't worry about POSIX compatibility - pure POSIX programs can still use fsync, while smarter ones can add a macro to use the new function if available and replace it with a normal rename otherwise, this will work everywhere, never worse than pure POSIX.
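The "use the new call if available, fall back to a normal rename otherwise" idea from the paragraph above might look something like this. Note that `os.rename_on_sync` is HYPOTHETICAL: no such call exists in Python or POSIX, it is only the name floated in the post, and the wrapper name `rename_when_synced` is likewise invented for the sketch.

```python
import os

def rename_when_synced(src, dst):
    """Rename src over dst, preferring an ordered rename if one exists.

    Returns "ordered" or "plain" so the caller can tell which path ran.
    """
    # HYPOTHETICAL API: os.rename_on_sync does not exist today. The lookup
    # only illustrates the feature test the post proposes.
    ordered_rename = getattr(os, "rename_on_sync", None)
    if ordered_rename is not None:
        ordered_rename(src, dst)
        return "ordered"
    # Fallback: the ordinary atomic rename, exactly what applications do now.
    os.replace(src, dst)
    return "plain"
```

On any current system this always takes the fallback branch, so it behaves no worse than the plain rename it replaces.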
So many solutions. Pick one and everybody's happy. Or keep shooting yourself in the foot with the fsync() BS - if application authors start doing this, everything will be slow and your precious advanced write caching will actually be used once in a blue moon.
POSIX is broken in this aspect - there is NO way to implement the behavior that is required for most files (cache writes all you want, but leave the old version until the update's on disk) based on explicit POSIX guarantees. The only choice: forget write caching or pray that your system doesn't crash before the write is performed. Since this is a very common requirement, any aggressively caching FS must offer a way to do this or risk getting abandoned by users after sufficiently many horror stories.
Re:No kidding by JesseMcDonald · 2009-03-20 03:31 · Score: 1

This isn't really about the applications. POSIX doesn't specify what happens in the event of a crash; ext2 was perfectly POSIX-compliant despite the fact that an unclean shutdown could lead to arbitrary data loss, even with fsync(). The entire point of creating ext3, and now ext4, was to improve (among other things) the behavior of the system with respect to system crashes and power outages. The nature of these improvements is specific to the filesystem, and not part of any spec.
Application developers did contribute to the problem, but they are not solely responsible. There is no API available to communicate the behavior they actually wanted -- not an immediate flush to disk, as in fsync(), but rather a delayed flush which preserves both the order of operations and performance. Given their two options of unacceptable performance and an extremely small chance of data loss -- which didn't occur at all on the most common filesystem -- I do not fault their choice. I would, in fact, go so far as to say that it was the right choice under the circumstances. The risk was small, and their goals did not include an absolute aversion to data loss. They did, however, include the desire to maximize performance during normal use.
There are two separate aspects to this issue. On the one hand, some application developers were not doing all they could according to the POSIX APIs to avoid data loss in a portable manner. Whether they should have been doing so is for the application developers to decide in combination with the application's users. There are trade-offs to be made either way. On the other hand, the filesystem developers introduced a change which increased the probability of data loss with respect to the previous version of the filesystem. Ext4 is intended to be an improvement over ext3, and regressions such as this -- even if the system remains POSIX-compliant, even if the applications in question were not written portably -- detract from that goal.

--
"The state is that great fiction by which everyone tries to live at the expense of everyone else." - Bastiat
Re:No kidding by marcosdumay · 2009-03-20 03:50 · Score: 1

"do users have that expectation because it's reasonable and proper that this is the filesystem's job, or, did the failure of multiple application developers give those users a false impression of what is and is not the correct role of a filesystem?"

Well, let's say:
$ cat file1
file1 contents
$ echo "test" > file1.tmp
$ mv file1.tmp file1
Now, there were some data at file1, you moved a file with some data into it, but the power goes off. Do you really expect file1 to be empty now?

--
Rethinking email
Re:No kidding by SanityInAnarchy · 2009-03-20 08:51 · Score: 1

The problem is that there is no way which will work everywhere, except fsync - and guess what fsync does to performance?
That is true.

POSIX API needs to be expanded to include support for transactions. Until it's done, expect people to rely on hacks, because they have nothing else to rely on.
Except they do -- fsync, as above, since they have no way of detecting whether the filesystem behaves properly.

But the application doesn't need that guarantee. It simply needs the guarantee that either the file is updated or left alone.
That is true. But there are also going to be dependencies, and this is where I'm not really sure what to do.
That is: The application might update file a, and then update file b only if file a succeeded. Or a different application might update file b based on information read from file a. This means the OS has to either order all transactions, or somehow detect which transactions depend on others.
What I'm beginning to think might be the best solution is to automatically wrap all FS activity in some ideally-sized transaction, and make sure those transactions are ordered with respect to each other. Can you think of a reason this would perform worse than explicit transactions?

--
Don't thank God, thank a doctor!
Re:No kidding by ultranova · 2009-03-21 10:46 · Score: 1

Except they do -- fsync, as above, since they have no way of detecting whether the filesystem behaves properly.

And as also noted, fsync will bring the whole system to its knees if lots of applications begin using it. The interaction between fsync and GUI is especially nasty - fsync by definition blocks until a (slow) disk write has been done, and GUI requires responding as soon as possible. This pretty much requires a multithreaded program, and those are notoriously difficult to write.
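To illustrate the GUI concern, here is one way the blocking fsync could be pushed off the main thread. It is only a sketch under assumptions: the names `save_in_background` and `_write_and_sync` are invented, and a real toolkit would have to marshal the callback back onto its own event loop, which is not shown.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# One worker keeps saves in the order they were requested.
_saver = ThreadPoolExecutor(max_workers=1)

def _write_and_sync(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())  # blocks this worker thread only
    os.replace(tmp_path, path)
    return path

def save_in_background(path, data, on_done):
    """Queue a durable save and return at once; on_done(path) runs afterwards.

    The callback fires on the worker thread (or right away if the save has
    already finished), not on the caller's event loop.
    """
    future = _saver.submit(_write_and_sync, path, data)
    future.add_done_callback(lambda f: on_done(f.result()))
    return future
```

This keeps the interface responsive, but it does nothing about the total amount of disk traffic that many fsync-happy applications would generate.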

That is: The application might update file a, and then update file b only if file a succeeded. Or a different application might update file b based on information read from file a. This means the OS has to either order all transactions, or somehow detect which transactions depend on others.

This isn't a problem. All you have to do is make sure that transactions are written to permanent storage in the same order they called commit - that is, if transaction A called commit before transaction B, transaction B will not be logged if the transaction A won't, and might be visible if transaction A is.
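The rule stated above amounts to saying that whatever survives a crash is always a prefix of the commit sequence. A toy model (the class name `OrderedLog` and its methods are made up for illustration, and no real database or filesystem journal is this simple):

```python
class OrderedLog:
    """Toy model: transactions become durable strictly in commit order."""

    def __init__(self):
        self.pending = []  # committed but not yet on disk, oldest first
        self.durable = []  # already on disk

    def commit(self, name):
        # Commit returns immediately; durability comes later.
        self.pending.append(name)

    def flush(self, count):
        """Make the `count` oldest pending transactions durable."""
        self.durable.extend(self.pending[:count])
        del self.pending[:count]

    def crash(self):
        """Lose everything that was not yet durable; return what survived."""
        self.pending.clear()
        return list(self.durable)
```

Because `flush` only ever takes from the front of `pending`, transaction B can never be durable unless every transaction committed before it is durable too.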
Simply steal this one from the database guys - they've been working on it for decades.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:No kidding by SanityInAnarchy · 2009-03-21 23:37 · Score: 1

This pretty much requires a multithreaded program, and those are notoriously difficult to write.
Or a multi-process program. Most of what makes threads difficult to write is shared memory, and you really don't need much to be shared to call fsync on a filehandle and then notify the parent process.

All you have to do is make sure that transactions are written to permanent storage in the same order they called commit - that is, if transaction A called commit before transaction B, transaction B will not be logged if the transaction A won't,
That is what I meant by "ordering all transactions" -- order the transactions with respect to each other, no need to order actual writes within them.
However, you'd also want to order them with respect to visibility. You wouldn't want pieces of transaction B to be visible to the process which committed transaction A, and then have transaction B fail.
But for sheer compatibility, can you think of a downside to my suggestion:

What I'm beginning to think might be the best solution is to automatically wrap all FS activity in some ideally-sized transaction, and make sure those transactions are ordered with respect to each other. Can you think of a reason this would perform worse than explicit transactions?
The only downside I can see is that fsync would be equivalent to sync, and thus likely slower.

--
Don't thank God, thank a doctor!
Re:No kidding by mr_mischief · 2009-03-23 04:16 · Score: 1

User data is always important. The application programmer who thinks otherwise is a jackass.
The problem here, as I said in #27299479 to yakovlev, is that people are expecting the atomic rename not to interrupt the non-atomic file write. Non-atomic means specifically that it can be interrupted. That's why fsync() exists in the first place -- to block until the write is finished because the write is not atomic.
The way to handle this is to rename the file first. Then, open a new file with the original file name. Then, write to the file. Then, close the file. If the file data is lost, the old data is in the backup made with the atomic rename. If the new config file is invalid, the program can automatically check for the most recent valid backup. Pruning of old config backups could even be configurable, but one to three is probably enough and most people won't care how many old versions of their config are saved for most programs.
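A rough sketch of that backup-first scheme follows. The function names, the numbered `.1`, `.2`, `.3` suffixes and the `is_valid` callback are all assumptions made for the example; the post does not specify a naming scheme or a validity check.

```python
import os

def save_with_backups(path, data, keep=3):
    """Rotate path -> path.1 -> path.2 ..., then write the new contents."""
    # Shift older backups up by one, overwriting the oldest.
    for n in range(keep - 1, 0, -1):
        older = f"{path}.{n}"
        if os.path.exists(older):
            os.replace(older, f"{path}.{n + 1}")
    if os.path.exists(path):
        os.replace(path, f"{path}.1")  # atomic rename of the current file
    with open(path, "wb") as f:
        f.write(data)

def load_newest_valid(path, is_valid, keep=3):
    """Return the first candidate (newest first) whose contents pass is_valid."""
    candidates = [path] + [f"{path}.{n}" for n in range(1, keep + 1)]
    for candidate in candidates:
        try:
            with open(candidate, "rb") as f:
                content = f.read()
        except FileNotFoundError:
            continue
        if is_valid(content):
            return content
    return None
```

Unlike rename-over-the-original, this leaves a short window in which no file exists under the original name, so readers have to be prepared to fall back to a backup.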

The funny thing is... by Anonymous Coward · 2009-03-19 06:43 · Score: 0

The funny thing is Theodore claimed "all modern filesystems" suffered from this issue, when in reality, ZFS and others do not :-)

Re:The funny thing is... by mr_mischief · 2009-03-19 07:24 · Score: 1

They do. The other ones that force syncs from within the FS code every X seconds just have a very short X second window during which it can happen. The ext4 FS code happens to allow a very long window for implicit synchronization because it allows great performance and any application compliant with the spec is already explicitly asking for a sync when they really need one.
Re:The funny thing is... by mikeee · 2009-03-19 08:08 · Score: 1

I don't get this, though - or at least, I'd think those heuristics could be improved. If recent I/O usage is very light (typical desktop?), why not flush immediately? Delaying enables better throughput (via batching/reordering, I guess?), but if write rate is very low we don't care about throughput anyway, so there's no issue.
Re:The funny thing is... by wastedlife · 2009-03-19 09:35 · Score: 1

If your write rate is low and you are worried about this, use nodelalloc when mounting. On a high-volume server with battery-backed cache and a solid UPS, go for the delayed allocation.

--
Said, "It's just like dice but it's got more sides And it tells me who lives and who dies"
Re:The funny thing is... by ultranova · 2009-03-19 10:09 · Score: 1

They do.

The other ones that force syncs from within the FS code every X seconds just have a very short X second window during which it can happen.

No. The issue isn't how long writes to filesystem are delayed. The issue is that Ext4 allows writes to metadata to happen before writes to actual file contents. All you have to do to fix this is ensure that a rename() will only be committed after any changes to file contents which preceded it. Do that, and there will not be any window, no matter how rarely syncs are forced.

The ext4 FS code happens to allow a very long window for implicit synchronization because it allows great performance and any application compliant with the spec is already explicitly asking for a sync when they really need one.

Forcing every application to call fsync() when all they need is to ensure that the file is either updated or left alone, rather than simply truncated to zero size as Ext4 might do, will lead to the very antithesis of great performance.
This is an Ext4 bug, plain and simple. Ext4 follows POSIX, but POSIX never took recovering from crashes into account, so it is incomplete in this regard.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

Re:Those who fail to learn the lessons of history. by CannonballHead · 2009-03-19 06:44 · Score: 1

Making the same mistakes someone else made is NOT being innovative, it's being stupid or ignorant... or a number of other predicate adjectives.

Innovation is using something in a new way, not making the same mistake in a new way. That's still considered a mistake, and if it can be shown that you should have known about the mistake from someone else making it, you're still "making the same mistake" and not "innovating." Not to say you're not going to make mistakes and not know everything, but it's still a valid criticism.

Re:yeah old data in a crash cool no data not so co by Anonymous Coward · 2009-03-19 06:45 · Score: 0

Any system dev. that thinks it's acceptable is a fool.

Yes, fools are the ones who actually understand the POSIX specification and plan accordingly. Those foolish admins who experience excellent performance and no data-loss. Those fools!

Surely some day they will see the error of their ways by refusing to understand the job for which they are paid! Damn them! Damn their intelligence! Damn their comprehension abilities. Damn them to hell!

Re:LOL: Bug Report by try_anything · 2009-03-19 06:48 · Score: 5, Insightful

This is the problem with new features - the users have problems using them until they fully understand and appreciate the advantages and disadvantages.

Advantages: Filesystem benchmarks improve. Real performance... I guess that improves, too. Does anybody know?

Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

Ext4 might be great for servers (where crucial data is stored in databases, which are presumably written by storage experts who read the Posix spec), but what is the rationale for using it on the desktop? Ext4 has been coming for years, and everyone assumed it was the natural successor to ext3 for *all* contexts where ext3 is used, including desktops. I hope distros don't start using or recommending ext4 by default until they figure out how to configure it for safe usage on the desktop. (That will happen long before the apps are rewritten.) Filesystem benchmarks be damned.

Re:Those who fail to learn the lessons of history. by ChienAndalu · 2009-03-19 06:49 · Score: 1, Informative

As explained in the article - he hasn't made a mistake. The behaviour of ext4 is perfectly compatible with the POSIX standard.

man fsync

Re:LOL: Bug Report by Aphoxema · 2009-03-19 06:52 · Score: 1

If an application decides to check the name of the file system and if the name is "ext4" it erases everything in your home directory, should that be considered a file system bug too?

No, I'd call that malice.

--
"Most people, I think, don't even know what a rootkit is, so why should they care about it?"

Re:yeah old data in a crash cool no data not so co by Anonymous Coward · 2009-03-19 06:57 · Score: 0

Read the comment. If a sys dev believes it is acceptable to go from a system that behaves in a certain way (ext3, where after a "crash" you have either old data or new data, generally speaking) to ext4 (whiz bang new and improved), where you have no data, then yes, they are fools. It is a regression. You want to set it as a laptop mode, fine. Better give warnings, though.

Sorry, not as a default behavior. This is the difference between theory and practice. Also called the REAL WORLD. I can't guarantee that every app works according to spec. Seems that there is some debate over whether POSIX addresses this.

What is at issue is ext3 was very good in this respect and ext4 not so much. This is a step backward for the vast majority of systems. Esp. servers and desktops.

I don't care what the excuse is. If I have a crash or power cable failure etc... I expect the FS hasn't trashed a bunch of open files, at least it's not DESIGNED to.

Re:Those who fail to learn the lessons of history. by Anonymous Coward · 2009-03-19 06:58 · Score: 0

You sir are an idiot.

You need to look at your competition or those you are following and see what they've done so you don't repeat their mistakes. Then you can ask yourself "Do we need to look at that too?" or "Do we need to change that too?"

If they had spent just a couple of hours reviewing the change logs of those file systems, this probably would never have happened, as it might have been fixed long ago along with whatever else is new and extremely immature with EXT4.

Even if you are creating something new, or think you are (EXT4 isn't something new, it's just another file system so people creating file systems need to review the history of all other file systems, whether what you are doing is "new" or not). You still need to look at history. You don't need to go through their code in this case with a fine tooth comb and pick it apart but just reviewing what you or your competition has done or changed in the past will make your product a better product.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 06:58 · Score: 1, Insightful

ZFS isn't all that viable for Linux users. ZFS-FUSE is too slow.

With that said, I think someone should just go ahead and put ZFS in the Linux kernel and release a patch only. This will get around the GPL issues. All it would mean is that you couldn't redistribute a kernel binary or source with ZFS stuff in it. Anyone wanting ZFS would have to patch and compile their own kernel, not that big a deal. If it's internal use only then GPL is compatible with the ZFS license.

Personally I have lost a lot of data with all the ext filesystems (and Reiser3 too). I still use it on OS and boot partitions but all my important big data partitions are XFS. I have run for years on failing hardware with XFS. I have never lost data with XFS except for the sectors that were physically damaged and even then I never lost anything important. XFS has been fairly bulletproof for me, whereas I have lost entire ext2/3 partitions due to corruption that wasn't even a hardware failure.

The odd thing is... by DragonWriter · 2009-03-19 07:00 · Score: 2, Insightful

I'm a hobbyist, and I don't program system level stuff, essentially, at all anymore, but way back when I did do C programming on Linux (~10 years ago), ISTR that this (from Ts'o in TFA) was advice you couldn't go anywhere without getting hit repeatedly over the head with:

if an application wants to ensure that data have actually been written to disk, it must call the function fsync() before closing the file.

Is this really something that is often missed in serious applications?
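
For anyone who hasn't seen it spelled out, here is a minimal sketch in C of what that advice amounts to, assuming a POSIX system. The names save_durably and write_all are made up for illustration and are not from TFA; real code would likely want more careful error reporting.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write()). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: create or truncate `path`, write `data`, and call
 * fsync() before close() so the kernel is asked to push the data to the
 * device. Returns 0 on success, -1 on error. */
int save_durably(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write_all(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {
        int saved = errno;   /* keep the original error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

A call would look like save_durably("settings.conf", "key=value\n"), with a nonzero return meaning the data should not be assumed to be on disk.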

Re:The odd thing is... by Anonymous Coward · 2009-03-19 07:13 · Score: 0

I've worked on so-called "enterprise" applications that were critical to the functioning of multi-billion dollar companies, and YES absolutely I can tell you that C programmers skip this kind of thing all the time. I've seen file I/O's done without error checking, I've seen bizarre recursive TCP select functions that only worked by accident, I've seen it all. The problem here is that programmers are seen by companies as an expense, not an asset, so they are constantly pressured to do their work faster and with less resources.
Re:The odd thing is... by Anonymous Coward · 2009-03-19 07:17 · Score: 0

Is this really something that is often missed in serious applications?
No, it's that ext4 applies it inconsistently.
App writes data to a file, then closes the file, then renames the file to something else. It's reasonable to assume that the data is either written to the disk, or it isn't. In this case, the rename (write) happens, but the actual data write doesn't.
Re:The odd thing is... by TypoNAM · 2009-03-19 07:28 · Score: 1

Yet there's no such thing as fsync() in standard ANSI C. But there is fflush().
A few years back there was a discussion about the confusion on Linux: http://www.linuxquestions.org/questions/programming-9/fflush-and-fsync-378849/

--
This space is not for rent.
Re:The odd thing is... by Crispy+Critters · 2009-03-19 07:44 · Score: 1

"Is this really something that is often missed in serious applications?"
That developers don't always make a distinction between (1) works-for-me reliance on undocumented side effects and (2) reliable, robustly written code? Are you seriously asking this?
Re:The odd thing is... by Cassini2 · 2009-03-19 08:17 · Score: 2, Informative

Calling fsync() excessively completely trashes system performance and usability. Essentially, operating systems have write back caches to speed code execution. fsync() disables the write back cache by writing data out immediately, and making your program wait while the flush happens. Modern computers can do activities that involve rapidly touching hundreds of files per second. Forcing each write to use an fsync() slows things down dramatically, and makes for a poor user experience.
To make matters worse, from a technical point of view, it is necessary for strict POSIX compliance to fsync() the file and then fsync() the containing directory. I have never seen a piece of normal application code that fsync() the containing directory. Even common linux utilities like rsync, and gzip don't use fsync anymore. tar uses fsync in one special case: for file verification before calling ioctl(FDFLUSH). The documentation on tar is instructive:
/* Verifying an archive is meant to check if the physical media got it
   correctly, so try to defeat clever in-memory buffering pertaining to
   this particular media. On Linux, for example, the floppy drive would
   not even be accessed for the whole verification.
   The code was using fsync only when the ioctl is unavailable, but
   Marty Leisner says that the ioctl does not work when not preceded by
   fsync. So, until we know better, or maybe to please Marty, let's do it
   the unbelievable way :-). */
#if HAVE_FSYNC
  fsync (archive);
#endif
#ifdef FDFLUSH
  ioctl (archive, FDFLUSH);
#endif
In general, application writers are interested in making sure the file is readable. Unless you are really determined, and willing to go through the file verification like in the tar command, fsync() does little to guarantee a file will be readable at a later date. Under modern file systems, there are so many reasons why a file may become unreadable, and so few of them are fixed with fsync(), that one has to ask: Why bother with fsync()?
In fact, there are so few good reasons to use fsync(), that many applications have completely given up on fsync(). fsync() is disabled on Apple Macs running OSX. If you run NFS, fsync() will probably flush your data to the network, but not to the hard disk. If you are running a PC with a modern hard drive, the hard drive probably has a write back cache. As such, fsync() doesn't guarantee your data is physically on the disk. fsync() is disabled in laptop mode.
For most applications, using fsync() will only slow down your C code. It is useful for certain applications, like databases. Many other programming languages have no equivalent to fsync(). For most programs, fsync() is an infrequently used call, and is primarily used in special purpose libraries like databases.
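
Since the "fsync() the containing directory" step is the one almost nobody writes, here is a rough sketch of it in C. It assumes Linux, where a directory can be opened read-only and handed to fsync(); sync_directory is a hypothetical name, not something from tar or any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fsync() the directory `dirpath` so that a newly
 * created or renamed entry inside it is itself pushed to stable storage.
 * Returns 0 on success, -1 on error with errno set. */
int sync_directory(const char *dirpath)
{
    assert(dirpath != NULL);

    /* O_DIRECTORY makes open() fail if the path is not a directory. */
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd);
    int saved = errno;   /* close() may clobber errno */
    close(dfd);
    errno = saved;
    return rc;
}
```

The idea is to call this on the parent directory after the file itself has been fsync()ed and closed, so both the data and the name pointing at it have been flushed.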
Re:The odd thing is... by MikeBabcock · 2009-03-19 08:39 · Score: 1

If you read the comments on Slashdot, you'll find that's advice that passes right over the lazy heads of the average lay person here, and a few professional programmers too.

--
- Michael T. Babcock (Yes, I blog)
Re:The odd thing is... by grumbel · 2009-03-19 09:47 · Score: 1

fflush() has nothing to do with fsync(). All that fflush() does is flush the user-space file buffer to the kernel; it does nothing to move the data from the kernel to the actual drive.
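
To make the two layers concrete, a minimal sketch in C, assuming a POSIX system where fileno() gives you the descriptor behind a stdio stream. flush_stream_to_disk is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push a stdio stream's data through both layers.
 * Step 1: fflush() moves the user-space buffer into the kernel.
 * Step 2: fsync() asks the kernel to write its cached copy to the device.
 * Returns 0 on success, -1 on error. */
int flush_stream_to_disk(FILE *fp)
{
    assert(fp != NULL);

    if (fflush(fp) != 0)
        return -1;
    if (fsync(fileno(fp)) != 0)
        return -1;
    return 0;
}
```

In the usual fopen/fwrite/fclose sequence this would be called right before fclose(), and the return value of fclose() should still be checked.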
Re:The odd thing is... by ubernostrum · 2009-03-19 13:24 · Score: 1

Is this really something that is often missed in serious applications?

Setting aside the performance implications of forcing fsync over and over, it's important to beat people over the head with the fact that fsync doesn't guarantee anything like what you think it does.

Bad POSIX by Skapare · 2009-03-19 07:01 · Score: 4, Interesting

Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

If that is true, then to the extent that is true, POSIX is "broken". Related changes to a file system really need to take place in an orderly way. Creating a file, writing its data, and renaming it, are related. Letting the latter change persist while the former change is lost, is just wrong. Does POSIX really require this behavior, or just allow it? If it requires it, then IMHO, POSIX is indeed broken. And if POSIX is broken, then companies like Microsoft are vindicated in their non-conformance.

--
now we need to go OSS in diesel cars

Re:Bad POSIX by mr_mischief · 2009-03-19 07:27 · Score: 1

If the demand that an application ask for an explicit synchronization of the data it is responsible for handling is broken (as you say it is), then Microsoft is vindicated in not following thousands of other points in the POSIX specs? Is that really your position? That if one minor nit can be picked out of thousands of rules in a spec, then throwing the spec wholesale out the window is vindicated?
Remind me never to let you empty the washtub with my baby in it.
Re:Bad POSIX by LWATCDR · 2009-03-19 07:30 · Score: 1

You know, I was one of the people that believed that the EXT4 people were correct. Now with this data I have to say they blew it. Yep, it meets POSIX, but I would say that in this case POSIX isn't good enough. They need to fix EXT4 or we could just use JFS, XFS, or Ext3 until this gets fixed.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Bad POSIX by TheSunborn · 2009-03-19 07:37 · Score: 1

But using JFS, XFS, or Ext3 will most likely just minimize the risk of this problem, not solve it.
I once lost power during bootup with ext3 (Fedora Core 5), and the result was that I had some empty (0-byte) system files, which fucked up my system.
I don't know how difficult it would be to implement, but a solution might be to change POSIX to require flushing data before renaming, IF the target of the rename is an existing file.
That way you get all the nice features of delayed allocation, and the filesystem can still delay the flushing of the data in this case.
It just needs to also delay the writing of the metadata in this special case, where the target of a rename exists.
That way you get the best of both worlds, but I don't know if this can be implemented without major changes.
Re:Bad POSIX by Anonymous Coward · 2009-03-19 08:02 · Score: 1, Insightful

Does POSIX really require this behavior, or just allow it?
Exactly: Can is not Should.
The Internet Protocol standard, by definition, makes no guarantees that packets will be received. Indeed, it explicitly references the fact that packet loss should be expected. But while a router which simply drops all the packets it receives might technically be standard conformant in that respect, only an idiot would think that such behavior is acceptable.
Similarly, most of the error messages and warnings we come to expect from a modern compiler are not required by the C standard. So while a compiler which doesn't give such error messages isn't technically broken, it is reasonable for a user to think it's worthless, and expect to use a compiler which works like all the other ones do.
Standard writers are only human. With enough ingenuity, you can take any standard and produce an implementation which while technically conformant is horribly worthless or even hazardous to the end user.
I'll finish off with a quote from another "standard" (RFC1958): "Be strict when sending and tolerant when receiving." While not part of the POSIX specification, it's a good principle to follow. While it might not technically be the filesystem's fault that the application is not strictly conformant to the POSIX specification, that doesn't mean that the filesystem should shrug its shoulders and say "meh". When possible the filesystem should be reasonably tolerant when receiving errors from applications, so that the system doesn't choke and die unexpectedly.
Standing around with your fingers in your ears singing "la, la, la, we're standard conformant and we can't hear you, la, la, la" is never acceptable behavior.
Re:Bad POSIX by ultranova · 2009-03-19 10:00 · Score: 1

If that is true, then to the extent that is true, POSIX is "broken". Related changes to a file system really need to take place in an orderly way.

In other words, we need a transactional API to the filesystem, one which guarantees atomicity of an entire transaction. It's not going to happen, of course, and we'll have to keep on relying on hacks, but it would sure be nice.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Bad POSIX by spitzak · 2009-03-19 11:24 · Score: 1

It probably should flush data on a rename even if the target of renaming does not exist. This way after a crash the target file if it exists always contains the expected data. I'm pretty certain that if it exists but is zero length, some programs are going to screw up.
Re:Bad POSIX by LWATCDR · 2009-03-21 03:44 · Score: 1

Get a UPS.
Really they are cheap and everybody should be using them.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

Re:If this was a Windows issue by Anarke_Incarnate · 2009-03-19 07:01 · Score: 1

if it was a default FS on the latest version of the $OS_Shipped_On_95_Percent_Of_Desktops and had this bug, sure. If it is a relatively new and untested file system on an OS with choices of stable FS like Reiser, Ext2/3, JFS, XFS, OCFS2, etc, then no, not as big a deal....

Re:Those who fail to learn the lessons of history. by Anonymous Coward · 2009-03-19 07:02 · Score: 0

No. Being innovative means being original, and that means taking new and different paths.

Sounds like you'd fit right in at Microsoft. Ignoring technology "not invented here" isn't innovation, it's reinventing the wheel, aka a stupid waste of time.

Re:LOL: Bug Report by causality · 2009-03-19 07:04 · Score: 3, Interesting

Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

For those of us who are not so familiar with the data loss issues surrounding EXT4, can someone please explain this? The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?" I.e. if I ask OpenOffice to save a file, it should do that the exact same way whether I ask it to save that file to an ext2 partition, an ext3 partition, a reiserfs partition, etc. What would make ext4 an exception? Isn't abstraction of lower-level filesystem details a good thing?

--
It is a miracle that curiosity survives formal education. - Einstein

voting by Skapare · 2009-03-19 07:06 · Score: 3, Funny

So is this why we can't have voting (where correctness is paramount over performance) systems developed on Linux?

--
now we need to go OSS in diesel cars

Re:LOL: Bug Report by larry+bagina · 2009-03-19 07:08 · Score: 2, Informative

My one experience with XFS involved the partition being corrupted beyond recoverability within 15 minutes. Too bad, in theory XFS is great.

Anyhow, ZFS is raid, lvm, and fs rolled up into one, so keeping the patch up to date with linux changes could be a bit of work.

--
Do you even lift?

These aren't the 'roids you're looking for.

Re:LOL: Bug Report by shentino · 2009-03-19 07:08 · Score: 2, Insightful

Ext4 is still alpha-ish, and declared as such.

Any *user* who trusts production data to an experimental filesystem is already too stupid to have the right to gripe about losing said data.

Easier Fix by maz2331 · 2009-03-19 07:11 · Score: 3, Insightful

Why not just make the actual "flushing" process work primarily on memory cache data - including any "renames", "deletes", etc.?

If any "writes" are pending, then the other operations should be done in the strict order in which they were requested. There should be no pattern possible where cache and file metadata can be out of sync with one another.

Re:Easier Fix by citizenr · 2009-03-19 08:00 · Score: 1

exactly, simple solution would be to delay renames, just keep them in memory cache and flush in order they were issued

--
Who logs in to gdm? Not I, said the duck.
Re:Easier Fix by should_be_linear · 2009-03-19 11:36 · Score: 1

I agree, plus I think this is not in POSIX maybe because original POSIX authors considered this obvious enough. POSIX doesn't explicitly request many other obvious things (like: "in case of fsync() user's laser printer should not start burning in flames and his cat should stay alive").

--
839*929

Re:Those who fail to learn the lessons of history. by Mr.+Underbridge · 2009-03-19 07:17 · Score: 1

No. Being innovative means being original, and that means taking new and different paths.

Yeah, but you still have to get on the road so you can blaze your own trail off of it. That means knowing how other people have done things.

Otherwise, how far do you go with this? First principles? Hell, really, the only way to ensure a totally creative being is to have a baby and hand it over to wolves for rearing. You can be sure that that kid's ideas will be totally uncorrupted by the ideas of other humans. Of course, the kid's ideas will also be useless, but that's the price you pay for creativity.

Re:If this was a Windows issue by Extide · 2009-03-19 07:17 · Score: 1

Interestingly enough NTFS is probably one of the best things about Windows. It has most of the modern features, is incredibly resilient, and has existed for a LONG time.

--
Technophile

Explain? What wasn't known? by SIR_Taco · 2009-03-19 07:18 · Score: 1

Ok...
A) Data loss is due to corruption/interruption in the time it takes for the file-system to write pending items to the disk. We know that.
B) The time it takes to write items, that are not specifically (in code) told to write to disk NOW, is longer than in previous incarnations. We know that.
C) The main reason no one complained about this feature in ext3 was that the pending time was about 5secs and often times it was never noticed. We know that.

Honestly, any distro that would make this the default on install may be brain-dead... The average user is more concerned with data retention than performance. However, having a mechanism to scale the pending write times variably is a good option and scalable to anyone's needs (home -> large data centre).

--
I say don't drink and drive, you might spill your drink. Before you get behind the wheel just stop and think.

Re:Those who fail to learn the lessons of history. by Extide · 2009-03-19 07:18 · Score: 1

I like the saying, Working as designed, too bad it's a shitty design. I understand it complies to POSIX but isn't the goal to make something that is perceived as better? In any case I think the semi-arrogance of the authors is the real issue here, not the behavior of the fs.

--
Technophile

The applications are broken, not the FS by k-zed · 2009-03-19 07:19 · Score: 1

So as expected, there is a veritable army of people demanding the old behavior restored; also, most probably a lot of them will "downgrade" or stay with using EXT3.

Of course, the things at fault are really the buggy applications. But even deeper than that, the *paradigm* of having a lot of generated files (that store important user data) that are rewritten unconditionally at each program startup is wrong. What the hell is up with that?

Can't they come up with a method where you rewrite a file only when absolutely necessary? Why must all icon locations, thumbnails and other such GUI desktop bullshit be written and rewritten zillions of times?

Not to mention that EXT3 is just one file system out of many, and arguably not even a very good one. It's rather weird that it was chosen as a default option for so many "popular" distributions (maybe out of some misguided desire to be backwards compatible?). If your application (or again, *paradigm*) works well on only one file system, then it's most probably not the file system's fault.

--
we discovered a new way to think.

Re:The applications are broken, not the FS by grumbel · 2009-03-19 09:20 · Score: 1

Of course, the things at fault are really the buggy applications.
Neither ANSI-C nor ISO-C++ provide any way to "fix" an application, fsync() isn't part of either language.

What the hell is up with that?
That has been the Unix way of doing things for a few decades.
Re:The applications are broken, not the FS by kigrwik · 2009-03-19 10:15 · Score: 1

WTF are you smoking ? Of course it's not "part of either language", it's part of the POSIX C API.
It's a system-level library, if you prefer.

--
-- don't discount flying pigs until you have good air defense
Re:The applications are broken, not the FS by grumbel · 2009-03-19 10:30 · Score: 1

The point is that there are many ANSI-C and ISO-C++ applications out there, if the OS can't execute them properly, that's a problem with the OS, not the applications

Misleading headline; try "Buggy Apps Lose Data" by mkcmkc · 2009-03-19 07:21 · Score: 1

...under ext4.

--
"Not an actor, but he plays one on TV."

Re:LOL: Bug Report by nedlohs · 2009-03-19 07:22 · Score: 1

Actually

Solution: an update to the code to behave as idiot application programmers require with a simple mount option.

Re:Those who fail to learn the lessons of history. by david_thornley · 2009-03-19 07:23 · Score: 1

I hope you're careful in all your C or C++ programming to never run, even for test, a program with behavior undefined according to the appropriate standard. (Something like accidentally getting two modifications of the same thing in without an intervening sequence point, or accidentally dereferencing a null pointer, or missing the new-line character after the last line of your program, or writing an integer literal that doesn't fit in long int (for C++, anyway).)

If not, I hope you don't mind if it emails a nasty letter of resignation to your boss, and your porn collection to your mother. That's perfectly compatible with both the C and C++ standards.

--
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes

Re:LOL: Bug Report by swillden · 2009-03-19 07:24 · Score: 5, Interesting

The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.

Re:LOL: Bug Report by PitaBred · 2009-03-19 07:25 · Score: 2, Informative

Basically, the spec was written one way, but the actual behavior was slightly different. Even though the standard didn't guarantee something to be written, most filesystems did it anyway. When EXT4 didn't write things immediately to improve performance, the applications that depended on filesystems writing data ASAP (even though it wasn't required behavior) started risking data loss in case of a crash and data not being written explicitly.
The mechanism (fsync) has been around for ages, it's just that most apps didn't use it when they should because there wasn't a "need" to until EXT4, and other systems like XFS which are less popular and tend to be run by people who know what behavior to expect.

--
My blog. Good stuff (when I remember to update it). Read it.

Standards? by Crispy+Critters · 2009-03-19 07:29 · Score: 1

Actually, a lot of people would argue that the philosophy of "as long as it works good enough for now for most people, it's perfect" leads to unmaintainable cruft. And some would say that this is the biggest problem with Windows, although I lack any personal knowledge of the matter.

I would be surprised to see an example of anyone criticizing Windows developers for following established standards.

The Definitive Fix by Anonymous Coward · 2009-03-19 07:33 · Score: 0

while : ; do sync ; sleep 5 ; done

Re:Those who fail to learn the lessons of history. by ChienAndalu · 2009-03-19 07:33 · Score: 2, Interesting

Ext4 *is* better, and probably because it benefits from the wiggle room provided by the specifications. The question is if you accept the tradeoff between performance and security. I choose performance, because my system doesn't crash that often.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 07:34 · Score: 1, Informative

My reading is that applications have been relying on an undocumented feature of the old filesystems instead of being implemented in an fs-independent way. Ext4 removed this "feature" and exposed the already existing dependence of these applications. Thus, to be fs independent, applications should call fsync to force data to be physically written to disk.

The problem is they weren't. Instead they are relying on an (undocumented) feature of ext2/3 to do the fsync for them.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 07:35 · Score: 1, Informative

Even though the standard didn't guarantee something to be written, most filesystems did it anyway

No they didn't - ext3 was quite atypical. Even on Windoze, NTFS requires a fsync. (Mind you, vista introduces a transactional API like reiser was on about for linux before he turned all murdery...)

Re:LOL: Bug Report by manifoldronin · 2009-03-19 07:37 · Score: 1

Clue sticks? Why not chairs?

--
Tyranny isn't the worst enemy of a democracy. Cynicism is.

wrong by doug · 2009-03-19 07:38 · Score: 1

Yes. All new kernel features should do anything it takes to ensure they work with popular applications. If a new kernel feature breaks an application, even if it is because the developers made incorrect assumptions about how things work, then the new kernel feature should be discarded. This is simple common sense, and something that even Microsoft gets right.

Then why bother coming up with standards? If Firefox doesn't properly render IE specific pages, is firefox at fault? Or is the webserver that isn't following W3C standards? Sorry, but following the standard always trumps everything else.

I don't particularly like the standard, but I've known about the need to make sure that writes hit the disk for years. And since it is a standard, it is documented. If the KDE folks (and whomever else) don't bother learning how to follow standards, then the egg is on their face.

Channel your indignation to change (ie "fix") the POSIX standard. But as someone else posted, a bunch of people will immediately scream for a way to defer the writes because the way you think it should work is too slow.

- doug

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 07:39 · Score: 1, Insightful

Part of the problem, as I understand it, was that ext3 performed horribly if you did do things according to the spec (i.e. fsync wrote everything pending, not just the data for the file descriptor you gave it).

I think it will be a great idea to use it for desktops as it might force applications to be written correctly, those that are really worried about it can put off upgrading to a new ubuntu until the dust settles.

I don't expect my OS to crash often enough for it to be a concern anyway, and the places where it's really important (document-based apps like emacs/vi/ooffice) had better have been using fsync already.

Losing a max 2 minutes of recent data changes for extra performance only when an App isn't written to spec, I think I can live with that.

fclose? behavior by MetalOne · 2009-03-19 07:40 · Score: 1

If using 'C' IO does
fclose() alone guarantee that your data is written to disk or must one do
fsync() and then fclose().

fsync is not defined for 'C' IO, it is a UNIX system call.
I think most code does
fopen()
fwrite()
fclose()

If this is buggy code, then this must affect about every 'C' program ever written.
If this is about cases where fclose() does not get called because of a crash, then it is definitely an application bug.

Re:fclose? behavior by Cassini2 · 2009-03-19 08:31 · Score: 1

If this is buggy code, then this must affect about every 'C' program ever written.
If this is about cases where fclose() does not get called because of a crash, then it is definitely an application bug.
You are correct on the first statement, and wrong on the second. This bug affects almost every 'C' program ever written. Essentially, POSIX allows for a successful fclose, even if the file has not been written to disk. This permits a file system to implement a write-back cache.
Many UNIX and Linux file systems will completely screw up if the system suddenly crashes before the data has been successfully written to disk. The complaint is that the Ext4 system had a bug that did this in a very egregious way, and this bug would likely cause serious data loss on any system that is not using a UPS. Ext3 was usually mounted with the "data=ordered" option. For most realistic scenarios, Ext3 will give data loss failures that a normal person would expect. Specifically, with Ext3, you might lose your most recent files. With Ext4, the complaint was that you can lose fairly old files. Some of the old UNIX file systems would become unreadable if the system crashed suddenly. The problems with Ext4 are a matter of scale.
To make matters worse, the fsync() remarks are incendiary. They would force modifications to almost every program on Linux. Your fopen, fwrite, fclose example is in almost every C programming textbook in existence. The fact they were made by an individual working on the Ext4 system didn't help things either. Saying "All applications should fsync() the file and the containing directory if they don't want data loss!", when your file system has a data loss bug, creates a sudden and severe reaction ...

POSIX by 200_success · 2009-03-19 07:41 · Score: 3, Insightful

If I had wanted POSIX-compliant behavior, I could have gotten Windows NT! (Windows was just POSIX-compliant enough to be certified, but the POSIX implementation was so half-assed that it was unusable in practice.) Just because Ext4 complies with the minimum requirements of the spec doesn't make it right, especially if it trashes your data.

Re:Data loss? Schmata loss! by Dunbal · 2009-03-19 07:41 · Score: 0, Troll

Many married men would consider this to be a feature, not a bug.

--
Seven puppies were harmed during the making of this post.

A bad design that it is used everywhere by diegocgteleline.es · 2009-03-19 07:46 · Score: 5, Informative

"No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

It turns out that all the modern operating systems work exactly like that. In ALL of them you need to use explicit synchronization (fsync and friends) to get a notification that your data has really been written to disk (and that's all you get, a notification, because the system could oops before fsync finishes). You can also mount your filesystem as "sync", which sucks.
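
One of the "friends", as a rough sketch in C assuming a POSIX system: opening a single file with O_SYNC gets you synchronous writes for that file only, instead of mounting the whole filesystem "sync". append_record_sync is a hypothetical name, and this says nothing about what the drive's own write cache does afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one record to a log opened with O_SYNC, so
 * write() returns only after the kernel reports the data as written.
 * Returns 0 on success, -1 on error. */
int append_record_sync(const char *path, const char *record)
{
    assert(path != NULL && record != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(record);
    ssize_t n = write(fd, record, len);
    int ok = (n == (ssize_t)len);   /* treat a short write as failure */

    if (close(fd) < 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

This trades throughput for a per-write guarantee, which is why it tends to show up in logs and databases rather than in everyday applications.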

Journaling, COW/transaction-based filesystems like ZFS only guarantee the integrity, not that your data is safe. It turns out that Ext3 has the same problem, it's just that the window is smaller (5 seconds). And I wouldn't bet that HFS and ZFS don't have the same problem (btrfs is COW and transaction based, like ZFS, and has the same problem).

Welcome to the real world...

Re:A bad design that it is used everywhere by Tacvek · 2009-03-19 08:52 · Score: 5, Informative

The Ext3 5 seconds thing is true, but that is not the important difference.
On Ext3, with the default mount options, if one writes a file to disk and then renames the file, the write is guaranteed to come before the rename. This can be used to ensure atomic updates to files, by writing a temporary copy of the file with the desired changes, and then renaming the file.
On Ext4, if one writes a file to the disk, and then renames the file, the rename can happen first. The result of this is that it is not possible to ensure atomic updates to files unless one uses fsync between the writing and the renaming. However, that would hurt performance, since fsync will force the file to be committed to disk right now, when all that is really important is that it is committed to disk before the rename is.
Thankfully the Ext4 module will be gaining a new mount option that will ensure that a file is written to disk before the renaming occurs. This mount option should have no real impact on performance, but will ensure the atomic update idiom that works on Ext3 will also work on Ext4.
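
A minimal sketch of the write-then-rename idiom with the fsync in between, in Python for brevity (the thread itself talks about the C calls). The helper name atomic_write and the ".tmp-" prefix are made up for illustration; only os.fsync, os.replace and tempfile.mkstemp are real library calls.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so that a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push the data to the disk
        os.replace(tmp_path, path)  # rename() over the old file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling atomic_write("settings.conf", b"key=value\n") pays the full cost of an fsync on every update, which is exactly the performance trade-off this comment objects to. Note also that mkstemp creates the file with mode 0600, so the replacement does not inherit the old file's permissions.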

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
Re:A bad design that it is used everywhere by Beatles_Rock_Number9 · 2009-03-19 09:27 · Score: 1

On Ext3, with the default mount options, if one writes a file to disk and then renames the file, the write is guaranteed to come before the rename. This can be used to ensure atomic updates to files, by writing a temporary copy of the file with the desired changes, and then renaming the file.
If a programmer is going to depend on this behavior, then his/her application *must* check that the filesystem it is writing to is ext3 and even more importantly that it is a version of ext3 that was tested against when the program was released.
Not checking at least the above two items while assuming that the behavior is constant across all versions of ext3 and/or all filesystems is a BUG() in the program.
Am I incorrect in my description of the issue?
Re:A bad design that it is used everywhere by mbessey · 2009-03-19 10:49 · Score: 2, Insightful

There's a ton of software out there that uses the "write to new file with temporary name, then rename it to the final name" pattern, much of it written before Ext4 (or Ext3, or Ext) was designed, and rather a lot of it written before most of the folks on the Linux Kernel mailing list were even out of elementary school. This is a well-established method for reliably updating files, and it works, or fails gracefully, on almost every filesystem implementation from 1976 to the present day - except for Ext4.
Claiming that otherwise-portable software ought to include Linux-specific (not to mention Ext4-specific!) code to avoid massive data loss seems a bit backward.
Re:A bad design that it is used everywhere by raynet · 2009-03-19 11:38 · Score: 1

Yup, either that or warn the user not to use any other FS than EXT3 and specify the correct mount options to be used.

--
- Raynet --> .
Re:A bad design that it is used everywhere by Yfrwlf · 2009-03-19 21:49 · Score: 1

This mount option should have no real impact on performance, but will ensure the atomic update idiom that works on Ext3 will also work on Ext4.

Performance not being harmed pretty much tells us that it was a bug, or that this new "option" is a vast improvement. In any case, sane defaults should be the standard, and even if there was some small performance hit, should be the default for normal users.

Data centers have redundancy in data and power, normal users don't.

--
Promote true freedom - support standards and interoperability.
Re:A bad design that it is used everywhere by Tacvek · 2009-03-20 04:12 · Score: 1

I lied a bit. It has some theoretical impact on performance, but AFAICT very small. Ext4 should still be higher performance than Ext3, even with this new option on.
The performance with this option is better than that of leaving Ext4 alone, and forcing applications to call fsync().
The performance also still greatly exceeds the performance of turning on data journaling, which would also fix this issue, and ensure that other file writing modifications become atomic.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524

Re:LOL: Bug Report by quickOnTheUptake · 2009-03-19 07:46 · Score: 1

I disagree. The mentality of backwards compatibility, even if the old app doesn't follow spec, is what keeps systems from moving forward. I mean, just think of how much further behind web standards would be if FF, Opera, and Safari thought it was paramount to emulate every quirk of IE6 for the sake of backward compatibility.
The right way to do it is more or less what they are doing, implement the new system to the spec, roll it out as an option or beta, and give all the app developers a chance to realise and correct their mistakes and flawed assumptions before the new tech gets widely adopted.

--
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation

Re:LOL: Bug Report by causality · 2009-03-19 07:46 · Score: 4, Insightful

The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

Thanks for explaining that. In that case, I salute Mr. Tso and others for telling the truth and not caving in to pressure when they are in fact correctly following the specification. Too often people who are correct don't have the fortitude to value that more than immediate convenience, so this is a refreshing thing to see. Perhaps this will become the sort of history with which developers are expected to be familiar.

I imagine it will take a lot of work but at least with Free Software this can be fixed. That's definitely what should happen, anyway. There are sometimes when things just go wrong no matter how correct your effort was; in those cases, it makes sense to just deal with the problem in the most hassle-free manner possible. This, however, is not one of those times. Thinking that you can selectively adhere to a standard and then claim that you are compliant with that standard is just the sort of thing that really should cause problems. Correcting the applications that made faulty assumptions is therefore the right way to deal with this, daunting and inconvenient though that may be.

Removing this delayed-allocation feature from ext4 or placing limits on it that are not required by the POSIX standard is definitely the wrong way to deal with this. To do so would surely invite more of the same. It would only encourage developers to believe that the standards aren't really important, that they'll just be "bailed out" if they fail to implement them. You don't need any sort of programming or system design expertise to understand that, just an understanding of how human beings operate and what they do with precedents that are set.

--
It is a miracle that curiosity survives formal education. - Einstein

Re:LOL: Bug Report by ijakings · 2009-03-19 07:46 · Score: 4, Funny

Microsoft Patent

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 07:48 · Score: 2, Insightful

You have a separate partition for /root ? How large can the home folder of the root user be?

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 07:49 · Score: 0

The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync().

So, in principle, the filesystem could just throw away the data unless the application explicitly calls fsync?
This seems to be a little bit... hmmm... stupid?

And I really don't understand why the data access isn't "virtualized", i.e. data access before commit will access the new data in the write cache and not the old stuff. Yes, yes, commit and blah. But a filesystem is not a database (Microsoft failed with that idea), and you are really weighing data loss/inconsistency (not of the fs, just of the stored data!) on hard disk or power failure against constant data loss in lots of programs, against paranoid system calls which break all power-saving options and kill performance.

Re:Those who fail to learn the lessons of history. by noidentity · 2009-03-19 07:49 · Score: 2, Funny

Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. [...] And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? [...] Those who fail to learn the lessons of [change] history are doomed to repeat it.

They tried to, but history was just a 0-byte file.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 07:52 · Score: 1, Informative

When was this that you tried XFS? Originally it did have problems but it got very stable several years ago.

ZFS is just a filesystem with lots of features. Hell, it runs in userspace via FUSE. There is nothing magical or difficult about it.

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 07:53 · Score: 4, Insightful

1) Modern filesystems are expected to behave better than POSIX demands.

2) POSIX does not cover what should happen in a system crash at all.

3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

4) fsync is not a solution, because it forces the operation to complete *now*, which hurts write performance, cache coherence and laptop battery life, causes excessive SSD wear, and is bad for a bunch of other reasons.

We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

Re:LOL: Bug Report by davester666 · 2009-03-19 07:55 · Score: 1

Using chairs is patented by Microsoft, with Steve Ballmer listed as "inventor".

--
Sleep your way to a whiter smile...date a dentist!

XFS by SanityInAnarchy · 2009-03-19 07:56 · Score: 1

XFS does the exact same thing, for what it's worth.

--
Don't thank God, thank a doctor!

Re:Those who fail to learn the lessons of history. by AigariusDebian · 2009-03-19 07:58 · Score: 2, Insightful

A few percent performance difference will be easily wiped away when the filesystem erases an important file that one time a year when a snowstorm knocks your power out.

Re:LOL: Bug Report by xous · 2009-03-19 08:05 · Score: 1

He probably means /

Re:O rly? by Anonymous Coward · 2009-03-19 08:06 · Score: 0

How much did you pay for your Linux distro?

I've never understood this "it's free so you have no right to complain" bollocks.

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 08:13 · Score: 1

A filesystem could erase files from disk every time read() is done and still be perfectly POSIX compliant (if it put them back on an fsync() call); however, that would also be retarded and an outright disregard for valuable user data.

Same here - the bug shows utter disregard for user data. POSIX compliance here is just as irrelevant as whether the code is indented according to C coding guidelines or not. It is still a regression from ext3 and a data loss under common usage scenarios.

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 08:13 · Score: 2, Insightful

Absolutely correct.
And that's the way it should be done.
Stability by default, increased performance by request.
Let's be realistic, how many applications benefit from this delayed write? Not many, I guess. Now, on the other hand, if you have an extremely I/O heavy app, disable the auto syncs and do it manually.

Re:LOL: Bug Report by MikeBabcock · 2009-03-19 08:15 · Score: 4, Interesting

The POSIX standard is just fine. The problem is application assumptions that aren't up to snuff.

Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, it calls fsync.

If you don't, you risk losing data. Plain and simple.

--
- Michael T. Babcock (Yes, I blog)

Re:LOL: Bug Report by causality · 2009-03-19 08:16 · Score: 4, Interesting

So, in principle, the filesystem could just throw away the data unless the application explicitly calls fsync?
This seems to be a little bit... hmmm... stupid?

From the explanations I received and some reading I've done, I don't think the data is just getting "thrown away" so that isn't really a valid question. The issue seems to be that unless fsync is called, the changes requested by the application may happen in a sequence that is other than what the application programmer expected. The example I saw in this discussion involved first writing data to a file and then renaming it soon afterwards. If I understand this correctly, the application is assuming that the rename cannot possibly happen before the writing of the data is done even though the specification has no such requirement. If the application needs this to happen in the order in which it was requested, it needs to write the data, then call fsync, then rename the file. You could probably fill a library with what I don't know about low-level filesystem details, so please correct me if I have misunderstood this.

The example I found in the Wikipedia entry on ext4 was different. That one involved data loss because the application updates/overwrites an existing file and does not call fsync and then the system crashes. The Wiki article states that this leads to undefined behavior (which, afaik, is correct per the spec). The article also states that a typical result is that the file was set to zero-length in preparation for being overwritten but because of the crash, the new data was never written so it remains zero-length, causing the loss of the old version of the file. Under ext3 you would usually find either the old version of the file or the new version.

What I don't understand, and hope that a more knowledgeable person could explain, is why this can't be done a slightly different way. This is where I can apply reason to come up with something that sounds preferable to me, but I simply don't have the background knowledge of filesystems to understand the "why". If the overwrite of the file is delayed, why isn't the truncation of the file to zero length also delayed? That is, instead of doing it this way:

Step 1: Truncate file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data for performance reasons.
Step 3: After the delay has elapsed, actually write the data to the disk.

Why can't it be done this way instead?

Step 1: Delay the truncation of the file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data.
Step 3: After the delay has elapsed, set the file length to zero and immediately write the new data, as a single operation if that is possible, or as one operation immediately followed by the other.

That way if there is a crash, you'd still get either the old version or the new one and not a zero-length file where data used to be. The only disadvantage I can see is that this might continue to enable developers to make assumptions that are not found in the standard because the buggy behavior ext4 is now exposing may continue to work. If there's no technical reason why it cannot be done that way, perhaps the bad precedent alone is a good reason to either not handle it this way or to change the spec.

--
It is a miracle that curiosity survives formal education. - Einstein

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 08:19 · Score: 1

The right thing is to code a reliable filesystem, despite POSIX not demanding such a feat. POSIX was written in the 80s when system level code was crap. Now we can do better than that.

fsync() is not an answer - we do not care whether the old data or the new data is saved after the crash. Ext4 loses both. If we start using fsync(), that will hurt performance by a ton, cripple write caching, destroy laptop battery life, wear out SSDs much faster than really necessary and cause a bunch of other side effects without any actual gain except hiding this filesystem regression.

It is a filesystem bug - a regression with major data loss in common usage scenarios.

Re:LOL: Bug Report by MikeBabcock · 2009-03-19 08:20 · Score: 2, Informative

You don't risk any data loss, ever, if you shut down your system properly. The system will sync the data to disk as expected and everything will be peachy. You risk data loss if you lose power or otherwise shut down at an inopportune time and the data hasn't been sync'd to disk yet.

That is to say, 99% of people who use their computers properly won't have a problem.

Also note, the software you use should be doing something like:

loop: write some data, write some more data, finish writing data, fsync the data.

The problem here is that the program is doing the "writing" part and because of how caching and delayed writes work (without which, your computer would crawl), the data isn't written to disk _yet_ but will be, eventually.

Old software assumed the data would be written soon. With Ext4 it's possible it won't be written until much, much later, for performance and power benefits.

PS you can just open a terminal window and type "sync" at any time to flush the data to disk on your system. I'm sure someone could write a tray icon that does the same in 30 seconds.
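
A rough sketch of the "write, write, finish, fsync" loop described above, in Python for brevity. The function name write_records is hypothetical; the flush and os.fsync calls are the standard ones.

```python
import os


def write_records(path, records):
    """Write every record, then force the whole file to disk once at the end."""
    with open(path, "wb") as out:
        for record in records:
            out.write(record)    # lands in Python's buffer, then the page cache
        out.flush()              # hand any buffered bytes to the kernel
        os.fsync(out.fileno())   # block until the kernel reports them written
    return os.path.getsize(path)
```

This only addresses durability of the new data. Opening with "wb" truncates the existing file first, so it does not give the old-or-new guarantee that the rename idiom is after. On Unix, os.sync() is the Python counterpart of typing "sync" in a terminal.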

--
- Michael T. Babcock (Yes, I blog)

There are 2 separate issues being confused here by Rob+Y. · 2009-03-19 08:33 · Score: 1

There are 2 'new' things about ext4 that are contributing to data loss:

1. The filesystem doesn't flush to disk as often as ext2/3 did.

2. When it *does* flush, the order of operations is such that it's possible to crash and end up with neither the old nor the new version of a file.

Number 1 is definitely a performance enhancer, and may be okay (certainly would be easy to tune down for desktop systems if you want to).

Number 2 is the real problem. A lot of apps were written assuming that if their data didn't get flushed out, it's no big deal. For example, if you write to a temp file and then rename, you were always guaranteed to either have a good copy of the old file or the new file. That being true, it's not 'wrong' to do your update without an fsync(). If you don't really care about losing the changes (and *only* the changes), then you don't need to force the disk to spin up just to guarantee you don't lose them.

But ext4 is a game changer here. No guarantees at all, and no way to guarantee a good file other than to do an fsync().
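
The ordering point can be illustrated with a toy model. This is not how ext3 or ext4 is implemented; the step names ("create", "data", "rename") and the file names are invented for the sketch. It just enumerates what the name "config" points at if the machine dies after each step has reached the disk.

```python
def crash_states(flush_order):
    """Return the contents seen under the name "config" after a crash at each point."""
    states = []
    for done in range(len(flush_order) + 1):
        names = {"config": "inode_old"}      # directory entries on disk
        contents = {"inode_old": "old"}      # file data per inode on disk
        for step in flush_order[:done]:
            if step == "create":
                names["config.tmp"] = "inode_new"
                contents["inode_new"] = ""   # zero-length, no blocks allocated yet
            elif step == "data":
                contents["inode_new"] = "new"
            elif step == "rename":
                names["config"] = names.pop("config.tmp")
        states.append(contents[names["config"]])
    return states


ordered = crash_states(["create", "data", "rename"])  # data reaches disk before the rename
delayed = crash_states(["create", "rename", "data"])  # rename committed before the data
```

With the data flushed first, every crash point yields either "old" or "new". With the rename flushed first, there is one crash point where "config" is an empty file, which is the zero-length case people are reporting.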

In fact, if order of operations makes it possible to end up with a corrupt file after a crash, it may well be possible that this could happen even if you do an fsync(). The system can still crash in the middle of your fsync(), and if at any time, the filesystem produces something inconsistent on disk, you can end up with a problem. No filesystem should ever be coded that intentionally creates inconsistent data on disk, however transient. Imagine a DBMS doing that.

I don't know how much of a performance gain you get by the order of operations change, but I suspect it's not so much. And if it opens up a window for data corruption, IMHO, it's not worth it.

--
Posted from my Android phone. Oh, I can change this? There, that's better...

Re:There are 2 separate issues being confused here by noidentity · 2009-03-19 10:18 · Score: 1

My point was simply that IF a filesystem traded robustness for performance, we should blame the user for using it inappropriately. On the other hand, if the filesystem simply threw away robustness, without any performance benefit, and when being robust wasn't a significant design challenge, THEN we can rightly criticize the design. It merely being difficult to use (in a way unavoidable without throwing away performance) is not a reason to criticize it, because such criticism treats one possible use of the filesystem as the ONLY possible use. This is what the original poster seemed to be doing, since he was focusing only on its fragility, rather than fragility + no performance benefit (I don't know whether ext4 provides no corresponding benefit, just noting what I didn't see in the original argument). I really dislike arguments that suggest that we should only design for normal uses, and never design more specialized things that trade off generality for other desirable characteristics.

In fact, if order of operations makes it possible to end up with a corrupt file after a crash, it may well be possible that this could happen even if you do an fsync(). The system can still crash in the middle of your fsync(), and if at any time, the filesystem produces something inconsistent on disk, you can end up with a problem. No filesystem should ever be coded that intentionally creates inconsistent data on disk, however transient. Imagine a DBMS doing that.
Yes, not being able to make an entire operation atomic would be a big drawback! It'd be like doing multithreaded programming without locks, because you found that it was very rare for two threads to try to, say, increment a shared integer at the same moment.
Re:There are 2 separate issues being confused here by coryking · 2009-03-19 10:40 · Score: 1

I really dislike arguments that suggest that we should only design for normal uses, and never design more specialized things that trade off generality for other desirable characteristics.
You can go hog-wild with crazy, esoteric features as long as they are off by default. The default configuration of a filesystem should leave those things off. If people want to turn on the crazy shit, they can knock themselves out.

It merely being difficult to use
If the default configuration is "hose my box, but do it really fucking fast", then it isn't just difficult to use, but it is badly designed. There is no excuse for bad design.
Re:There are 2 separate issues being confused here by noidentity · 2009-03-19 11:37 · Score: 1

If the default configuration is "hose my box, but do it really fucking fast", then it isn't just difficult to use, but it is badly designed. There is no excuse for bad design.

So then it gets to the point I touched on above: if the filesystem was meant for specialized uses, then it's the fault of the person putting it out there as useful on normal desktop machines. In this case it sounds like the filesystem was just taking advantage of some aspects of the API that previous ones didn't, ones that broke programs that made bad assumptions. If this is true, I'd side with you that these everyday-program breaking features should have been off by default. The logic is that even though the API allowed them, the fact that they were never utilized basically removed that allowance from the commonly-accepted meaning of the API. The proper approach is then to extend the API in a backwards-compatible way that allows specialized programs to get better performance, but for everyday programs to not need to do anything special to have basic data integrity. The fact that the original API allowed such behavior is a dry technical point.
Re:There are 2 separate issues being confused here by yakovlev · 2009-03-20 00:55 · Score: 1

In fact, if order of operations makes it possible to end up with a corrupt file after a crash, it may well be possible that this could happen even if you do an fsync(). The system can still crash in the middle of your fsync(), and if at any time, the filesystem produces something inconsistent on disk, you can end up with a problem. No filesystem should ever be coded that intentionally creates inconsistent data on disk, however transient. Imagine a DBMS doing that.

Just to be clear, the filesystem will be okay with this new definition IF THEY ADD fsync(), so long as the user does what everyone agrees was best practice and uses the rename method. In that case the rename() will not occur until after the fsync().
Re:There are 2 separate issues being confused here by marcosdumay · 2009-03-20 03:41 · Score: 1

Well, here goes a hint:
If you want to write a niche FS for specialized uses, DON'T name it after the ext family.

--
Rethinking email

Re:LOL: Bug Report by TheSunborn · 2009-03-19 08:34 · Score: 1

Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 08:37 · Score: 0

It is not the application program's job to understand how the hardware works. That is the reason operating systems like Linux (a monolithic kernel) exist. Without an operating system such as Linux (the kernel), OpenBSD, FreeBSD, NT or XNU, every system program and application program would have to be written to control the hardware itself: how to move the disk drive head, how to move data to and from RAM, how to show data on the screen and, overall, how to control the I/O functions.

These days you have very complex operating systems (like the Linux kernel, which is a monolithic operating system) and even more complex software systems (like Ubuntu, Windows 7, Mac OS X) that include the operating system itself (Linux, NT, XNU), lots of middleware (system programs, like the GNU project applications) and then all kinds of other layers and platforms like Qt, Java, GTK+ etc. Most software developers do not need to know anything below the layer they are using to develop their application. A Java developer does not need to know how the operating system controls the hardware, just what Java does, what it is connected to, how it talks to the operating system, and so on.

Ext4 is just one part of the operating system that normal developers should not need to know about.

Re:LOL: Bug Report by ultrabot · 2009-03-19 08:40 · Score: 1

Let's be realistic, how many applications benefit from this delayed write? Not many, I guess.

The guess is wrong.

By delaying writes, ext4 has bigger window to determine how it will allocate stuff.

Also, this seems like just what the doctor ordered for flash drives.

--
Save your wrists today - switch to Dvorak

Schrodinger's innovation by sabt-pestnu · 2009-03-19 08:41 · Score: 1

Innovation is using something in a new way,...

Stop right there. Anything more than that, and you involve observers.

The inventor thinks he is innovating.
The observer, knowing precedents to the invention, does not.

Context is implicit in "a new way". Moving things by wheeled dredges (that is, carts) would have been an innovation to the Incas, but not to the Sumerians.

Saying "should have" is simply a justification for blame.

There is no "should". There is only "do", or "do not".

Re:LOL: Bug Report by zenyu · 2009-03-19 08:47 · Score: 4, Informative

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

Yup, and the problem has existed with KDE startup for years. I remember the startup files getting trashed when Mandrake first came out and I tried KDE for long enough to get hooked, and it's happened to me a few times a year ever since with every filesystem I've used. I just make my own backups of the .kde directory and fix this manually when it happens. I'm pretty good at this restore by now. Hopefully this bug in KDE will get fixed now that it is causing the KDE project such great embarrassment. I had a silent wish Tso would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.

PS I use a lot of experimental graphics drivers for work, hence lockups during startup are common enough that I probably see this KDE bug more than most KDE users. But they really violate every rule of using config files: 1st. open with minimum permission needed, in this case read only, unless a write is absolutely necessary. 2nd. only update a file when it needs updating. 3rd. when updating a config file make a copy, commit it to disk, and then replace the original, making sure file permissions and ownership are unchanged, then commit the rename if necessary.

PS2 Those computer users saying an fsync will kill performance need to get a cluebat applied to them by the nearest programmer. 1st. There will be no fsyncs of config files at startup once the KDE startup is fixed. 2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be unnoticeable when you apply a settings change. 3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.
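
The three config-file rules from the first PS can be sketched roughly as follows, in Python for brevity and assuming a POSIX system. The function name update_config and the ".tmp-" prefix are hypothetical, and this is not KDE's code; only the os, stat and tempfile calls are real.

```python
import os
import stat
import tempfile


def update_config(path, new_text):
    """Rewrite `path` only if its contents change. Returns True if it was rewritten."""
    # Rule 1: open read-only to inspect the current contents.
    with open(path, "r", encoding="utf-8") as current:
        if current.read() == new_text:
            return False  # Rule 2: nothing changed, so leave the file alone.

    original = os.stat(path)
    directory = os.path.dirname(os.path.abspath(path))

    # Rule 3: write a copy, commit it to disk, then replace the original.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Carry over permissions and, where allowed, ownership.
        os.chmod(tmp_path, stat.S_IMODE(original.st_mode))
        try:
            os.chown(tmp_path, original.st_uid, original.st_gid)
        except PermissionError:
            pass  # unprivileged processes cannot give files away
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Commit the rename by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return True
```

The early return is what keeps fsync out of the startup path: if nothing changed at login, nothing is written and nothing is synced.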

Re:LOL: Bug Report by DragonWriter · 2009-03-19 08:47 · Score: 3, Informative

Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

It's a fairly typical way of trying to achieve something loosely approximating transactional behavior with respect to updates to the file in question without relying on transactional file system semantics.

Re:If this was a Windows issue by JumboMessiah · 2009-03-19 08:53 · Score: 1

And I used to make my living repairing NTFS filesystems back in the 90's. Back then the smart folks had their boot drive formatted FAT for a reason. Of course, NTFS is much more mature now than back then. The same argument applies here, EXT4 was just released for general use. We should all give it, and Ted and company, a break.

I've followed Ted's work for many years on the FOSS front. I fully expect him to make EXT4 work best in both scenarios (data safety and performance optimized).

Sounds like they need to talk to Kirk McKusick by argent · 2009-03-19 08:58 · Score: 4, Informative

Kirk McKusick spent a lot of time working out the right order to write metadata and file data in FFS and the resulting file system, FFS with Soft Updates, gets high performance and high reliability... even after a crash.

Re:Sounds like they need to talk to Kirk McKusick by tytso · 2009-03-19 15:09 · Score: 1

Actually FFS with Soft Updates is only about preserving file system metadata so they don't require fsck's. BSD with FFS and Soft Updates still pushes out meta-data after 5 seconds, and data blocks after 30 seconds. Soft Updates only worries about metadata blocks, and not data blocks.
In fact, after a crash with FFS you can sometimes access uninitialized data blocks that contain data from someone else's mail file, or p0rn stash. This was the problem which ext3's data=ordered was trying to solve; unfortunately it does so by making fsync==sync, which also had the unfortunate side effect of making people think that fsync()'s always had to be slow. It doesn't have to be, if it's properly implemented --- but I'll be the first to admit that ext3 didn't do a proper job.
Re:Sounds like they need to talk to Kirk McKusick by Anonymous Coward · 2009-03-19 23:29 · Score: 0

Argent, these are Linux fanbois you're talking to. They don't respect quality design or people that actually know what they are talking about.
Linux followers only respect reinventing the wheel, especially if the reinvention is square.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 08:59 · Score: 0

It's a nice safety net. If you have to reformat your OS partition for some reason, you won't lose all of your stuff in /home.

XFS? by Megatog615 · 2009-03-19 09:00 · Score: 1

Doesn't XFS have a sort of rewinding capability to restart a write operation if a crash occurs? I used XFS on my laptop for about a year, and the laptop had various thermal-related issues, so it locked up a lot. I don't think I ever lost any data with XFS, even with delayed allocation enabled..

Re:XFS? by grumbel · 2009-03-19 09:14 · Score: 1

My experience with XFS (Ubuntu 8.10, default kernel, write cache disabled) was the polar opposite: I lost files on almost every single crash. I consider the filesystem completely unusable for desktop use, where flaky OpenGL drivers are rather common. I never lost files on a crash with any other filesystem, except very early versions of reiserfs years ago, but those issues got fixed and loss was the rare exception, not a common occurrence as with XFS.

I asked my favorite programmer buddy about this... by rickb928 · 2009-03-19 09:04 · Score: 1

... and he had an interesting take on it.

First, he said he got one of his vendor codemonkeys (emphasis on monkey here) to say that he understood why people did what they did: it always annoyed him to have to wait for data to write so his application could get on to the important stuff. His application is an inventory management system that runs on RPG midrange machines.

My buddy would howl at this. Um, excuse me, but the data *is* the important stuff. One of many reasons my bud ended up re-writing much of the canned software he was saddled with a few years ago when he took his current position. Some stuff he just 'tweaks', he says.

And he then related many a story of older systems and newer systems, from PDP-11s through the whole IBM System3x range and E-Series, and the infamous Windows servers he had on those processor cards and all, and the flaky stuff he saw.

He thoroughly understands the temptation to cache writes, and considers it pure poison. He says, "If your data isn't important enough to write out, it isn't important. Send it to /dev/null, that'll improve performance too!"

Of course, /dev/null isn't an option. But he recognizes the OS is not always going to optimize your app.

And he didn't joke much about this EXT4/EXT3 issue. Something about being there before, or something. But he's weirderer than I am anyways.

--
deleting the extra space after periods so i can stay relevant, yeah.

Re:LOL: Bug Report by Sparr0 · 2009-03-19 09:09 · Score: 1

Depending on how much non-system software root has installed in his home directory... pretty large.

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 09:11 · Score: 1

Care to elaborate?
The only reason for this delayed write is the performance gain, unless I'm overlooking something crucial here.
Ok, so how many (especially desktop) apps will be faster by an amount observable by the user? I still think this won't be many.
As to the "this seems to be what the doctor ordered for flash drives". Does this imply the number of reads and writes to the drive is reduced? As far as I know only the number of writes is relevant to the longevity of flash drives, not their timing.

Re:LOL: Bug Report by ultranova · 2009-03-19 09:15 · Score: 4, Insightful

Solution: an update to the code to behave as idiot application programmers require with a simple mount option.

The application programmers aren't at fault here, the POSIX spec is. A filesystem is essentially a hierarchical database, yet POSIX doesn't include a way to make atomic updates to it. The only tool provided is fsync, which kills performance if used. And even with fsync some things - such as rewriting a configuration file - are either outright impossible or complex and fragile.

The real solution is to come up with a transactional API for filesystem. Until that's done, problems like this will persist. Calling fsync - which forces a disk write - or playing around with temporary files isn't reasonable when all you want to do is make sure that the file will be updated properly or left alone.

The alternative is to have every program call fsync constantly, which not only kills performance, but ironically enough also negates some of Ext4's advantages, such as delayed block allocation, since it essentially disables write caching. And it doesn't work if you are doing more complex things, such as, say, mass renaming files in a directory; you have no way of ensuring that either they are all renamed, or none are.

--

Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

Re:LOL: Bug Report by Upsilonish · 2009-03-19 09:17 · Score: 1

Then why is it marked as stable in 2.6.28? Unless there's some strange definition of "stable" they use there...

Re:LOL: Bug Report by ultrabot · 2009-03-19 09:21 · Score: 1

Ok, so how many (especially desktop) apps will be faster by an amount observable by the user? I still think this won't be many.

Depends how much the apps depend on the file system (both reading and writing). Many desktop apps depend on reading a lot, and they benefit from better ordering of the data in file system.

Of course you only need to care about this if you care about file system performance in the first place - I don't think ext3 is going anywhere soon.

As to the "this seems to be what the doctor ordered for flash drives". Does this imply the number of reads and writes to the drive is reduced?

Yes - if you delay the writes more, you have a better idea what actually needs to hit the disk in the end, so you can cut away unnecessary writes. Since you can also combine several writes, you need to erase fewer blocks (though this one is more speculation than actual knowledge).

--
Save your wrists today - switch to Dvorak

Bollocks by Colin+Smith · 2009-03-19 09:30 · Score: 2, Interesting

A filesystem is not a Database Management System. Its purpose is to store files. If you want transactions, use a DBMS. There are plenty out there which use fsync correctly. Try SQLite.

--
Deleted

Re:Bollocks by 21mhz · 2009-03-19 10:07 · Score: 1

If you want transactions, use a DBMS. There are plenty out there which use fsync correctly. Try SQLite.
I feel this way too, but on the other hand I realise that in SQLite fsync is done at every commit, and this means also every database query not in an explicit transaction. Which is what clueless developers tend to use all the time, with the result of SQLite becoming a bane of SSDs.
In quite recent Mozilla versions, every link click caused an fsync. Many people started blaming SQLite and/or the fsync implementation on ext3 for doing things the "slow" way, while the right question actually was: does the user really need browser history always persisted to the last click?
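
To make the autocommit versus explicit transaction point concrete, here is a minimal Python sketch using the standard sqlite3 module. The table name `history` and the function names are made up for illustration, and the claim that every commit is a sync point comes from the comment above, not from anything this code measures.

```python
import sqlite3


def insert_autocommit(path, urls):
    """One implicit transaction, and therefore one commit, per INSERT."""
    # isolation_level=None puts the sqlite3 module into autocommit mode.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS history (url TEXT)")
    for url in urls:
        conn.execute("INSERT INTO history (url) VALUES (?)", (url,))
    conn.close()


def insert_batched(path, urls):
    """All INSERTs inside a single explicit transaction: one commit in total."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS history (url TEXT)")
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO history (url) VALUES (?)",
                     ((url,) for url in urls))
    conn.execute("COMMIT")
    conn.close()


def count_rows(path):
    """Return the number of rows stored in the hypothetical history table."""
    conn = sqlite3.connect(path)
    (n,) = conn.execute("SELECT COUNT(*) FROM history").fetchone()
    conn.close()
    return n
```

Both functions store the same rows; the only difference is how many commits the database has to make durable along the way.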

--
My exception safety is -fno-exceptions.

Re:Bollocks by Anonymous Coward · 2009-03-19 10:12 · Score: 1, Insightful

You are entirely incorrect, sir. A file system IS a Database Management System. It is not a Relational Database Management System, but its sole purpose is to store and access data in an organized fashion, creating, if you will, a base of data.

Re:LOL: Bug Report by blazerw · 2009-03-19 09:31 · Score: 5, Insightful

1) Modern filesystems are expected to behave better than POSIX demands.

2) POSIX does not cover what should happen in a system crash at all.

3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

4) fsync is not a solution, because it forces the operation to complete *now*, which hurts write performance, cache coherence, laptop battery life and SSD wear, among other things.

We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.
2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.
3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.
4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.

We don't need a filesystem that sledgehammers each and every byte of data to the hard drive just in case there is a crash. What we DO need is a filesystem that can flexibly handle important data when told it is important, and less important data very efficiently.

What you are asking is that the filesystem be some kind of sentient all knowing being that can tell when data is important or not and then can write important data immediately and non-important data efficiently. I think that it is a little better to have the application be the one that knows when it's dealing with important data or not.

Re:Those who fail to learn the lessons of history. by tkinnun0 · 2009-03-19 09:34 · Score: 2, Interesting

If the filesystem is a few percents faster but then your disk sits idle half of the time and then you have a crash and lose a file that takes two hours to recreate, have you actually gained any performance?

Re:LOL: Bug Report by bgillespie · 2009-03-19 09:37 · Score: 1

A big limitation with flash drives is that repeated reads and writes to a given sector of storage "wear it out" and cause failure more quickly than the same amount of reads and writes to a given sector on a traditional disk device. The generally accepted solution is to use an algorithmic approach to distribute reads and writes evenly throughout the disk (note: transparent to software developers, at least above the kernel level), and this is what the GP is talking about--more time between physical disk writes means that there is more opportunity for an algorithm to decide intelligently where different pieces of the written data should go.

Re:LOL: Bug Report by Foolhardy · 2009-03-19 09:42 · Score: 2, Insightful

It sounds like the correct solution is for the file system to implement transactional semantics. That is what the applications need and were incidentally getting, despite it not being in the spec.

Why isn't this being considered as the solution? Other major OSes have implemented basic atomic transactions in their filesystems successfully, so why not Linux?

important ext4 workaround by Colin+Smith · 2009-03-19 09:53 · Score: 1

Disadvantages: You risk data loss with 95% of the apps you use on a daily basis.

Wooo. A whole 30 seconds. Horrifying. Here's a workaround for you. Put it in your .bashrc

while true
do
sync
done &

You can't be too careful.

--
Deleted

Re:LOL: Bug Report by von_rick · 2009-03-19 09:55 · Score: 1

That's what I meant :) Thanks for clarifying.

--

Face your daemons!

Funny by coryking · 2009-03-19 10:00 · Score: 1

Its purpose is to store files.

Sounds to me like EXT4 kinda fails at this minor detail, eh?

I mean, if it isn't committing the changes to the file system in the right order, it isn't exactly storing files.

Or do you care to amend your definition to read "Its purpose is to attempt to store files. It does not promise to actually store files"?

Re:LOL: Bug Report by Yfrwlf · 2009-03-19 10:08 · Score: 1

So at what point is "sentience" achieved then? :3

ReiserFS is pretty dead, while btrfs may actually be better though I can't believe it, so perhaps when Tux himself, in his 3rd incarnation, finally is my secretary and is managing my files will my computer finally have "sentience".

--
Promote true freedom - support standards and interoperability.

Re:LOL: Bug Report by grumbel · 2009-03-19 10:08 · Score: 3, Informative

3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.

You completely missed the point. The new data isn't important; it could be lost and nobody would care. The troublesome part is that you lose the old data too. If you lost the last 5 minutes of changes in your KDE config, that would be a non-issue. What happens instead is that you lose not just the last few changes but your complete config: it ends up as 0-byte files, which is a state that the filesystem never had.

Re:LOL: Bug Report by somenickname · 2009-03-19 10:17 · Score: 4, Insightful

fsyncs have nasty side effects other than performance. For example, in Firefox 3, places.sqlite is fsynced after every page is loaded. For a laptop user, this behavior is unacceptable as it prevents the disks from staying spun down (not to mention the infuriating whine it creates to spin the disk up after every or nearly every page load). The use of fsync in Firefox 3 has actually caused some people (myself included) to mount ~/.mozilla as tmpfs and just write a cron job to write changed files back to disk once every 10 minutes.

So, while I'm all for applications using fsync when it's really needed, the last thing I'd like to see is every application on the planet sprinkling its code with fsync "just to be sure".
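
For what it's worth, the "write changed files back" half of that workaround could look roughly like the Python sketch below. The function name and the decision to compare modification times are assumptions for illustration; the poster's actual cron job is not shown in the comment.

```python
import os
import shutil


def copy_changed(src_dir, dst_dir):
    """Copy files that are missing from dst_dir or newer in src_dir.

    Returns the sorted list of relative paths that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            if (not os.path.exists(dst)
                    or os.stat(src).st_mtime_ns > os.stat(dst).st_mtime_ns):
                # copy2 preserves the mtime, so an unchanged file is
                # skipped on the next run.
                shutil.copy2(src, dst)
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    return sorted(copied)
```

Run from cron with the tmpfs mount as `src_dir` and an on-disk directory as `dst_dir`; anything written since the last run is lost if the machine dies, which is exactly the trade the poster is choosing to make.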

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 10:18 · Score: 1

Ok, so how many (especially desktop) apps will be faster by an amount observable by the user? I still think this won't be many.

Depends how much the apps depend on the file system (both reading and writing). Many desktop apps depend on reading a lot, and they benefit from better ordering of the data in file system.

Ok, I haven't read through the source for ext4 so I don't know how this "magic" ordering works, but it sounds a bit like "defragmentation before writing". I'll assume that instead of looking for a large enough consecutive block for the next file, it looks for a large enough block for all the files in the "cache".
While this can obviously be advantageous at times, there is still no way the file system can know in which order the app will read these files on next startup (the app could even store the files in the reverse of the order it reads them, e.g. read global configs first, then specific ones, but write them in reverse order), which could make this ordering detrimental to performance.
But I'm sure you are right about ext3. It won't be going away anytime soon.

As to the "this seems to be what the doctor ordered for flash drives". Does this imply the number of reads and writes to the drive is reduced?

Yes - if you delay the writes more, you have a better idea what actually needs to hit the disk in the end, so you can cut away unnecessary writes.

Right. But that's exactly the problem. After a power outage or system crash these omitted writes can't be recovered.
Just take this idea to its extreme: delay all writes until the system is out of RAM. This maximises the ability to do an intelligent sync with the drive. Nice in theory, but prone to lose ALL data written since the last system start in the event of a crash.

Since you can also combine several writes, you need to erase fewer blocks (though this one is more speculation than actual knowledge).

Might well be true for some flash drives and not for others. And a good reason why wear leveling algorithms should be in the device controller, not in a filesystem driver.

Re:LOL: Bug Report by tabrisnet · 2009-03-19 10:19 · Score: 1

Yes. It provides, according to the old semantics, that the file change will appear atomic. It is in effect a slightly modified Read-Modify-Write.

Looks like this (crappy C pseudocode, so don't get all pedantic):
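fh = open(origpath);            /* read the current contents */
olddata = parsefile(fh);
close(fh);
newdata = modify_file(olddata);
fh = open(tmppath);             /* write the new version to a temp file */
write(fh, newdata);
close(fh);
unlink(origpath);               /* optional: rename() alone replaces an existing target atomically */
rename(tmppath, origpath);      /* swap the new version into place */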

Note that this provides atomicity for the data. Either the file is existent and contains consistent data, or it doesn't exist.

It looks like ext4 broke this by reordering some of the operations. It made the metadata commit before it actually wrote data to the temp file, thus leaving us with a zero-sized file if the system crashes before the data commit.

Re:LOL: Bug Report by Cassini2 · 2009-03-19 10:22 · Score: 4, Interesting

PS2 Those computer users saying an fsync will kill performance need to get cluebat applied to them by the nearest programmer.
1st. There will be no fsyncs of config files at startup once the KDE startup is fixed.

KDE isn't fixed right now. Additionally, KDE is not the only application that generates lots of write activity. I work with real-time systems, and write performance on data collection systems is important.

2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be non-noticeable when you apply a settings change.

I did some benchmarks on the ext3 file system, the ext4 system without the patch, and the ext4 system with the patch. Code that followed the open(), write(), close() sequence was 76% faster than the same code with fsync(). Code that followed the open(), write(), close(), rename() sequence was 28% faster than code that followed the open(), write(), fsync(), close(), rename() sequence. Additionally, the benchmarks were not significantly affected by which file system was used (ext3, ext4, or ext4 patched). You can look up the spreadsheet and the discussion at the launchpad discussion.
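
A comparison of this kind can be approximated with the rough Python harness below. It is not the poster's benchmark: the function name, file names, payload size and iteration count are all made up, and the numbers it prints will depend heavily on the disk, the filesystem and its mount options.

```python
import os
import tempfile
import time


def time_saves(directory, count, payload, use_fsync):
    """Time `count` open/write/[fsync]/close/rename cycles.

    Returns the elapsed wall-clock time in seconds.
    """
    target = os.path.join(directory, "data")
    tmp = target + ".tmp"
    start = time.perf_counter()
    for _ in range(count):
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, payload)
            if use_fsync:
                os.fsync(fd)  # force the data out before the rename
        finally:
            os.close(fd)
        os.rename(tmp, target)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for flag in (False, True):
            elapsed = time_saves(d, 200, b"x" * 4096, flag)
            print("fsync=%s: %.4f s" % (flag, elapsed))
```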

3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.

Major Linux file backup utilities, like tar, gzip, and rsync, don't use fsync as part of normal operations. The only one of the three that uses fsync, tar, only uses it when verifying that data is physically written to disk. In that situation, it writes the data, calls fsync, calls ioctl(FDFLUSH), and then reads the data back. Strictly speaking, that is the only way to make sure the file is written to disk, and is readable.

Finally, as Theodore Ts'o has pointed out, if you really want to make sure the file is saved to disk, you also have to fsync() the directory too. I have never seen anyone do that, as part of a normal file save. Most C programming textbooks simply have fopen, fwrite, fclose as the recommended way to save files. Calling fsync this often is unusual for most C programmers.
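
For reference, syncing the directory is only a few lines. The Python sketch below assumes Linux, where a directory can be opened read-only and passed to fsync(); the helper name is made up, and other platforms may behave differently.

```python
import os


def fsync_directory(path):
    """Flush a directory's entries (for example a completed rename) to disk."""
    # O_DIRECTORY makes the open fail if `path` is not a directory.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

A save that wants the new name to be durable would call this on the containing directory after the rename() has returned.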

I would hate to be in your programming class. You're enforcing programming standards that aren't followed by key Linux utilities, aren't in most textbooks, and aren't portable to non-Linux file systems.

If you require your students to fsync() the file and the directory, as part of a normal assignment, you are requiring them to do things that aren't done by any Linux utility out there. Further, if you are that paranoid, you better follow the example from the tar utility, and after the fsync completes, read all the data back to verify it was successfully written.

Re:LOL: Bug Report by ChaosDiscord · 2009-03-19 10:23 · Score: 2, Insightful

Glossing over some details, what is happening is closer to this:

The goal is to replace config with a new version. The programmer is essentially doing this:

1. Create config.new. (Should be empty, because it's new)
2. Write the new contents into config.new
3. Move config.new onto config

The goal is that when you replace config, you're replacing it with a guaranteed complete version, config.new. Assuming it happens in this order (and that step 3 is atomic; it happens or doesn't, never partially) if you crash midway through, you'll either end up with the old config or the new config, but never a partial config. Unfortunately the operating system tries to speed things up, and for a variety of good reasons delaying step 2 makes sense. Doing so is allowed by the standards specifically for these good reasons. So what actually happens is this:

1. Create config.new. (Should be empty, because it's new)
3. Move config.new onto config
2. Write the new contents into config.new (which is actually config now, so it works)

This works fine... unless something happens between steps 3 and 2. If we stop there, we have a new, empty file in place of "config." With ext4, the window between 3 and 2 could be as long as a minute, a window during which you can lose data.

The correct solution is for the program, not the operating system, to take care with files it cares about:

1. Create config.new. (Should be empty, because it's new)
2a. Write the new contents into config.new
2b. Wait until the contents are on disk. ("fsync")
3. Move config.new onto config

Now it's not possible to move 2a after 3, so you're guaranteed safe behavior. But you lose the speed benefits of reordering. For data you care about, this is a good idea. For data you don't care about (Your web browser cache leaps to mind), it's overkill and makes you slower.
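
Steps 1, 2a, 2b and 3 can be sketched in Python as follows. The function name, the ".new" suffix and the `sync` switch are illustrative choices, not anything KDE or another application is known to use; passing `sync=False` gives the original three-step version.

```python
import os


def replace_file(path, data, sync=True):
    """Write `data` to path + '.new', optionally fsync it, then rename it over path."""
    tmp = path + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)  # step 1
    try:
        view = memoryview(data)
        while view:                       # step 2a: os.write may write only part
            written = os.write(fd, view)
            view = view[written:]
        if sync:
            os.fsync(fd)                  # step 2b: wait until the data is on disk
    finally:
        os.close(fd)
    os.rename(tmp, path)                  # step 3: atomically replace the old file
```

Typical use would be `replace_file("config", b"new settings\n")` for data you care about and `sync=False` for things like a cache, which is the distinction the comment draws.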

ext3 (and the new ext4 option) essentially adds 2b automatically. It's good in that it's safer for everyone involved, but it's bad in that everyone takes a speed hit, even in cases where speed is more important than safety.

--
Search 2010 Gen Con events

Re:LOL: Bug Report by spitzak · 2009-03-19 10:23 · Score: 2, Informative

Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

YES!!!!!!

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 10:26 · Score: 1

Good idea, wrong place.
Stuff like that belongs in the device controller, not the file system.
Or do you really fancy something like this in a filesystem?
... switch(VENDOR_ID){ case VENDOR1: switch(PRODUCT_ID){ case... case... ...

is there nothing to learn from rdbms by Anonymous Coward · 2009-03-19 10:26 · Score: 0

given that they have similar concurrency and parallel access issues - at least the principles?

and why cannot those principles be applied?

and if POSIX does not guarantee data integrity, then maybe it is time for a POSIX1.1 or POSIX++ ?

(retrospective disclaimer: i am not a hacker or file system programmer, but issues seem similar in principle...)

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 10:27 · Score: 0

Just like KDE4 then.

Re:LOL: Bug Report by spitzak · 2009-03-19 10:30 · Score: 4, Insightful

You don't understand the problem.

You are wrong when you say EXT3 has this problem. It does not have it. If the EXT3 system crashes during those 5 seconds, you either get the old file or the new one. For EXT4, if it crashes, you can get a zero-length file, with both the old and new data lost.

The long delay is irrelevant and is confusing people about this bug. In fact the long delay is very nice in EXT4 as it means it is much more efficient and will use less power. I don't really mind if a crash during this time means I lose the new version of a file. But PLEASE don't lose the old one as well!!! That is inexcusable, and I don't care if the delay is .1 second.

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 10:34 · Score: 1

Almost correct, but what actually happens with ext3 is the following:
* 1. Create config.new. (Should be empty, because it's new)
* 2. Write the new contents into config.new (cached)
* 3. Move config.new onto config (cached)

(time passes)

* 3b. Filesystem decides that it is time to commit cache to disk and tries to commit metadata first. All commits are written to a journal
* 2b. Metadata commit is determined to be dependent on file data, so file data is written first.
* 3c. Metadata is written to disk.

If a crash happens at any point before 3c, after crash you get the old file, if after 3c, you get the new file.
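
The difference between the two commit orders can be shown with a toy Python model. This is a deliberately simplified sketch, not how ext3 or ext4 are implemented: the "disk" is two dictionaries, each step is assumed to hit the disk whole, and the inode numbers and step names are invented.

```python
def crash_outcomes(commit_order):
    """Return what 'config' contains after a crash at every possible point."""
    outcomes = []
    for committed in range(len(commit_order) + 1):
        names = {"config": 1}      # directory entries: name -> inode number
        inodes = {1: "old"}        # inode number -> file contents
        for step in commit_order[:committed]:
            if step[0] == "create":
                _, name, ino = step
                names[name] = ino
                inodes[ino] = ""
            elif step[0] == "data":
                _, ino, content = step
                inodes[ino] = content
            elif step[0] == "rename":
                _, src, dst = step
                names[dst] = names.pop(src)
        outcomes.append(inodes[names["config"]])
    return outcomes


CREATE = ("create", "config.new", 2)
DATA = ("data", 2, "new")
RENAME = ("rename", "config.new", "config")

ORDERED = [CREATE, DATA, RENAME]   # data is forced out before the rename commits
DELAYED = [CREATE, RENAME, DATA]   # the rename commits before the data exists

print(crash_outcomes(ORDERED))     # ['old', 'old', 'old', 'new']
print(crash_outcomes(DELAYED))     # ['old', 'old', '', 'new']
```

In the ordered case every crash point leaves either the old or the new contents; in the delayed case there is one crash point where config is empty.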

Re:LOL: Bug Report by spitzak · 2009-03-19 10:36 · Score: 1

Arrgh it is annoying to keep having to fix people here.

KDE is already fixed exactly as you suggest. It writes a temp file and then renames it over the original. The problem with EXT4 is that this produces a totally unexpected result if it crashes (i.e. the result is that the destination is neither the old nor the new file, but garbage).

The people saying "fsync!" do need the cluebat. They are saying this should be done even if the rename is done exactly like you suggest. That is really slow, just like you say.

I do agree KDE should be fixed to not attempt to write any of its files except when they really change. Rewriting all of them on startup is stupid!

You know by coryking · 2009-03-19 10:36 · Score: 1

Ext4 module will be gaining a new mount option that will ensure that a file is written to disk before the renaming occurs

If they cared about data integrity, they'd have a mount option to turn it *on*. Then in the manual, put a nice fat warning about "if you set this flag, there is a chance we will trash your filesystem, but do it really fast :-)".

I judge a program by shit like this. PostgreSQL comes out of the box with all kinds of integrity-improving but performance-hurting options enabled by default, and has nice fat comments next to the lines where you can turn off stuff like fsync (which I'd never consider).

Bottom line, if they want to restore faith in their file system, the default, flag-free option would be the most stable but worst performing. Let us decide when to run the risk of trashing our file system, not find out the defaults sucked after the file system is hosed.

> this space reserved for nitwits who will claim I should have read the docs before doing anything so I'd know why I should always set this "dont-trash-my-filesystem" flag.... piss off, I dont read the manual before mounting a NTFS formatted USB drive, why should I have to read it before mounting your shitty filesystem?

Re:You know by eyegone · 2009-03-19 15:44 · Score: 1

ext3 seems to have done just fine with data=ordered as the default (rather than data=journal).

--
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."

Re:You know by Tacvek · 2009-03-20 04:03 · Score: 1

data=journal would be the worst performance-wise, but is the best safety-wise, as atomic changes would be guaranteed in many cases even if the write-and-replace idiom was not used.
data=ordered ensures that the write-and-replace idiom works atomically.
AFAICT, data=writeback ensures only that the file system data structures remain consistent. It is the highest performance mode. However, in this mode atomic write operations are not possible without explicitly using fsync(), which kills performance.
I believe the Ext4 default is equivalent to data=writeback.
I suspect Ext4 may also have had some equivalent to data=journal, but it is only now getting an equivalent to data=ordered.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 10:38 · Score: 0

It's a strange question coming from someone with @debian.org email address. Are you for real?

Re:LOL: Bug Report by Yokaze · 2009-03-19 10:39 · Score: 1

> Every time the author wants to assure himself that data has been written to the disk, it calls fsync.

The problem is, the application developers are not complaining about lacking the strong guarantee of having the data on the disc, but about losing a weaker consistency guarantee, as Matthew Garrett explained quite aptly.

--
"Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"

Re:LOL: Bug Report by spitzak · 2009-03-19 10:41 · Score: 2, Informative

ARRGH! This has nothing to do with the data being written "soon".

The problem with EXT4 is that people expect the data to be written before the rename!

Fsync() is not the solution. We don't want it written now. It is ok if the data and rename are delayed until next week, as long as the rename happens after the data is in the file!

Re:LOL: Bug Report by grumbel · 2009-03-19 10:41 · Score: 1

I imagine it will take a lot of work but at least with Free Software this can be fixed.

The applications aren't broken; what they do is perfectly normal and taught in pretty much any C programming book out there. Adding fsync() all over the place wouldn't fix anything. For one thing, it would mean inserting platform-specific code into every application that might otherwise be completely portable ANSI C or ISO C++, which would be really ugly. It would also make the filesystem extremely slow, since now everything gets written to disk instantly and can't be cached. If you want to have fast and secure file writing there is only one place where you can fix that, and that is the filesystem.

Posix is just an excuse. by bored · 2009-03-19 10:41 · Score: 1

The problem with using posix for anything is that the specifications are so loose as to be nearly useless. This happened because the specs were written by committees manned by the major UNIX vendors. Those vendors all made sure that the specification covered their implementations' ugly edge cases, so they wouldn't have to update anything in their OS.

In the end, Posix is basically useless to write any kind of application code. If every application developer out there tried to deal with every edge case in the Posix specifications they would never get any real code written*.

On the other hand, kernel and OS developers love Posix because they can write just about anything, and it conforms to the specification. They can write all kinds of broken threading implementations, messed up communication and filesystem crap, and it conforms. When something unexpected happens, it's the application developer's problem because they failed to account for some strange case that doesn't happen in 99% of situations or cannot happen on the platform the application might have originally been written on.

Not only that, but just about any non trivial application ends up writing big chunks of code to deal with huge swaths of platform interactions that aren't covered by posix (or any other standard for example SUS). Everything from playing sounds, configuring a network interface to rewinding a tape.

*Ok, how about an example: Did you know that close(2) can fail? What does this mean for real applications? Well, you have to sit in a loop doing closes on a handle until it returns EBADF. Ok, simple! Now what happens if you have a threaded application where the threads are doing opens/closes? Open and close are marked as thread safe, but because close can fail you need a loop. Now what happens if your loop successfully closes a file handle, and another open somewhere else successfully opens and reuses your file handle? That is right, the close loop will close it, leaving you trying to use a file handle that has actually been closed! What's the solution? Well, now you have to write an open/close wrapper that takes a global lock with pthread_mutex_lock() to ensure that you aren't getting opens/closes running at the same time. This can get worse in some other cases, but I've proven my point. Posix is a rabbit hole.

Re:Posix is just an excuse. by isj · 2009-03-19 13:12 · Score: 1

> Ok, how about an example: Did you know that the close(2) can fail? What does this mean for real applications? Well you have to sit in a loop doing closes on a handle until it returns EBADF
No. You write a loop that terminates when close() succeeds.

Re:Posix is just an excuse. by bored · 2009-03-19 17:24 · Score: 1

No. You write a loop that terminates when close() succeeds.
BTW:
http://www.opengroup.org/onlinepubs/000095399/functions/close.html
"If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to [EINTR] and the state of fildes is unspecified. If an I/O error occurred while reading from or writing to the file system during close(), it may return -1 with errno set to [EIO]; if this error is returned, the state of fildes is unspecified."
You may successfully close the file and never get a successful return from close.

Re:Posix is just an excuse. by bored · 2009-03-19 17:39 · Score: 1

No. You write a loop that terminates when close() succeeds.
And you end up with bugs...
Slashdot appears to have eaten my original comment... I posted the applicable posix section in another comment. The bottom line is you have to check for a failure with EBADF as well, as it's possible you never get a successful close. This actually happened to me a few years ago on a unix which shall remain nameless while accessing NFS-mounted files. The end result was a piece of code very similar to what my original comment indicated. This is because you may have successfully closed the file, but gotten a failing return, followed by an open reusing the FD (see posix open, "Upon successful completion, the function shall open the file and return a non-negative integer representing the lowest numbered unused file descriptor."). That is why you have to lock it: there isn't any guarantee that the FD wasn't reused between a close that returned a failure, and the next successive call to close.
I believe my original point stands, and is further reinforced by your comment which will probably work in 99% of cases, but deadlock if you get a signal during the close, or the close fails due to some other IO error.

Re:Posix is just an excuse. by isj · 2009-03-19 18:48 · Score: 1

Does that OS rhyme with Polaris? It must have been version 2.6, which was notorious for interrupting system calls. Using SA_RESTART in sigaction() and calling fork1() (instead of the implicit abomination forkall()) solved a lot of issues for me.
But yes, your point stands.
I wouldn't loop for EBADF though, because there is nothing safe you can do in multithreaded programs.
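
For anyone following the close() argument, here is a minimal C sketch of the "don't loop" position. It assumes the caller treats the descriptor as gone after any return from close(). That matches what the Linux close(2) man page documents, but it is only one reading of the POSIX text quoted above, where the state is "unspecified". The helper name close_once is made up for illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call close() exactly once and never retry.
 * Returns 0 on success, otherwise the errno value reported by close().
 * *fd is set to -1 in every case, so the caller cannot close the same
 * number again after another thread's open() has reused it. */
int close_once(int *fd)
{
    int saved = 0;

    if (*fd < 0)
        return EBADF;          /* already closed through this helper */
    if (close(*fd) == -1)
        saved = errno;         /* report EINTR/EIO, but do not loop */
    *fd = -1;                  /* assumption: descriptor is gone either way */
    return saved;
}
```

The trade-off is that on a system where a failed close() really does leave the descriptor open, this leaks it. The sketch accepts a possible leak rather than risk closing some other thread's file.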

Heh by coryking · 2009-03-19 10:46 · Score: 1

it makes sense to patch this particular issue by implicitly inserting fsync()s just before rename()s over existing non-empty files are committed to disk

I can already visualize the colorful, profanity laced comments your idea will produce. Something like
/* Bugfix #30534: Workaround for fucking EXT4, who is so fucking stupid it cannot even write out our rename in the proper sequence and might trash the users filesystem. Since the EXT4 guys refuse to fix their shitty ass filesystem, we have to hack around their busted shit by fucking fsyncing any god damn changes we might have done before doing any kind of directory operations. bite me assclowns. */

... but my example comment is probably the PG13 version of what it would really look like. I imagine it will be much worse and probably not safe for even some adults.

Re:Heh by JesseMcDonald · 2009-03-20 03:36 · Score: 1

What are you rambling on about?
Adding fsync()s before rename()s are committed to disk in the ext4 filesystem module, as I proposed, would fix the issue from the application's P.O.V. No workarounds would be required.

--
"The state is that great fiction by which everyone tries to live at the expense of everyone else." - Bastiat

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 10:47 · Score: 0

QFT

What does quantum field theory have to do with it?

Re:LOL: Bug Report by Captain+Segfault · 2009-03-19 10:50 · Score: 1

You don't risk any data loss, ever, if you shut down your system properly.

That's meaningless, in that you can't completely eliminate the risk of a kernel panic or similar bug.

Re:LOL: Bug Report by Schraegstrichpunkt · 2009-03-19 10:51 · Score: 1

The POSIX standard is just fine.

Using POSIX semantics, how does the operating system distinguish between the following two requests?

Replace this file atomically. I don't care when you do it, but make sure I either get the complete old file or the complete new file.
Do the above, NOW.

--
http://outcampaign.org/

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 10:56 · Score: 0

The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

It may also mean that application devs are not testing their software on more than one platform. Just because it works on ext3, doesn't mean it works well on BSD FFS, UFS, ext4, btrfs, ZFS, etc.

Just compiling and running your code on more than one processor family may show bugs in your data structures, etc., running on more than one OS may show you bugs or bad assumptions about APIs and behaviour.

With all the virtualization software out there you don't even need multiple machines anymore.

open; write; fsync; rename by Anonymous Coward · 2009-03-19 11:06 · Score: 0

Step 1: Truncate file length to zero in preparation of overwriting it.
Step 2: Delay the writing of the new data for performance reasons.
Step 3: After the delay has elapsed, actually write the data to the disk.

1. open("myconfig.new", O_WRONLY|O_CREAT|O_TRUNC)
2. write("myconfig.new")
3. fsync("myconfig.new")
4. rename("myconfig.new","myconfig")

What's the big fucking deal? You either get the old data or the new data, and your code will be good on any present or future POSIX system.

Why are you truncating files that you want to keep data on? What happens if you're on a mount point where the "sync" option is on in fstab so all operations are synchronous? Is that in one of your test cases?

This isn't rocket surgery.
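
The four steps above, spelled out as a compilable C sketch. The function name replace_file and the idea of passing the temp name in are assumptions for illustration; the calls themselves (open, write, fsync, close, rename) are the plain POSIX ones. The temp file has to live on the same filesystem as the target, or rename() will fail.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf`,
 * staging the data in `tmp` first. Returns 0 on success, -1 on failure.
 * On failure the existing `path` is left untouched. */
int replace_file(const char *path, const char *tmp,
                 const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {                 /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }
    if (fsync(fd) != 0)               /* data on disk before the rename */
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {     /* atomically swap in the new file */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

Called as replace_file("myconfig", "myconfig.new", data, size). Note this only orders the data ahead of the rename; it does not force the rename itself to disk, which as far as I know would take an fsync() on the containing directory.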

Re:open; write; fsync; rename by fractoid · 2009-03-19 14:43 · Score: 1

The big fucking deal is that when your power goes out / PHB trips over the cable / infant daughter presses the reset button, within a minute or few of point (4), it's already truncated 'myconfig' in preparation of the new data being written, but the new data is still in the write queue and is lost when the power goes off. So myconfig.new is deleted, myconfig is truncated, and you, my anonymous friend, are screwed.

--
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 11:08 · Score: 1

2. This bug does NOT exist in ext3.

Ext3 writes out data before metadata is written (at least in the default mode data=ordered), thus there is no window of opportunity where a crash could cause a data loss on ext3. On ext4 there is a 60 second window of opportunity. Or was, before this bug was fixed by patches pending for 2.6.30.

The new data is NOT important, it can be thrown away in a crash and no one will complain. The problem is that ext4 managed to destroy data that was already on the disk. That is unacceptable.

Re:LOL: Bug Report by Rennt · 2009-03-19 11:08 · Score: 1

The OS partition you are thinking of is 'root' or '/' - not '/root'.

It's an easy mistake to make, and one that Aigarius was having some fun with.

Hypocrisy by kokho · 2009-03-19 11:09 · Score: 0

I'm willing to bet that a ton of people complaining about the standards compliance not being an issue, and "user's needs first..." etc are the same people who rip on Microsoft and IE for not being standards compliant on the web. It's funny how people are so inconsistent with their evangelism.

Re:LOL: Bug Report by AigariusDebian · 2009-03-19 11:09 · Score: 0

FHS specifies to use /usr/local or /opt for that.

Re:LOL: Bug Report by complete+loony · 2009-03-19 11:13 · Score: 1

Personally, I think we need a user level API standard that has those guarantees. This would most likely contain a wrapper around the POSIX compliant API, but may use a different approach for different filesystems that provide different guarantees.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.

Re:LOL: Bug Report by shentino · 2009-03-19 11:15 · Score: 1

Just because it's stable (i.e., compilable, doesn't spew warnings, has survived poking) doesn't mean it's mature enough for major distros to rely upon.

it is experimental at best.

Whether the kernel folks issued the appropriate disclaimers on it or not, it still lies upon the distros not to include code that is unproven/brand new/reasonably suspect.

This code may be stable, but it is definitely green.

quick! by thePowerOfGrayskull · 2009-03-19 11:15 · Score: 1

everybody rehash the exact same arguments as when the original article appeared!

Ah, too late. You already did.

POSIX atomic updates with no atomic file systems? by Anonymous Coward · 2009-03-19 11:17 · Score: 0

The application programmers aren't at fault here, the POSIX spec is. A filesystem is essentially a hierarchical database, yet POSIX doesn't include a way to make atomic updates to it.

The idea of what a file system is or is not has changed over the twenty years that POSIX has been around. The reason there have been no atomic updates with POSIX is that there have been no file systems able to do it, except maybe BSD's LFS, and more recently ZFS.

These use COW, so you always have consistent on-disk structures and so you can be atomic. Ext1, not atomic; ext2, not atomic; ext3 and 4, not atomic; UFS, not atomic; FFS, not atomic; XFS, not atomic; NFS, not atomic.

The reason why POSIX does not specify "atomicity" is that until recently it wasn't available.

Now, with all of these new COW file systems coming up (ZFS, btrfs) we can discuss the possibility of a new POSIX API.

Re:LOL: Bug Report by bgillespie · 2009-03-19 11:26 · Score: 1

Well, I must concede that I've only gotten as far as the POSIX standard in my computer science curriculum, so I'm not as familiar as I could be with system workings at the operating system level. I certainly agree with you that placing hardware specific code in a part of the operating system meant to generalize the algorithmic interaction with mass storage devices makes very little sense.

My understanding is that there is a logical representation of the bytes available on a physical disk (at which level the file system operates), and device drivers and hardware components translate that in some fashion into physical bytes, possibly generating this translation on the fly rather than as a simple bijection of "fixed logical byte maps to fixed physical byte". Wouldn't an algorithm implemented at these lower levels still be able to use the fact that more data is being written at a time to make more intelligent decisions about where to physically place that data?

Re:If this was a Windows issue by internettoughguy · 2009-03-19 11:28 · Score: 1

Instead, because it's an article about Ext4, there's a load of Ext4-bashing and jokes. It would be quite unfair if it was full of Microsoft-bashing.

Re:LOL: Bug Report by Sparr0 · 2009-03-19 11:33 · Score: 2, Informative

No, both of those are, implicitly, expected to be world readable, and at least usually for software that any user can run (to some degree of success). /root is the only place for root to put a local application (or any other files) that he doesn't want a user to be able to see at all.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 11:38 · Score: 0

In which case if you do not want a zero length file then you NEED to sync the file before renaming over the old one or not be dumb and truncate/write to a single file.

POSIX makes no guarantees about file data and metadata being in sync so you need to make sure they are with an fsync() before you commit to using the new file with a rename().

Re:LOL: Bug Report by spitzak · 2009-03-19 11:49 · Score: 3, Interesting

Yes I would like that as well. It would remove the annoying need to figure out a temp filename and to do the rename.

One suggestion was to add a new flag to open. I think it might also work to change O_CREAT|O_TRUNC|O_WRONLY to work this way, as I believe this behavior is exactly what any program using that is assuming.

f = creat(filename) would result in an open file that is completely hidden to any process. Anybody else attempting to open filename will either get the old file or no file. This should be easy to implement as the result should be similar to unlinking an already-opened file.

close(f) would then atomically rename the hidden file to filename. Anything that already has filename open would keep seeing the old file, anything that opens it afterwards will see the new file.

If the program crashes without closing the file then the hidden file goes away with no side effects. It might also be useful to have a call that does this, so a program could abandon a write. Not sure what call to use for that.

Calling fsync(f) would act like close() and force the rename, so after fsync it is exactly like current creat().
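
A rough user-space approximation of the semantics proposed above, as a C sketch. All the names (struct hidden_file, hidden_create, hidden_commit, hidden_abandon) are invented for illustration. The temp file is created with mkstemp() next to the target, so it is not truly hidden, only invisible under the target name. Without kernel support, the crash ordering still depends on the optional fsync() in the commit step.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct hidden_file {
    int  fd;           /* write the new contents through this */
    char tmp[4096];    /* temp name filled in by mkstemp() */
    char dest[4096];   /* final name, untouched until commit */
};

/* Start writing a replacement for `dest`. Returns 0 or -1. */
int hidden_create(struct hidden_file *h, const char *dest)
{
    int n = snprintf(h->dest, sizeof h->dest, "%s", dest);
    if (n < 0 || (size_t)n >= sizeof h->dest)
        return -1;
    n = snprintf(h->tmp, sizeof h->tmp, "%s.XXXXXX", dest);
    if (n < 0 || (size_t)n >= sizeof h->tmp)
        return -1;
    h->fd = mkstemp(h->tmp);
    return h->fd < 0 ? -1 : 0;
}

/* The proposed close(): make the new file visible under `dest`.
 * If `durable` is nonzero, fsync() first so the data precedes the rename. */
int hidden_commit(struct hidden_file *h, int durable)
{
    int rc = 0;

    if (durable && fsync(h->fd) != 0)
        rc = -1;
    if (close(h->fd) != 0)
        rc = -1;
    if (rc == 0 && rename(h->tmp, h->dest) != 0)
        rc = -1;
    if (rc != 0)
        unlink(h->tmp);       /* failed: leave the old file as it was */
    h->fd = -1;
    return rc;
}

/* The "abandon a write" call: throw the new file away. */
void hidden_abandon(struct hidden_file *h)
{
    close(h->fd);
    unlink(h->tmp);
    h->fd = -1;
}
```

One difference from the proposal: if the program crashes before commit, the temp file is left behind here, whereas a kernel implementation could make it vanish like an unlinked open file.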

Re:LOL: Bug Report by ConceptJunkie · 2009-03-19 11:50 · Score: 1

Completely aside from your point, with which I agree, I'd like to mod you up just for spelling "lose" correctly.

--
You are in a maze of twisty little passages, all alike.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 11:54 · Score: 0

And how would one make renaming all files (an arbitrary number) in a directory atomic? POSIX does support an atomic way of making changes. It might be somewhat rudimentary but it does work no matter if you are using local filesystems or something like NFS. The behaviour would actually work very well with NFSv4 with the parallel data storage and separate metadata options. Forcing POSIX to be atomic in the general case would preclude options such as this and drag performance of high-end systems down just because 16-year-old developers can't be arsed to code properly.

Re:LOL: Bug Report by spitzak · 2009-03-19 11:55 · Score: 1

As about 6000 people have tried to point out to all you clueless "fsync!" posters, fsync() will kill performance unacceptably. It forces far more to happen than the programmer wants. We only want the order preserved; it is ok if the data write is delayed for a long time.

The fact that "POSIX allows this" is completely bogus. POSIX allows the file to be deleted when you read() it, as long as it is written back when the disk is unmounted. That does not mean that programs should all call unmount all the time just because such stupid behavior is possible!

Re:LOL: Bug Report by DragonWriter · 2009-03-19 12:06 · Score: 1

Why isn't this being considered as the solution? There are other major OSes have implemented basic atomic transactions in their filesystems successfully, why not Linux?

Who says Linux isn't heading that way? Most existing transactional filesystems that I know of have license issues for inclusion in the Linux kernel, though I think Btrfs is both transactional and currently supported in the kernel, though not yet production-ready and stable.

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 12:20 · Score: 1

One might note that exactly this "fixed logical byte maps to fixed physical byte" mapping isn't a bijection. Even normal hard drives have an amount of reserve blocks, afaik 10-20%. As soon as the drive has trouble reading a block it maps it to a reserve block. This is completely invisible to the OS, so a hard drive which has worn out 5% of its blocks would still appear brand new (though you may see a performance decrease reading a file written to seemingly consecutive blocks).
The device electronics are the ideal place for such wear leveling algorithms. Sure, it's theoretically possible to place those algorithms at driver level. But then every device would require this knowledge for its drivers. Just think of small and embedded devices for a moment to see why that's a bad idea.

Re:LOL: Bug Report by Eskarel · 2009-03-19 12:34 · Score: 3, Insightful

This is actually even stupider for flash drives. There is essentially zero seek time on a flash drive, so, in theory, it shouldn't really matter how much you write at any given time (since the only delay should be how long it takes to actually write the cell).

In addition, presuming reasonable wear algorithms (which should be implemented in the device controller, not in any sort of software), every bit of math I've seen says that for any realistic amount of data writes the flash drives will last substantially longer than any current physical drives (last I saw it was about 30 years if you wrote every sector on the disk once a day, scaling down as writes increase). Even writing 6 times the volume of the drive per day, that's 5 years, which is a fairly long time for consumer grade physical drives, and unlike a physical drive, even if you can't read it, you can write it so you can just clone it over to a new drive.
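
The arithmetic behind those two figures, as a tiny C sketch. The endurance number is back-derived from the comment's own "30 years at one full rewrite per day" (30 x 365 = 10,950 cycles), not taken from any datasheet, and the function name is made up. It assumes ideal wear leveling, so every cell ages at the same rate.

```c
/* Days until every cell has used up its rated erase cycles, assuming
 * perfect wear leveling. Both parameters are illustrative inputs. */
long flash_lifetime_days(long rated_cycles, long full_rewrites_per_day)
{
    return rated_cycles / full_rewrites_per_day;
}
```

With 10,950 cycles, one rewrite per day gives 10,950 days (30 years), and six rewrites per day gives 1,825 days (5 years), matching the numbers in the comment.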

File systems will definitely have to change for flash drives, but delaying writes probably isn't going to be the way to do it, especially since there's no need to do so.

Linus Torvalds: "a spec is close to useless" by Anonymous Coward · 2009-03-19 12:54 · Score: 0

Linux creator Linus Torvalds began the discussion saying, "a 'spec' is close to useless. I have _never_ seen a spec that was both big enough to be useful _and_ accurate. And I have seen _lots_ of total crap work that was based on specs. It's _the_ single worst way to write software, because it by definition means that the software was written to match theory, not reality."

http://kerneltrap.org/node/5725

Re:Linus Torvalds: "a spec is close to useless" by starfall-elf · 2009-03-20 18:03 · Score: 1

Wouldn't this be a POSIX-compliant version of write(2)?

ssize_t write(int fd, const void *buf, size_t count) { return count; }

Since there's no guarantee that the data is written until fsync(2) is called, and it hasn't been called at the point write(2) returns, why attempt to write the data at all?

Re:LOL: Bug Report by Eskarel · 2009-03-19 13:03 · Score: 1

It's not exactly how it's being described by the GP.

Yes, it's technically accurate, but it's not the point. Ext4 extends the delay between writes out from a maximum of 5 ms to well over a second.

Yes, you read that right: if you write data in an ext4 file system, it won't be on disk until an amount of time you can actually count has passed.

The only reason this gives any kind of performance benefits at all is because most applications are not calling fsync(). The resolution to the data loss this is causing is for pretty much every application to call fsync() a whole lot of the time, which will probably end up with data being written to the disk even more inefficiently than it was before.

Just because the POSIX standards say it, doesn't mean it's right. POSIX is very old now, and was based around technological ideas which are out of date now.

Re:LOL: Bug Report by greg1104 · 2009-03-19 13:08 · Score: 1

There is no such thing as an atomic disk write operation, so your proposed step (3) is working on a bad premise. "One operation immediately followed by the other" is no help either--you have no way of knowing which bytes out of those two writes will and won't be there if there's a crash in the middle of that write.

Let's say you queue writes to two 4096 byte sectors. The power goes out. What made it to disk? Just one sector? Both? The first half of the second one, because the drive reordered the writes based on where the disk heads were at, and it got half the sector written before the capacitors in the drive fully discharged? You have no idea at all what you got, which is why filesystem designers avoid even thinking like this.

The only thing you can do is provide a mechanism to confirm whether a write was successful or not before moving on to a second one. fsync provides such a mechanism, which is why discussions of this issue invariably wander into talking about it.

Re:LOL: Bug Report by Eskarel · 2009-03-19 13:11 · Score: 1

Just because POSIX allows it doesn't mean it's not stupid. Legal standards allow me to call my child "dog turd", but that doesn't make it a good idea.
Last I checked it was 5 ms, as opposed to 1.5 seconds in ext4. Ext3 has delayed writes; delayed writes are not a problem. The issue is that 1.5 seconds is a retarded amount of time to wait on anything in a modern computer.
When everyone starts using fsync() because ext4 won't write their data for 1.5 seconds, then all the performance gains by waiting for 1.5 seconds will disappear anyway.

You're right, we don't need a filesystem which sledges in every byte in case there's a crash, and which handles important data differently.

At the same time we don't need a file system which takes so long to write data to a disk that every program has to treat its data as important so we end up with a system which does hammer every byte onto the disk in the case of a crash.

1.5 seconds is stupid in a PC which can perform fifteen thousand operations in that time span, and when everyone has to use fsync() it won't be any faster.

Re:LOL: Bug Report by greg1104 · 2009-03-19 13:14 · Score: 1

I had a silent wish Ts'o would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.

Exactly; the code as written right now has an ugly race condition in it. The best thing you can do when you have one of those is make the conditions under which the problem occurs much more common, so that you get the right feedback for fixing it correctly.

Re:LOL: Bug Report by ubernostrum · 2009-03-19 13:20 · Score: 1

Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, it calls fsync.

Except even that's not enough, and risks data loss. Consider the following from the OS X man page for fsync:

Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.
Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.
This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
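
A minimal C sketch of how an application might act on the man page text above. The helper name flush_to_media is an assumption. F_FULLFSYNC is the fcntl command the man page describes; on systems whose headers do not define it, the sketch just calls fsync(). The fallback to fsync() when the fcntl fails is my own choice, on the assumption that some filesystems may not support the request.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask for the strongest flush the platform offers.
 * Returns 0 on success, -1 on failure (errno set by the failing call). */
int flush_to_media(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: ask the drive itself to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Assumed fallback: fall through to a plain fsync(). */
#endif
    return fsync(fd);
}
```

On platforms without F_FULLFSYNC, whether fsync() also flushes the drive's own cache depends on the OS, the filesystem and the mount options, so this is not a portable guarantee.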

This is almost certainly done in the name of performance, just as at the FS level. Which raises the question of whether, down the road, further "performance improvements" will create a need for F_FULLFSYNC_NO_REALLY_I_MEAN_IT, F_FULLFSYNC_JUST_WRITE_THE_DATA_ALREADY, etc.

Re:LOL: Bug Report by AceofSpades19 · 2009-03-19 13:22 · Score: 1

You do realize that /root is different than /, right?

Re:LOL: Bug Report by jd · 2009-03-19 13:46 · Score: 1

As far as I know, you can't, with POSIX. What you COULD do, however, is have a small battery on the drive and dollops of RAM (say a couple of gigs) dedicated to queueing diskbound traffic. All transactions are dumped to the queue. If the power fails, the queue is intact and can continue being run to the drive when power is restored. This would be good for as long as the battery can maintain the RAM (no reads, no writes, no mechanical devices, no software or CPUs, just the RAM).

An alternative is to have a processor on the drive and shift the filesystem(s) over to that as a program, same way SETI@Home can offload some maths to the GPU. The advantage there is that the filesystem can then spend as long as it likes sorting things out, as it's not tying up the main CPU doing so. Again, it requires a fair chunk of RAM on the drive, and again you'd want to battery-back that RAM dedicated to data to the drive (you don't need to preserve anything else).

Either way would bypass the POSIX issue - to a degree, at least - because it's no longer FS semantics that handle the communication but the virtual FS layer to logical FS layer, and that's not POSIX-specified and can therefore be whatever the inventor of the hardware wants it to be.

(If it's a Linux programmer, the obvious protocol would be the existing virtual-FS-to-logical-FS API that Linux currently uses. Of course, Linux would be the only OS that could use those hard drives for some time, and Windows users likely never could.)

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 14:08 · Score: 0

and unlike a physical drive, even if you can't read it, you can write it so you can just clone it over to a new drive.

Did you just get 'read it' and 'write it' backwards or do I need more coffee? O.o Mah head hurthth.

I agree about delaying writes not making any sense for flash drives, though - as you said, there's no seek time, so there's no advantage to buffering then squirting a bunch of physically sequential writes.

Re:LOL: Bug Report by fractoid · 2009-03-19 14:13 · Score: 1

This was one of the first real-world uses that I saw of ultracapacitors. An ultracap can store just enough power to get the buffer written and the disk parked before it shuts down. Of course, your solution works pretty well too, with a lithium button cell probably able to keep the ram refreshed for at least a day or two.

--
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 14:19 · Score: 0

"Why can't it be done this way instead? "

The file metadata (ie: file length) may not be stored in the same place as the file data. Trying to make this operation atomic interferes with any optimization targeting the reduction of head seeking.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 14:33 · Score: 0

I can't wait for Fedora 11 to come out. EXT4 will be the default file-system. Seems like the KDE 4.0 fiasco all over again.

http://fedoraproject.org/wiki/Features/Ext4DefaultFs

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 14:39 · Score: 0

You could probably fill a library with what I don't know about low-level filesystem details, so please correct me if I have misunderstood this.

There ought to be plenty of room in the Bush Library...

Re:Those who fail to learn the lessons of history. by Zancarius · 2009-03-19 14:50 · Score: 1

Those benchmarks are pretty interesting, but it seems to me that an overwhelming majority of those posted later in the article are CPU-bound operations rather than disk-bound (or GPU, as may have been the case with UT2004).

To be fair both to ext4 and the other file systems, I can't really see the benchmarks you linked to as being representative of real world operations. I'd be hesitant to make my judgments of the merits of a given FS based upon that alone; after all, I personally don't create 4GiB+ files regularly (perhaps someone who does video editing might). I think it would be far more useful to test a file system's capability for reading, writing, creating, and deleting hundreds of smaller files (e-mail and web service load profile) or perhaps the average time taken to load a specific application or series of applications (great for most general usage, such as word processors and the likes). Perhaps I just overlooked that in the article...

I do seem to remember a benchmark some time back that tested things similar to what I mentioned regarding small files between ext4 and reiserfs. It would be wonderful if someone benchmarked both of those file systems in addition to ext3, XFS, and others. Perhaps I'm stricken with excessive skepticism, but the benchmark linked by the OP smells too artificial for my taste. ;)

--
He who has no .plan has small finger. ~ Confucius on UNIX

Re:LOL: Bug Report by Viraptor · 2009-03-19 14:51 · Score: 1

Pretty much every programming book presents a simplified view of the world. Because it teaches C, not systems. Try Rochkind's "Advanced unix programming" one day if you want something close to real world. fsync()'s fairly portable and can be redefined to noop where needed, so I don't see a problem there.

Of course "Adding fsync() all over the place wouldn't fix anything". On the other hand, adding it where it's supposed to go, will :) You cannot be both fast AND transactional in every operation.

Workaround patches already in Fedora and Ubuntu by tytso · 2009-03-19 15:04 · Score: 4, Informative

It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.

Patches to work around buggy applications which don't call fsync() have been around long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.

And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc ---- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.

If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.

Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.

Re:Workaround patches already in Fedora and Ubuntu by AvitarX · 2009-03-19 17:36 · Score: 1

Is it really fair to call them buggy applications because they prioritise as:
old changes > performance > recent changes?
I would say that failure to allow for this type of prioritisation is a bug in the FS/FS driver or OS, not the application.
Being forced to fsync() to preserve the months/years old data in a file feels like overkill to me, and apparently many application programmers agree. The fact that Ext3, and now Ext4 provide mechanisms to set those priorities would IMO be a feature that allows for significantly increased use of caching, as now a choice can be made to defer the write, as long as keeping the old file in the event of a crash is OK.
I am curious as to how much is gained by writing the file many seconds after the metadata though. Perhaps it is wrong-headed of me to think the gains of delaying the write in those instances outweigh the harm of having to fsync() those cases, but more aggressive re-ordering of operations is permitted.
As for the suggestion of using a single binary database for configuration, please no.
And I am also sceptical that Ext3 will be in a state that I can lose my months old data for up to 30 seconds without an fsync().
Or am I misreading this comment?
https://bugs.launchpad.net/ubuntu/jaunty/+source/ecryptfs-utils/+bug/317781/comments/54

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg

Re:Workaround patches already in Fedora and Ubuntu by amorsen · 2009-03-19 21:21 · Score: 1

If I come across other broken file systems, I reserve the right to yell at their authors too.
Meanwhile, if KDE used fsync where you wanted them to, I'd be yelling at them for making the drive spin up for no good reason. Well ok, I wouldn't, because I don't use KDE, but the same is true for Gnome.
The problem isn't that the data gets to disk too late, the problem is that the rename hits the disk too early. You can delay that for half an hour for all I care, when I'm running on battery.

--
Finally! A year of moderation! Ready for 2019?

Re:Workaround patches already in Fedora and Ubuntu by Joey+Vegetables · 2009-03-20 07:41 · Score: 1

Mr. Ts'o . . . I guess I'm still confused about the "proper" behavior of both the app and the FS in this case. Clearly an app should not make assumptions beyond what POSIX guarantees, but unless we do, we would seem to have no way that is both clean and fast to preserve the quasi-transactional behavior that we want . . . we can't guarantee that either the old file, or the new one, would remain, before making the potentially expensive fsync() call. And no matter how frequently we do so, unless we fsync() before we rename, we still could lose not just the new file we're trying to write but the old file as well. Am I missing something? Is this a problem that even can be solved, short of adding new API calls to POSIX?

--

Nonaggression works!

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 15:34 · Score: 0

You knew about the bug, where it is, how to reproduce it and why it exists, know enough about programming to teach a class, yet are hoping that 'they' fix it??? Why not make the fix yourself, publish it in the proper channels (kde mailing lists/bug tracker) to get it in there, and get eternal fame?

Prove that you know what you're talking about and make many others happy. Please? Pretty Please! With sugar on top. And a cherry.

Practice what you pr^H^Hteach.

questions by Vexorian · 2009-03-19 15:47 · Score: 1

A) Never mind the POSIX specs, wouldn't it be better if operations were performed in order? Isn't it a problem that a rename could happen before an operation the app issued earlier? Never mind the delay in the actual disk update.
B) The apps should be calling fsync! OK, here's what I don't get: if the point of this change was to improve performance by reducing disk writes, isn't it a little counterproductive that we are basically asking apps to force a disk write every time they "write" something? Sounds a little self-defeating to me.
Well, it could just as well be that I do not understand the issue correctly.

--

Copyright infringement is "piracy" in the same way DRM is "consumer rape"

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 15:52 · Score: 0

Just because it's stable (i.e., compilable, doesn't spew warnings, has survived poking) doesn't mean it's mature enough for major distros to rely upon.

it is experimental at best.

Whether the kernel folks issued the appropriate disclaimers on it or not, it still lies upon the distros not to include code that is unproven/brand new/reasonably suspect.

This code may be stable, but it is definitely green.

Good god, you're serious!

Here's a quarter, kid -- buy yourself a real operating system.

Re:LOL: Bug Report by Hal_Porter · 2009-03-19 16:07 · Score: 1

As far as I know Intel flash drives use NCQ for this. The idea is that you can keep a bunch of requests pending until you either have one erase block's worth or you hit a timeout.

In fact waiting for a write buffer to fill up a bit before flushing it to disk is actually quite similar conceptually to the original justification for NCQ, that the drive can sort the requests using an elevator seek algorithm.

Even better I think it's done in a way that doesn't lose user data - presumably with something like NTFS the journal keeps track of a pending transaction so it can be rolled back if the system fails before it is completed.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Re:LOL: Bug Report by Hal_Porter · 2009-03-19 16:11 · Score: 1

That's true at the raw flash level. It's not true at the LBA level that filesystems operate. Since there are loads of filesystems that expect to be able to keep overwriting things, at the LBA level all flash disks do some kind of wear levelling.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Re:LOL: Bug Report by Hal_Porter · 2009-03-19 16:13 · Score: 1

Isn't it usually a bad idea to compensate for a lack of a file system feature by adding hardware though?

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Re:LOL: Bug Report by Burpmaster · 2009-03-19 16:20 · Score: 1

Or you could implement atomic renames in software, instead of doing it in hardware...

Re:LOL: Bug Report by fractoid · 2009-03-19 17:08 · Score: 1

True, but if there's a sizable performance boost to be had by moving the bounds a little, then it's a fair tradeoff. Responsibility has been steadily moving away from the CPU (and OS) anyway, out to the peripherals as computers become more decentralised and peripherals become more autonomous. How is delegating write caching to the drive any worse than delegating video decryption to the video card or network packet buffering to the Ethernet hardware?

--
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.

Re:LOL: Bug Report by mrwolf007 · 2009-03-19 17:40 · Score: 1

Well, sure.
There is a reason drives have caches; that's behavior introduced with IDE (the I stands for "integrated": the controller electronics sit on the drive itself).
The main difference here is that a) the drive doesn't bluescreen and b) it should have enough power stored to flush the cache (and park the head) in case of power failure.
I'm not ranting about the idea of using caches. I'm just opposed to doing it someplace it doesn't belong and where it causes more problems than good.
Also there seems to be a weird assumption here that writing a large block of data actually ends up in one "large row" on the device. This isn't necessarily the case for a hard drive (the older the drive, the more sectors will have been remapped to reserve sectors) and definitely shouldn't be the case for a flash drive, which can map sectors even more efficiently since it doesn't even have a seek time.

Re:LOL: Bug Report by DiegoBravo · 2009-03-19 17:52 · Score: 2, Insightful

For many (most?) Unix admins, /root is just a nicer way to specify "/ filesystem" or "root filesystem". The path /root for root user's home directory is popular in Linux, but I never saw it in the Unixes I've used (but I don't know if that custom is a Linux invention.)

Re:LOL: Bug Report by RAMMS+EIN · 2009-03-19 19:19 · Score: 1

``So, while I'm all for applications using fsync when it's really needed, the last thing I'd like to see every application on the planet sprinkling their code with fsync "just to be sure".''

In other words, there is no substitute for doing it right.

--
Please correct me if I got my facts wrong.

Re:LOL: Bug Report by RAMMS+EIN · 2009-03-19 19:30 · Score: 1

``3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.''

This is, indeed, the central issue. And it's an issue not just with filesystems, but with anything that involves concurrency or transactions (multiple things that belong together). And it's an issue that few programmers get right.

The problem is that things often appear to work if you've not done them right. As long as there is only one thread and everything that thread does is performed successfully, there isn't a problem. So the program works, let's ship it. But then, in some Real World situation, there are suddenly multiple threads and/or failing operations. This can lead to spectacular failures.

The question is, of course, who gets to bear the burden for fixing the problems. Personally, I think there is an opportunity here for programming language and library designers to create APIs that make it easier to do the right thing, and harder to do the wrong thing. But given a specific API, it's up to the programmer to use it correctly. That doesn't mean you can't complain about the API and demand better APIs, but it does mean you can't complain when something that implements the API (correctly) does not magically make your broken code do the right thing. (Not that I am claiming that is what happened in this case - I don't know enough about it to be able to judge that.)

--
Please correct me if I got my facts wrong.

Re:LOL: Bug Report by RAMMS+EIN · 2009-03-19 19:34 · Score: 1

``For EXT4, if it crashes, you can get a zero-length file, with both the old and new data lost.''

Since you seem to know what you are talking about, I'm asking you. What is it that causes this data loss? Why does it happen with ext4, but not with other filesystems?

--
Please correct me if I got my facts wrong.

Uh no. by Anonymous Coward · 2009-03-19 20:09 · Score: 0

An admin simply calls it '/'.

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 20:16 · Score: 0

Such an algorithm should take milliseconds, not minutes.

Re:LOL: Bug Report by szundi · 2009-03-19 20:24 · Score: 1

I think he's right in the first half. Let data integrity and retention be the default. Pleeease! I know I should RTFM, but raise your hand if you read an ext3 manual before installing. Please don't just trash the work/life of other people and then laugh 'RTFM'.

Re:LOL: Bug Report by Hal_Porter · 2009-03-19 20:25 · Score: 1

If you need extra hardware (drives with ultracapacitors or computers with a UPS) to run Linux, suddenly a Windows OEM license seems a lot cheaper.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Re:LOL: Bug Report by Anonymous Coward · 2009-03-19 20:38 · Score: 0

Can I be a dick and hold all writes in memory until I get an fsync and call it POSIX compliant?
Hold data for five minutes without an fsync?
One minute?

Maybe POSIX is just not good enough.

Re:LOL: Bug Report by Eskarel · 2009-03-19 20:53 · Score: 3, Informative

I did flip read and write, long day.

Re:LOL: Bug Report by amorsen · 2009-03-19 21:09 · Score: 1

Yes, and from your email address, you really ought to know that. Or did they ditch their Unix labs?

--
Finally! A year of moderation! Ready for 2019?

Re:LOL: Bug Report by amorsen · 2009-03-19 21:13 · Score: 1

I just want to say that I really like that idea, apart from the fsync call. fsync should not have any effect on correctly working (non-crashing) systems. An application should be able to call fsync at any time and not change anything (except ruin performance).

--
Finally! A year of moderation! Ready for 2019?

Re:LOL: Bug Report by Fred_A · 2009-03-19 21:23 · Score: 1

Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, it calls fsync.
If you don't, you risk losing data. Plain and simple.

You mean a programmer actually RTFM ? That's crazy talk !

--

May contain traces of nut.
Made from the freshest electrons.

Flash. delay can be good as well. by leuk_he · 2009-03-19 22:13 · Score: 1

You make the wrong assumption that writing on flash has the same speed as reading. This is the case for hard disks, but for flash, writing takes 10 times longer than reading.

You will not notice this until your write cache is full. Good controllers like Intel's will hide this slow writing for a long time.

Also, most benchmark programs are for disks, and writing is disabled. This does not matter a lot for disks since writing is the same speed as reading. It matters a lot for flash.

But any writing that is prevented, maybe because it is delayed, is a good thing for flash drives.

Re:LOL: Bug Report by ivucica · 2009-03-19 22:56 · Score: 1

Does, for example, Firefox claim full POSIX conformity? Does POSIX demand file to be fsync()ed? No, it doesn't, but you may experience data loss in case of system crash.

I didn't read the POSIX spec, and I can be fairly sure that most FLOSS developers didn't either. And if they did, they surely don't remember every paragraph and every sentence. Not enough to be aware of it at every stage of the development process.

And now a question, does fclose() also call fsync()? That is, if I close the file, can I be reasonably sure that it'll be written to the HDD immediately?
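
On the fclose() question: fclose() flushes the stdio buffer into the kernel, but it is not specified to wait for the data to reach the disk. A rough sketch of what a "durable close" could look like follows; the helper name durable_fclose is invented here for illustration, not a standard function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push a stdio stream all the way to storage,
 * then close it. Returns 0 on success, -1 if any step failed. */
int durable_fclose(FILE *fp)
{
    assert(fp != NULL);
    int ok = 1;

    if (fflush(fp) != 0)                 /* stdio buffer -> kernel */
        ok = 0;
    if (ok && fsync(fileno(fp)) != 0)    /* kernel cache -> storage */
        ok = 0;
    if (fclose(fp) != 0)                 /* always release the stream */
        ok = 0;

    return ok ? 0 : -1;
}
```

Note that fsync() takes a file descriptor, so a FILE* has to go through fileno() first.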

Right, but also wrong by Macka · 2009-03-19 23:33 · Score: 1

I happen to agree with Ted and yourself from a technical standpoint. But where Ted stuffed up is that from ext3 to ext4 he moved the goalposts and didn't communicate effectively to the community what effect this would have. Nor did he explore beforehand the consequences of moving said goalposts on software widely in use. He's since done the right thing, which is to give people a choice; and the public fallout from all this has done the communication for him. But he could have saved himself a public ear bashing if he'd gone about this differently. Hopefully he'll remember this next time.

Re:LOL: Bug Report by Macka · 2009-03-19 23:37 · Score: 1

I usually use: "/ (root)" to cover those in the know, and those who aren't.

The problem is/was in the EXT3 in the first place! by mr3038 · 2009-03-20 00:08 · Score: 2, Informative

POSIX specifies that closing a file does not force it to permanent storage. To get that, you MUST call fsync().

So the required code to write a new file safely is:

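fp = fopen(...)
fwrite(..., fp)
fflush(fp)          /* push the stdio buffer into the kernel first */
fsync(fileno(fp))   /* fsync() takes a file descriptor, not a FILE* */
fclose(fp)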

There is no performance problem because fsync(fd) syncs only the requested file. However, that's in theory... use EXT3 and you'll quickly learn that fsync() is only able to sync the whole filesystem - it doesn't matter which file you ask it to sync, it will always sync the whole filesystem! Obviously that is going to be really slow.

Because of this, way too many software developers have dropped the fsync() call to make the software usable (that is, not too slow) with EXT3. The correct fix is to change all the broken software and in the process that will make EXT3 unusable because of slow performance. After that EXT3 will be fixed or it will be abandoned. An alternative choice is to use fdatasync() instead of fsync() if the features of fdatasync() are enough. If I've understood correctly, EXT3 is able to do fdatasync() with acceptable performance.

If any piece of software is writing to disk without using either fsync() or fdatasync() it's basically telling the system: the file I'm writing is not important, try to store it if you don't have better things to do.
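
To make the fsync() versus fdatasync() choice concrete, here is a small sketch. The helper name append_durably and its meta_too flag are invented for illustration, and fdatasync() is assumed to be available (it is on Linux, but it is an optional part of POSIX).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes to `path` and wait until
 * they are on stable storage. If meta_too is nonzero, fsync() is used;
 * otherwise fdatasync(), which may skip metadata (such as timestamps)
 * that is not needed to read the data back.
 * Returns 0 on success, -1 on error. */
int append_durably(const char *path, const void *buf, size_t len,
                   int meta_too)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* A short write is simply treated as a failure in this sketch. */
    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (rc == 0)
        rc = meta_too ? fsync(fd) : fdatasync(fd);

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Whether fdatasync() is actually cheaper than fsync() on a given filesystem is something to measure rather than assume.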

--
_________________________
Spelling and grammar mistakes left as an exercise for the reader.

Re:LOL: Bug Report by hattig · 2009-03-20 00:36 · Score: 1

Yes, I was thinking the same. You (very simply) need:

* existing file: foo.bar

Transaction One: Write foo_tmp.bar and sync
Transaction Two: Delete foo.bar, rename foo_tmp.bar to foo.bar and sync

For desktop applications you should sacrifice performance in order to ensure that the transactions are committed. I'd go as far as requiring implicit filesystem syncs within 5 seconds of the last call. IIRC the problem with ext4 was that it thought that leaving data unwritten for minutes was okay (presumably because of high read load, but if that's streaming data then the application should have a buffer so it's safe to interrupt to write some data every so often). In addition in low-battery or power failure (via UPS) situations all writes should be synced because power loss is an imminent possibility.
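
A sketch of what "Transaction Two" could look like in C, assuming the temporary file has already been written and synced. The helper name commit_rename is hypothetical, and `dir` is assumed to be the directory containing both paths. There is no separate delete step here because rename() already replaces an existing target atomically.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: rename `tmp` over `dst`, then fsync the
 * containing directory `dir` so the rename itself is on disk.
 * Returns 0 on success, -1 on error. */
int commit_rename(const char *dir, const char *tmp, const char *dst)
{
    assert(dir != NULL && tmp != NULL && dst != NULL);

    if (rename(tmp, dst) != 0)
        return -1;

    /* The new name lives in the directory, so sync the directory. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd);
    if (close(dfd) != 0)
        rc = -1;
    return rc;
}
```

For example, commit_rename(".", "foo_tmp.bar", "foo.bar") would cover the case above when both files are in the current directory.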

As for applications that write transient temporary files for later use, they could be using a ram drive (even the Amiga had these in 1985).

Not quite. by yakovlev · 2009-03-20 00:43 · Score: 1

Actually, there's a deeper issue.

fsync() doesn't really mean what the POSIX spec says it means, and hasn't for a while.

Technically, fsync() means sit and wait until the data has been written to disk, and then return. Since the commit interval on this new filesystem is over a minute, using this view would mean that the application could hang for all of that time.

Because commit windows are now so long (even 5 seconds was a long time), filesystem authors have altered the behavior of fsync() to mean "write to the filesystem NOW." With this new meaning of fsync() and a pedantic view of the POSIX APIs, there is no longer a way to say "I want the old data or the new data after a crash, but don't really care which." (BTW, saying this with POSIX would require spawning a separate thread to do the writes.) Instead, the user is saying "write this to disk, NOW." That's a whole different set of guarantees.

If all of the applications start replacing "I want the old or the new data, but don't care which," with "I want the new data written, NOW," then THAT will REALLY prevent the kind of write optimization that ext4 is trying to do. Delaying writing the rename until after the data is written shouldn't hurt filesystem performance significantly at all. The only cost should be in the in-memory data structures necessary to track this dependency.

In this case, what the application writers are asking for is both good for system integrity AND good for filesystem performance. The alternatives (database, fsync) are all worse, not better.

(Aside: All of this applies to the atomic rename() method. Everyone agrees that using O_TRUNC on an existing file was just dumb.)

Re:Not quite. by mr_mischief · 2009-03-23 04:09 · Score: 1

Sure there's a way to say you want the old data or the new. Rename the existing config file to a backup name first. Then open a new file with the old config file name, write to it, and close it. If there's a crash after the rename, you still have a backup of the original from which to restore. If you find your config file is mangled, empty, or otherwise invalid, restore it by copying from the most recent backup.
Always put the atomic portion first. It can't be interrupted, because that's what "atomic" means. The non-atomic part of the procedure is getting interrupted by the atomic part in the applications this is affecting. The atomic part won't be interrupted by the non-atomic part.
Re:Not quite. by yakovlev · 2009-03-23 10:33 · Score: 1

You have added a significant new requirement on the software in that it must be able to detect not just a non-existent, but a corrupt config file. That alone is fairly onerous, but it's worse than that.

The worst part is that this actually doesn't work, but the reason is subtle.

Consider if the file is being altered twice in the commit interval (fairly likely for a long commit interval like the one for ext4). In this case, the expectation is that the user should be able to recover any of three versions of the file, but doesn't care which.

However, the updates could be committed to disk in the following order:

Lines with a - are non-existent.
Lines with a * lack valid data.

1. rename file- file.old
2. create file*
3. rename file- file.old*
<CRASH>
4. create file*
<CRASH>
5. write file data (file.old* still corrupt)
6. write file.old data (now all clean.)

If a crash occurs in either of the lines marked <CRASH> then the data on disk will still be unrecoverable. Using the generally recommended method and having a behavior like ext3 data=ordered will have the desired behavior of allowing one of the three versions in a recoverable state. Your method would also work, but only with data=ordered. However, failure recovery is more difficult using the method you suggest.

Blaming the application developers is a bit rich. by spaceturtle · 2009-03-20 01:26 · Score: 1

Blaming the application developers is a bit rich. In EXT3 fsync was not only not necessary, it also could cause the system to freeze for 30 seconds, hence userland developers for Linux avoided fsync unless the data was not only just important but *really* *really* important.

Also, application developers have to make many assumptions not explicitly spelt out in the POSIX specification: POSIX does not explicitly specify, e.g., that your machine has more than 16 bytes of RAM, that there is no "rm -rf /" in your initscripts, etc. It is stupid to trumpet a new delayed allocation scheme, and then say "unless you explicitly disable it, your filesystem may enter an inconsistent state", so make sure that you always disable it or it's *your* code that's buggy.

I also salute Mr. Ts'o... for recanting and fixing the damn bug. There is a more thorough discussion here.

Re:LOL: Bug Report by tignet · 2009-03-20 01:47 · Score: 1

For those of us who are not so familiar with the data loss issues surrounding EXT4, can someone please explain this? The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?" I.e. if I ask OpenOffice to save a file, it should do that the exact same way whether I ask it to save that file to an ext2 partition, an ext3 partition, a reiserfs partition, etc. What would make ext4 an exception? Isn't abstraction of lower-level filesystem details a good thing?

If you're old enough to remember back to how RAM above 640k was used in the DOS days, it was usually a RAM disk or disk cache (SmartDrv.exe). If you enabled write caching on SmartDrv.exe performance went way up, but of course you could lose data if you hit the RESET button before it had flushed.

... skip ahead a few years ...

Modern operating systems automatically cache data because it increases performance. Specifics of the size of the write cache and length of time before it's written to disk may vary, and each filesystem will have its own defaults.

EXT3 defaulted to committing data to disk after a maximum of 5 seconds. EXT4 increases that time to 150 seconds. (The exact numbers vary a bit, but you get the idea). Bottom line: When there is a system crash with EXT4 you notice losing data more often because there is a larger window of when data can get lost.

This is a very basic overview, but there are two groups weighing in on this:

Group 1: Things break under EXT4 that worked under EXT3!

Group 2: Look pal, it works fine. If you want your data committed right away so that you don't lose data maybe you should be calling fsync() so that the OS knows to commit your data? Because you know what, even with EXT3 you have data loss. It becomes more noticeable with EXT4 because of the longer cache times, but the problem always existed!

Group 1: It worked before! And if we commit our data immediately, performance drops!

Group 2: It didn't really work before, in laptop mode the EXT3 write time increases to 30 seconds. The problem has always existed! If you don't like taking the performance hit of committing data immediately, perhaps you shouldn't be writing so many tiny files so often!

Group 1: But it worked before! EXT4 is broken!

Group 2: Okay, look. You're obviously not listening. Why don't we make EXT4 behave more like EXT3 and do some auto-commits. Poorly coded applications will not lose data as often, and properly coded applications will not perform as well as they could.

Group 1: I'm taking this to Slashdot. EXT4 is teh suxx0rz!

Group 2: *sigh*

Re:LOL: Bug Report by skeeto · 2009-03-20 02:28 · Score: 1

Just because the POSIX standards say it, doesn't mean it's right. POSIX is very old now, and was based around technological ideas which are out of date now.

POSIX is being very flexible here, and rightly so. Ext3 and ext4 represent different trade-offs between two extremes. On one extreme you have very high data reliability, where data immediately goes to the disk and it is rare that anything is lost. You can accomplish this by mounting with sync. However, this has terrible performance.

On the other extreme we have great performance because the OS never ever writes to the disk. It would be like a live CD (ignoring CD/DVD reads). However, if the system crashes, you lose everything.

Real filesystems go somewhere in the middle, trading some reliability for performance. Ext4 just shifts things more towards performance than ext3. If POSIX was more rigid, we wouldn't have the choice of where to make the trade-off without breaking the standards. It would be poor decision making had POSIX chosen some arbitrary time limit for disk writes. POSIX isn't this way because it is old, but because they were being flexible in preparation for the future. That was good forethought.

If you don't like the ext4 trade-offs, stick with ext3. Linux is rock solid, though, and I have seen a Linux kernel panic only once in my life. The only thing I have to worry about are power outages. And my cat stepping on the damn switch on the power strip. Shifting a bit towards performance sounds nice to me.

And if we ignore POSIX, then what do we have? We end up getting crap like the arbitrary, undocumented mess that is the Windows API.

The only reason this gives any kind of performance benefits at all is because most applications are not calling fsync().

This is good, as they are deferring to the filesystem, letting the user choose the trade-off. Apps need to do fsync() if the data is very important (i.e. system logs) or if the application is destroying important old data (writing over old config files, like KDE). I am sure there are a few other situations too. But if they are always calling fsync() for every write they are doing it wrong.

Re:LOL: Bug Report by marcosdumay · 2009-03-20 03:20 · Score: 1

"Thus to be fs independent, applications should call fsync to force data be physically written to disk."

Or they could document that feature (that it seems that everybody needs badly) and fix ext4... People could also commit to a middle point, adding another system call for atomic moves, or another couple of opening modes.

--
Rethinking email

Re:Those who fail to learn the lessons of history. by marcosdumay · 2009-03-20 03:31 · Score: 1

"Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago?"

Because the standard wasn't fixed, just the code.

--
Rethinking email

Re:LOL: Bug Report by jd · 2009-03-20 03:42 · Score: 1

If it was that way round (vanilla Windows vs. hardware + Linux), sure. However, Linux wouldn't require the extra hardware to run, so you don't "need" it.

It would also be a fairer comparison to say that Linux + a capacitor + a rechargeable battery for RAM = Windows + a SAN box. And unless you're talking one serious capacitor, the SAN box is going to be more expensive.

Besides, battery-backed RAM and ultracapacitors wouldn't be OS-specific. You'd get a performance gain and a reliability gain on ANY OS that supported queued commands. At that point, any OS that failed to queue correctly (on the assumption the drive must be slow) would suddenly be much less reliable than any of its competitors.

Also, bear in mind that a rechargeable battery capable of preserving the disk's command queue is probably going to add fifty cents to every hundred dollars of disk. Unless you're buying disks in the same sort of numbers as Google, the overheads are going to be lower than the difference in disk price between stores. You'd never come close to reaching even a single Windows license fee.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:LOL: Bug Report by jd · 2009-03-20 04:07 · Score: 1

It's not really compensating for a defect in the software, because an ultracapacitor or battery-backed RAM will work on all filing systems, and indeed on all Operating Systems.

Even the most extreme solution I offered is nothing more than turning a local hard drive into NAS where the "network" just happens to be the internal bus. It has all the benefits of NAS (such as the queue not being corrupted when you power down the machine, not tying up the main CPU, and so on), but at a fraction of the cost (because you don't need the compute power or a whole new network + NICs).

You could also argue that it's not "adding" hardware, since all IEEE 488 (and many SCSI) hard drives used to be intelligent peripherals. Rather, all more modern hard drives are cut-down. It's like the Winmodem. Nobody argued that "full" modems were Winmodems + added hardware. Besides, a 50 cent capacitor and a torch battery are hardly extensive hardware mods.

Nor is this really a departure from current design methods. SCSI drives are forever increasing the size of the queues they support and they are already nominally intelligent peripherals. The most that could be said is that this suggestion extends the idea to ATA and SATA drives and replaces the typical absurd 16-command queue with a 32768-command queue.

This idea isn't heavyweight, like UPS. The drive would not run without mains power. Rather, the drive would retain all commands and finish off whatever it had been doing when power is restored. No need for whopping big batteries and expensive extras to keep the mechanical bits going. All you need is to stop the RAM decaying, just enough juice to keep the dynamic RAM refreshing, nothing more. One, maybe two, rechargeable camera batteries should be enough to handle most situations.

And if there's corruption after that, well, even a "perfect" FS wouldn't have given you any better retention.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:LOL: Bug Report by Foolhardy · 2009-03-20 04:13 · Score: 1

I was thinking of one transaction, where the file is truncated and new contents are written. Either both entirely happen, or neither do.

OTOH, I think you're on to something with a simplified transaction system that allows atomic pairs of operations that can be used to implement things like this, e.g. replace one file with another, ensuring it's written completely on commit.

Re:LOL: Bug Report by spitzak · 2009-03-20 04:56 · Score: 1

If you open file A and write some data to it, then close it, it does make sense that if it crashes and you look at the disk, you might see A as being zero length. It was zero length at one time, so this is an expected state to see it in.

However if you then rename A to B, it in effect does the rename before finishing the data. So now B can be a zero-length file.

This is really unexpected, and not good, because most programs doing this actually copied a lot of interesting and useful data from B to A, changing just a little bit (imagine a configuration file where one flag is switched, or a text editor saving the new version). All previous Unix systems when they crashed (if you ignore ones where the disk was left uselessly trashed) would have B either be its old version or with the full data that was written to A, and programs relied on this. It is very, very useful to be able to rely on this.

People saying "fsync!" do not understand the problem, and their solution will make performance dreadful, as bad as EXT2 or worse. The desire is to have *either* the old or new data. If the old data is still on the disk, well then a configuration or some editing was lost, which seems quite understandable considering your machine just crashed. But not the whole file including data that was correctly on the disk for possibly a year before the crash!

Re:LOL: Bug Report by spitzak · 2009-03-20 05:01 · Score: 1

I may not have explained it right. What fsync would do is cause the currently-written file to appear at the name, but it would remain open and writable. Doing creat() immediately followed by fsync() would be like creat() is today.

The reason is that otherwise the fsync() is useless. Nobody else can see the file you are writing, and if the system crashes the file you are writing had better be completely gone when it is brought up (the previous file with that name would still appear). So calling fsync() unless it makes your file appear would serve no purpose.

Re:LOL: Bug Report by Hal_Porter · 2009-03-20 05:26 · Score: 1

You could also argue that it's not "adding" hardware, since all IEEE 488 (and many SCSI) hard drives used to be intelligent peripherals. Rather, all more modern hard drives are cut-down. It's like the Winmodem. Nobody argued that "full" modems were Winmodems + added hardware.

Actually Winmodems are a better example than the hard drive one. Microsoft developed software to run on the host CPU to emulate a modem. As far as I can see they turned the modem hardware itself into a soundcard - actually AMR cards were a Winmodem and a soundcard. Then they had a proprietary standard where the Winmodem manufacturers could write a driver to work in their environment to abstract away the hardware differences.

The laptop I'm writing this on has a Winmodem, a Motorola SM56 on the HDAudio bus, which seems to be a descendant of this sort of technology. What's clever from the Microsoft point of view is that you can make a cheap modem, if you're willing to make it Windows only.

Of course, these days modems are essentially obsolete, but Winmodems are so cheap that they were still putting them into laptops a year ago when I bought this machine. Now imagine if the same situation happened with hard drives, where you could run Windows on a dumb one that was cheap to make but Linux required a more expensive, smarter one. Laptops being cost sensitive and Windows being a common case, you'd quickly find that most laptops came with the dumb drive and Linux would be unable to run on them. With a laptop it's not like you could change the hard drive controller either.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Re:LOL: Bug Report by swilver · 2009-03-20 06:05 · Score: 1

It's a bit more complex than that.

The bug does NOT exist in Ext3 with its default mount options, because it saves data in ordered mode. That means that it will either flush everything to disk up to a certain point in time (meta data and data alike) or nothing at all.

The problem with Ext4 is that they decided it would be a good idea to have 2 time lines. One for meta-data, and one for other data. It's possible those two are NOT in sync (as in, meta-data has been flushed up to point X in time, while data has only been flushed until point Y in time). Since meta-data is usually tiny in comparison to actual data, I don't even see why they would do this. Just do not flush meta-data until you have to actually flush real data as well. Problem solved.
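To make the two-time-lines point concrete, here is a toy model in Python. It is only a sketch: the Op record, the survivors() helper and the integer timestamps are made up for illustration and say nothing about how Ext4 is actually implemented. It just shows what reaches disk when meta-data and data are flushed up to different points in time.

```python
from dataclasses import dataclass


@dataclass
class Op:
    time: int   # when the application issued the operation
    kind: str   # "meta" for meta-data, "data" for file contents
    desc: str


def survivors(ops, meta_flushed_to, data_flushed_to):
    """Return the operations that reached disk before the crash."""
    on_disk = []
    for op in ops:
        limit = meta_flushed_to if op.kind == "meta" else data_flushed_to
        if op.time <= limit:
            on_disk.append(op.desc)
    return on_disk


ops = [
    Op(1, "meta", "create foo.tmp"),
    Op(2, "data", "write contents of foo.tmp"),
    Op(3, "meta", "rename foo.tmp -> foo"),
]

# Two time lines: meta-data flushed up to t=3, data not flushed at all.
two_timelines = survivors(ops, meta_flushed_to=3, data_flushed_to=0)

# One time line: everything flushed up to the same point, t=1.
one_timeline = survivors(ops, meta_flushed_to=1, data_flushed_to=1)
```

In the first case the rename survives the crash but the contents do not, which is the empty file people are complaining about. In the second case you lose recent work, but what is left is consistent.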

Furthermore, if you indeed did change all applications to call fsync() when needed, performance of Ext4 would be worse than current Ext3 performance in ordered mode.

Re:The problem is/was in the EXT3 in the first pla by swilver · 2009-03-20 06:31 · Score: 1

As a filesystem author, I can tell you that calling fsync() for whatever reason is ALWAYS a huge performance hit. The only thing applications expect is that things are ==SEEMINGLY== done in order (as in a time line). Ext4 can send stuff over the internet for all I care, but when my application asks it to do A, B, C, then no matter what happens, I can NEVER EVER end up with A & C after a crash. The only acceptable states after a crash are nothing at all, A, A+B or A+B+C.

That still leaves plenty of room for writing things out-of-order and doing delayed block allocation, because, as long as order is guaranteed, I don't care if things happen like this:

A (10 minutes pause) B + C

The order of the actions is important, not when they get flushed (if ever), as long as no FUTURE events are flushed first without flushing all preceding events. I could write a filesystem that only touches the disc every 30 minutes (given sufficient memory) and still be able to preserve this simple basic expectation.
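Put as code, the expectation is nothing more than a prefix check. This is a sketch in Python with a made-up function name, not something any filesystem exposes:

```python
def is_acceptable_crash_state(issued, survived):
    """True if the surviving operations are a prefix of the issued ones."""
    survived = list(survived)
    return survived == list(issued[:len(survived)])


issued = ["A", "B", "C"]

nothing_ok = is_acceptable_crash_state(issued, [])
a_ok = is_acceptable_crash_state(issued, ["A"])
ab_ok = is_acceptable_crash_state(issued, ["A", "B"])
abc_ok = is_acceptable_crash_state(issued, ["A", "B", "C"])

# The state that must never happen: C made it to disk but B did not.
ac_ok = is_acceptable_crash_state(issued, ["A", "C"])
```

How long the filesystem sits on A, B and C before flushing does not appear anywhere in that check, which is the whole point.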

Re:LOL: Bug Report by muridae · 2009-03-20 06:45 · Score: 1

Transaction One: Write foo_tmp.bar and sync
Transaction Two: Delete foo.bar, rename foo_tmp.bar to foo.bar and sync

I think that might be the wrong order for transaction two, unless you can accomplish all of that with a single write to the disk. Might it be better to rename foo_tmp.bar to foo.bar, then delete the original? That way, if the power were to fail between the delete and the rename, you would still have a file.

Or, with a journal, you could do it in three steps, just like you said:
1: journal that foo_tmp.bar will become foo.bar if foo.bar is gone.
2: write foo_tmp.bar and sync.
3: delete foo.bar, rename foo_tmp.bar, and sync.
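For what it's worth, on a POSIX system the separate delete is not needed at all, because rename() replaces an existing target in one step. A minimal sketch in Python, assuming a POSIX system (os.replace maps to rename() there); the replace_atomically name and the ".tmp" suffix are my own choices for illustration:

```python
import os
import tempfile


def replace_atomically(path, data):
    """Write data to a temporary file, sync it, then rename it over path."""
    tmp_path = path + ".tmp"  # hypothetical naming scheme
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push the data to disk
    # No delete step: the rename replaces any existing file at path.
    os.replace(tmp_path, path)
    # Sync the directory so the rename itself is on disk (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "foo.bar")
    replace_atomically(target, b"version 1")
    with open(target, "rb") as f:
        first_read = f.read()
    replace_atomically(target, b"version 2")
    with open(target, "rb") as f:
        second_read = f.read()
    tmp_left_behind = os.path.exists(target + ".tmp")
```

At every point in that sequence there is a file called foo.bar holding either the old or the new contents, so there is no window where a power failure leaves you with nothing.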

Re:LOL: Bug Report by jd · 2009-03-20 06:53 · Score: 1

True if Linux required it. But let's say Linux'll run on any hard drive, but the "super hard drive" will let you run a desktop or server Linux install that is much more robust.

Laptops generally have power regulation, so can flush buffers and do all other I/O in a controlled way if the main battery reaches a critical level, so I don't see any value in adding robustness that would never get utilized on such a system. Although, if it's only going to add $0.5 - $5 to a hard drive, the value of the selling point would likely see it ending up in most laptops anyway.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:LOL: Bug Report by shentino · 2009-03-20 14:27 · Score: 1

I could say the same things about windows vista...

At least linux is honest when things foul up unexpectedly.

But then the *fs* wouldn't be compliant. by spaceturtle · 2009-03-20 15:18 · Score: 1

1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.

POSIX explicitly requires that data be flushed to disk when fsync is called. This wears out SSDs and forces HDDs to spin up, wearing them out. So while you could write an fs that ignores fsyncs (MacOS X more-or-less does this), that fs would not be POSIX compliant.

So, if your application wants to "delete this file after this other file has been deleted, without triggering a spinup", using fsync is *explicitly* wrong. Under *any* POSIX compliant fs this will force a spinup.

If you do not use an fsync, your app will be compatible with at least some POSIX compliant fs, including ext3 with data=ordered.

2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.

I understand that when running on battery this is usually increased to 15 seconds, which is probably enough to stop your hard disk dying before its time.

4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.

Good. Because we do not want the filesystem to be synchronized with disk (cache?). This forces a spin up. All we want is for atomicity to be preserved during a crash; this does not require a spin up. Note, for example, that if the drive does not spin up before the crash then neither will the old file be deleted nor will the new file be written.

Re:LOL: Bug Report by damium · 2009-03-21 08:01 · Score: 1

Real Disadvantages: You risk data loss with any application that stores critical data using either (1) a truncate/write method or (2) a write/rename method without asking the OS to sync its data. I think that far fewer than 95% of applications fall under (2) and every filesystem will have issues with (1).

For (1) there is nothing the OS can do for the application; just about any file system would lose data in this case, depending on how long it caches the writes in memory and whether the application has a chance to finish writing all of the data. (1) is clearly bad application code at fault. Ext4 does increase the write delay for the data, but any way you use (1) is asking for problems if the system crashes/the disk fills up/etc.

For (2) the file system could implement atomic rename operations, but that would come at a slight performance loss when the application didn't need this atomic operation. This is more of a do-what-I-mean-not-what-I-say workaround, as I don't see too many situations where (2) would be used without expecting atomic operation. If the application didn't care about possible data loss in the file, (1) works well. The real fix however is to call sync() in the application code in this situation; it makes the code more portable across POSIX filesystems.
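To show why (1) is risky on any filesystem, here is a small Python sketch. The file name and contents are made up. Opening an existing file for writing truncates it at once, so there is a window in which the old contents are gone and the new ones have not been written yet, even before any caching or delayed allocation comes into play.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "settings.conf")  # hypothetical file name

    with open(path, "w") as f:
        f.write("old contents")

    # Method (1): truncate, then write.
    f = open(path, "w")  # the old contents are discarded right here
    size_during_rewrite = os.path.getsize(path)
    f.write("new contents")
    f.close()

    with open(path) as f:
        final_contents = f.read()
```

size_during_rewrite is 0: a crash or a full disk between the open and the completed write leaves an empty or partial file no matter how the filesystem schedules its flushes.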

Re:LOL: Bug Report by Eskarel · 2009-03-22 02:04 · Score: 1

Yes, but the point is, the reason they leave it to the file system is that they presume the file system won't be stupid with their data.

There are thousands of posts in here talking about the fact that you're supposed to be calling fsync() and if you're not then it's just your own damned fault. I'm just pointing out that by forcing applications that didn't need fsync before to call it, they're actually going to hurt performance, all in the name of a performance gain.

Re:The problem is/was in the EXT3 in the first pla by Anonymous Coward · 2009-03-22 03:06 · Score: 0

That's not the only issue. Another is that THERE IS NO FUCKING FSYNC in C.

fopen/fwrite/fclose/rename is the best you can do without diving into platform-specific stuff. From that POV any platform that reorders the renaming to happen before fclose is horribly broken. Fuck this bullshit, 9899:1990 comes first, POSIX next.

"If you want a platform independent program, ..." by spaceturtle · 2009-03-22 03:24 · Score: 1

From your link: "If you want a platform independent program, avoid fsync". It seems unfortunate that OS implementers have so often interpreted this as "If you want a platform independent program, kiss your data goodbye."

5 seconds not the problem. by spaceturtle · 2009-03-22 03:28 · Score: 1

It doesn't matter how long the delay is, so long as the OS doesn't reorder "delete old file" to occur before the "create new file". In ext3 in ordered mode, the OS doesn't reorder writes, so you are fine. (unless your hard-disk reorders writes)

Re:LOL: Bug Report by amorsen · 2009-03-22 05:37 · Score: 1

I understand that fsync is useless when done on one of those temporary files, but I think that's preferable to it having a strange side effect.

--
Finally! A year of moderation! Ready for 2019?

Re:LOL: Bug Report by MikeBabcock · 2009-03-24 01:41 · Score: 1

There are mount options to make EXT4 behave the way you want for all applications. This is another good reason not to use one big filesystem for your whole disk.

My mail server has separate LVM mounts for the mail queue, the IMAP storage, the root partition, the logs, /var/lib, local and home so they can be mounted with different options (and independently grown as necessary).

On a home PC, this might mean mounting your /home directory with "just write it now, seriously" while /tmp and /var are more lax.

--
- Michael T. Babcock (Yes, I blog)

Re:LOL: Bug Report by MikeBabcock · 2009-03-24 02:48 · Score: 1

The rename /isn't/ happening before the data is written, unless you replay the journal, and you're not journaling data.

If you think about that, it makes perfect sense.

The rename is NOT happening on disk before the data is written to disk if the system is running normally.

If the system crashes, the log replay may rename the file without data because you're logging metadata (like renames) not data. Just turn on data logging and you'll be fine.

--
- Michael T. Babcock (Yes, I blog)

Re:LOL: Bug Report by Schraegstrichpunkt · 2009-03-30 16:40 · Score: 1

THANK YOU.

--
http://outcampaign.org/

Slashdot Mirror

421 comments