Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4

Not a bug by casualsax3 · 2009-03-11 09:06 · Score: 5, Informative

It's a consequence of not writing software properly. Relevant links later in the same comment thread for those who might otherwise miss them:

https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45

https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54

Re:Not a bug by mbkennel · 2009-03-11 09:19 · Score: 5, Insightful

I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.
Quoting Ts'o:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."
In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
File systems are nice. That's what Unix is about.
I don't think programmers ought to be required to treat them like a pouty flake: "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work."
Re:Not a bug by idontgno · 2009-03-11 09:20 · Score: 3, Insightful

lol.
It's a consequence of a filesystem that makes bad assumptions about file size.
I suppose in your world, you open a single file the size of the entire filesystem and just do seek()s within it?
It's a bug. A filesystem which does not responsibly handle any file of any size between 0 bytes and MAXFILESIZE is bugged.
Deal with it and join the rest of us in reality.

--
Welcome to the Panopticon. Used to be a prison, now it's your home.
Re:Not a bug by jgarra23 · 2009-03-11 09:26 · Score: 2, Interesting

Talk about doublespeak! Not a bug vs. It's a consequence of not writing software properly. reminds me of that FG episode where Stewie says, "it's not that I want to kill Lois... it's that I don't... want... her... to... live... anymore."
Re:Not a bug by Qzukk · 2009-03-11 09:29 · Score: 5, Interesting

I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that their data is safe until the drive catches up.
Fortunately his patches will include an option to turn the magic computer fairy off.

--
If I have been able to see further than others, it is because I bought a pair of binoculars.
Re:Not a bug by TerranFury · 2009-03-11 09:30 · Score: 4, Insightful

Ummm... it deals correctly with files of any size. It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk. That's the case for pretty much any filesystem; it's just a matter of degree, and how "recent" is recent.
Re:Not a bug by Hatta · 2009-03-11 09:32 · Score: 3, Insightful

The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries
Translation: "Our filesystem is so fucked up, even SQL is better."
WTF is this guy thinking? UNIX has used hundreds of tiny dotfiles for configuration for years and it's always worked well. If this filesystem can't handle it, it's not ready for production. Why not just keep ALL your files in an SQL database and cut out the filesystem entirely?

--
Give me Classic Slashdot or give me death!
Re:Not a bug by Anonymous Coward · 2009-03-11 09:37 · Score: 5, Informative

Quoting Ts'o:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, ...
Linux reinvents windows registry?
Who knows what they will come up with next.
Re:Not a bug by fireman+sam · 2009-03-11 09:37 · Score: 4, Insightful

The benefit of journaling file systems is that after the crash you still have a file system that works. How many folks remember when Windows would crash, resulting in a HDD that was so corrupted the OS wouldn't start. Same with ext2.
If these folks don't like asynchronous writes, they can edit their fstab (or whatever) to have the sync option so all their writes will be synchronous and the world will be a happy place.
Note that they will also have to suffer a slower system, and possibly a shortened lifetime for their HDD, but at least their configuration files will be safe.

--
it is only after a long journey that you know the strength of the horse.
Re:Not a bug by GigsVT · 2009-03-11 09:37 · Score: 3, Insightful

Instead, the answer is to use a proper small database like sqllite for application registries
Yeah, linux should totally put in a Windows style registry. What the fuck is this guy on.

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
Re:Not a bug by Logic+and+Reason · 2009-03-11 09:42 · Score: 5, Insightful

I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
To paraphrase https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 : You certainly can use tons of tiny files, but if you want to guarantee your data will still be there after a crash, you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.
Re:Not a bug by msuarezalvarez · 2009-03-11 09:46 · Score: 5, Insightful

As an application developer, the last thing I want to worry about is whether or not the fraking filesystem is going to persist my data to disk.
As an application developer, you are expected to know what the API does, in order to use it correctly. What Ext4 is doing is 100% respectful of the spec.
Re:Not a bug by davecb · 2009-03-11 09:46 · Score: 5, Insightful

It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.
Generally when one is trying to maintain correctness one allocates space, places the data into it and only then links the space into place (paraphrased from Barry Dwyer's "One more time - how to update a master file", Communications of the ACM, January 1981).
I'd be inclined to delay the metadata update until after the data was written, as Mr. Ts'o notes was done in ext3. That's certainly what I did back in the days of CP/M, writing DSA-formatted floppies (;-))
--dave

--
davecb@spamcop.net
Re:Not a bug by davecb · 2009-03-11 09:49 · Score: 4, Informative

Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))
--dave

--
davecb@spamcop.net
Re:Not a bug by OeLeWaPpErKe · 2009-03-11 09:51 · Score: 5, Informative

Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.
Oh dear God, someone inform the president ! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.
You should not write your application to depend on filesystem-specific behavior. You should write it to the standard, and that means fsync(). No call to fsync, no guarantee: look it up in the documentation (man 2 write).
The rest of what Ted Ts'o is saying is optimization, speeding up the boot time for gnome/kde; it is not necessary for correct operation.
Please don't FUD.
You know I'll look up the docs for you :
(quote from man 2 write)

NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee
that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has
been written, the call succeeds, and returns the number of bytes written.
That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)
So the normal case for a "reliable write" would be this code :
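size_t written = 0;
ssize_t r = write(fd, &data, sizeof(data));
while (r >= 0 && written + (size_t)r < sizeof(data)) {
    written += (size_t)r;
    // resume after the bytes already written instead of resending from the start
    r = write(fd, (const char *)&data + written, sizeof(data) - written);
}
if (r < 0) { // error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets
}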
and *NOT*
write(fd, data, sizeof(data)); // will probably work
Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)
Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.
Re:Not a bug by CyprusBlue113 · 2009-03-11 09:54 · Score: 2, Insightful

Unless you have an explicit sync there, YOUR ASSUMPTION IS BUGGED. This is completely reasonable behavior of a write caching system.

--
a handful of selfish greedy people are no match for millions of selfish, greedy people -u4ya
Re:Not a bug by caerwyn · 2009-03-11 09:56 · Score: 4, Insightful

No. It's not.
If what you say is true there would be no need for the fsync() function (and related ones).
Read the standards if you want. The filesystem is only bugged if it loses recent data under conditions where the application has asked it to guarantee that the data is safe. If the app hasn't asked for any such guarantee by calling fsync() or the like, the filesystem is free to do as it likes.

--
The ringing of the division bell has begun... -PF
Re:Not a bug by Anonymous Coward · 2009-03-11 09:59 · Score: 2, Insightful

You're wrong, and so are most comments here.
When you open() a file in the filesystem, write() one byte to it, and close() that file, you haven't really guaranteed crap on any normal filesystem, unless you're using a very strange filesystem or you're using non-standard mount options to force every action to happen synchronously.
If a crash happens between close() and the filesystem flushing data to disk, you will lose data. If you want to prevent this happening, you must either use calls like fsync() or fdatasync() (or many other mechanisms that act similarly), or use mount options that make all calls synchronous.
The only reason this has become a big blow-up issue with ext4 is that while other filesystems generally would sync the data shortly anyways, ext4 does not. Everyone has been relying on bad assumptions about filesystem behavior and getting by on the fact that "usually" the situation was resolved "somewhat quickly". ext4 does not resolve these things quickly, in the name of efficiency and performance. There was never a guarantee under any filesystem of things getting done (to disk) quickly unless you explicitly ask for it.
Re:Not a bug by Profane+MuthaFucka · 2009-03-11 10:01 · Score: 5, Funny

That would be smart, but only if the SQL database is encrypted too. It's theoretically possible to read a registry with an editor, and we can't have that. Also, we need a checksum on the registry. If the checksum is bad, we have to overwrite the registry with zeroes. Registries are monolithic, and we have to make sure that either it's good data, or NONE of it is good data. Otherwise the user would get confused.
I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.

--
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
Re:Not a bug by Jurily · 2009-03-11 10:04 · Score: 4, Informative

It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk.
No, that's the bug. It loses ALL data. You get 0 byte files on reboot.
Re:Not a bug by PIBM · 2009-03-11 10:05 · Score: 2, Interesting

That's your filesystem definition. Even there, I can guarantee you it can't be built, thus, from your point of view, no file system will ever not be bugged.
How come ?
I open a file
I write one byte
I close the file
Data is not on disk BECAUSE IT WAS FULL and you failed to plan for intercepting errors / warnings.
The filesystems needs to be used along with their specifications, not the way you'd want them to work.
Re:Not a bug by caerwyn · 2009-03-11 10:15 · Score: 2, Insightful

I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.
No. Writing software properly means calling fsync() if you need a data guarantee.
Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.
And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.

--
The ringing of the division bell has begun... -PF
Re:Not a bug by caerwyn · 2009-03-11 10:16 · Score: 4, Informative

You're right. The correct thing to do is to *always* call fsync() when you need a data guarantee, *regardless* of which FS you're on. The fact that not doing it in the past hasn't caused problems isn't the problem- those calls are the correct way of handling things.

--
The ringing of the division bell has begun... -PF
Re:Not a bug by dmiller · 2009-03-11 10:21 · Score: 4, Informative

You are doing it wrong; permanently failing on recoverable EINTR and EAGAIN errors. See here for how to do it right.
Re:Not a bug by xenocide2 · 2009-03-11 10:23 · Score: 4, Insightful

UNIX filesystems have used tiny files for years and they've had data loss under certain conditions. My favorite example is the XFS that would journal just enough to give you a consistent filesystem full of binary nulls on power failure. This behavior was even documented in their FAQ with the reply "If it hurts, don't do it."
Filesystems are a balancing act. If you want high performance, you want write caching to allow the system to flush writes in parallel while you go on computing, or make another overlapping write that could be merged. If you want high data security, you call fsync and the OS does its best possible job to write to disk before returning (modulo hard drives that lie to you). Or you open the damn file with O_SYNC.
What he's suggesting is that the POSIX API allows either option to programmers, who often don't know there's even a choice to be had. So he recommends that the few people who do know the API inside and out focus on system libraries like libsqllite, and that dumbass programmers use those instead. You and he may not be so far apart, except his solution still allows hard-nosed engineers access to low level syscalls, at the price of shooting their foot off.

--
I Browse at +4 Flamebait
Open Source Sysadmin
Re:Not a bug by Kaboom13 · 2009-03-11 10:25 · Score: 2, Informative

The point of a journal is to allow the file system to return to a defined state in the case the unexpected happens. This keeps the whole file system from being fucked by a crash or sudden data loss. It's better to know you lost some data than to have the filesystem in a state where some data is corrupt but you have no way to tell where or what it is. The situation here is that ext4 has increased the timeframe between commits. This increases performance at the cost of losing more data if a crash happens. Total crashes are pretty rare these days (unless you run some really shitty code) and UPS's are inexpensive. Hell my XP system has Blue Screened once over the last two years, and it was directly related to a beta nvidia driver.
If your system is likely to crash or lose power, don't use ext4.
Re:Not a bug by QuasiEvil · 2009-03-11 10:27 · Score: 3, Insightful

In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
I couldn't agree more. A filesystem *is* a database, people. It's a sort of hierarchical one, but a database nonetheless.
It shouldn't care if there's some mini-SQL thing app sitting on top providing another speed hit and layer of complexity or just a bunch of apps making hundreds of f{read|write|open|close|sync}() calls against hundreds of files. Hundreds of files, while cluttered, is very simple and easily debugged/fixed when something gets trashed. Some sort of obfuscated database cannot be fixed with mere vi. (Emacs, maybe, but only because it probably has 17 database repair modules built in, right next to the 87 kitchen sinks that are also included.)
I do rather agree that it's not a bug. An unclean shutdown is an unclean shutdown, and Ts'o is right - there's not a defined behaviour. Ext4 is better at speed, but less safe in an unstable environment. Ext3 is safer, but less speedy. It's all just trade-offs, folks. Pick one appropriate to your use. (Which is why, when I install Jaunty, I'll be using Ext3.)
Re:Not a bug by OeLeWaPpErKe · 2009-03-11 10:28 · Score: 2, Informative

You're only partially right. EAGAIN cannot occur unless I asked for it first (and modified my error catching accordingly).
But you're right about EINTR causing unwarranted disruption. I should ignore that one in the while loop.
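The loop this sub-thread is arguing about can be sketched roughly as follows. This is only an illustration, not anyone's posted code: the name write_all is hypothetical, it assumes a blocking descriptor (so EAGAIN is not expected), and it retries on EINTR as conceded above while leaving every other error to the caller.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes from buf to fd.
 * Returns 0 on success, -1 on error with errno left as set by write(). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t r = write(fd, p, len);
        if (r < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written: retry */
            return -1;      /* EIO, ENOSPC, EPIPE, ...: let the caller decide */
        }
        p += r;             /* partial write: skip what was already accepted */
        len -= (size_t)r;
    }
    return 0;
}
```

Note that a successful return still says nothing about the data being on disk; that takes a separate fsync(fd).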
Re:Not a bug by PhilHibbs · 2009-03-11 10:37 · Score: 2, Informative

It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.
But you never create and write to a file as a single operation, there's always one function call to create the file and return a handle to it, and then another function call to write the data using the handle. The first operation writes data to the directory, which is itself a file that already exists, the second allocates some space for the file, writes to it, and updates the directory. Having the file system spot what your application is trying to do and reversing the order of the operations would be... tricky.
Re:Not a bug by gweihir · 2009-03-11 10:51 · Score: 2, Insightful

Indeed. And that is what the suggestion about using a database was all about. You still can use all the tiny files. And there are better options than syncing for reliability. For example, rename the file to backup and then write a new file. The backup will still be there and can be used for automated recovery. Come to think of it, any decent text editor does it that way.
Truncating critical files without backup is just incredibly bad design.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
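The "automated recovery" side of the rename-to-backup idea might look something like this on the reading end. It is a sketch under stated assumptions: the function name open_with_fallback is made up, the backup is assumed to live at the same path with a "~" suffix, and a zero-length main file is treated as damaged because that is the symptom described in this thread.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open path for reading, falling back to "path~"
 * when the main file is missing or zero-length.
 * Returns an open descriptor, or -1 if neither file is usable. */
int open_with_fallback(const char *path)
{
    struct stat st;
    char backup[4096];   /* assumed upper bound on path length */

    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        if (fstat(fd, &st) == 0 && st.st_size > 0)
            return fd;   /* main file looks intact */
        close(fd);
    }

    int n = snprintf(backup, sizeof backup, "%s~", path);
    if (n < 0 || (size_t)n >= sizeof backup)
        return -1;

    fd = open(backup, O_RDONLY);
    if (fd >= 0) {
        if (fstat(fd, &st) == 0 && st.st_size > 0)
            return fd;   /* recovered from the backup copy */
        close(fd);
    }
    return -1;
}
```

An application whose files may legitimately be empty would need a different damage check, for example a checksum stored with the data.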
Re:Not a bug by gweihir · 2009-03-11 10:59 · Score: 2, Informative

The point of having a rock-solid filesystem is to have a rock-solid filesystem. Any filesystem that crashes and loses data is bad. What is the point of a journal again? To enforce someone's idea of how an API should be coded to, or to reduce data loss?
ext4 did not crash. Ext4 also did not lose any data it claimed to have gotten to disk. However, unless you want the filesystem slower by a factor of 10x....100x, you have to delay writes. And that means your data is only reliably on disk after an fsync. Any good developer knows that.
Incidentally, the journal serves to avoid filesystem corruption on crash, nothing else. And no other claim was ever made by the developers.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Not a bug by Bronster · 2009-03-11 10:59 · Score: 3, Insightful

You're welcome to write lots of little files. It will just be slow if you sync them all, or unsafe if you don't.
Same way a database will tell you to wrap lots of actions in a single transaction if you don't want the cost of a full commit after each action.
Except the filesystem API doesn't have any way to say "commit these 500 little files in a single transaction", unfortunately.
Annoyingly, it also doesn't have "unlink this directory and the files inside it in a single transaction", because unlink performance blows goats.
Re:Not a bug by bluefoxlucid · 2009-03-11 11:10 · Score: 2, Funny

No, we have journals so the file system doesn't get a gaping hole in it that starts cross-linking shit and damaging more files after the initial data loss, and then implode and fuck your mom.

--
Support my political activism on Patreon.
Re:Not a bug by drew · 2009-03-11 11:28 · Score: 2, Insightful

The whole bit you quoted about SQLite was about optimization, not correctness.
The KDE and Gnome developers would be OK using the current file structure to save data so long as they had bothered to call fsync().

What emacs (and very sophisticated, careful application writers) will do is this:
3.a) open and read file ~/.kde/foo/bar/baz
3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
3.d) fsync(fd) --- and check the error return from the fsync
3.e) close(fd)
3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
The fact that series (1) and (2) works at all is an accident. Ext3 in its default configuration happens to have the property that 5 seconds after (1) and (2) completes, the data is safely on disk. (3) is the ***only*** thing which is guaranteed not to lose data. For example, if you are using laptop mode, the 5 seconds is extended to 30 seconds.
The problem is that the KDE developers were skipping step "d", presumably because they felt it slowed down the application too much. Fortunately(?) for them, with ext3 in its default configuration, it happened to not matter too much that they were skipping an important step.
The part you quoted was merely discussing a potential way to store lots of isolated bits of data without the overhead of calling fsync() constantly.

--
If I don't put anything here, will anyone recognize me anymore?
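Steps 3.b through 3.g quoted above translate fairly directly into C. The following is a minimal sketch, not KDE's or emacs's actual code: the name replace_file, the 0600 mode and the 4096-byte path buffer are assumptions, the optional backup rename (3.f) is left out, and it does not fsync the containing directory, which some setups also want.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace the file at path with len bytes from buf.
 * Returns 0 on success, -1 on failure (the old file is left untouched). */
int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];      /* assumed upper bound on path length */
    const char *p = buf;

    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);   /* 3.b */
    if (fd < 0)
        return -1;

    while (len > 0) {                                          /* 3.c */
        ssize_t r = write(fd, p, len);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += r;
        len -= (size_t)r;
    }

    if (fsync(fd) != 0)                                        /* 3.d */
        goto fail;
    if (close(fd) != 0) {                                      /* 3.e */
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {                              /* 3.g */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

Skipping the fsync() at 3.d is exactly the shortcut the thread is about: the rename can then reach disk before the new file's data does.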
Re:Not a bug by gweihir · 2009-03-11 11:29 · Score: 3, Insightful

The idiocy is in expecting the FS to do something it was never asked to do. There is one way to commit data to disk in Posix systems. That function has existed for well over 20 years. It's probably going on 35 years now, but I don't know my Unix history well enough to be sure.
I think the problem is with more and more people believing themselves to be good programmers, when they really do not understand what they are doing. Truncating and then writing critical files is a very bad idea to begin with. The way you do it is to rename the old file to backup and write to a new file. Also have a procedure in place to recover from backup if the main file is broken. Maybe even do checksums on the main file. In addition, only write if you have to. That is robust design, not the amateur-level truncate the KDE folks seem to be doing routinely.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Not a bug by Ed+Avis · 2009-03-11 12:01 · Score: 3, Informative

YES!! That is EXACTLY what I expect every modern file system to do.
Your expectation is quite reasonable. When the application writes something to disk, it should be there on disk, right? The way the article is presented makes it sound like a horrible bug in ext4 that it doesn't do this. But believe it or not, almost no filesystem provides this guarantee by default. ext3 doesn't (in the default mode), nor does ext2, nor a typical implementation of FAT or NTFS or the Minix filesystem or whatever.
For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later, giving better disk performance at the expense of losing data if there is a crash. Losing file data is bad but losing metadata is even worse, since corrupt filesystem metadata can trash the contents of many files and requires a lengthy fsck on startup. So journalling filesystems, as typically configured, keep a journal for metadata so it's not corrupted even if the power gets cut at the most inconvenient moment. But they don't extend the same care to file contents, because it would be too slow. You can enable it by setting the data=journal parameter in ext3 (and I guess ext4 too) but this isn't the default.
It is certainly a bit unfair that the filesystem takes such pains with its own bookkeeping information but doesn't bother to be so careful about user data. But as I said, it's a known tradeoff to get better performance. If you want to be sure your file has reached disk you need to fsync(). This sucks, but it's the Unix way, and has been so for like, forever. So it's not a bug in ext4 - just bad luck and perhaps a misunderstanding between kernel and userspace about what guarantees the filesystem provides.
As SSDs replace rotating storage, there is less need to buffer writes (certainly the need to minimize seek time goes away, and that's the biggest reason), so we might see this whole situation resolved within a few years. Perhaps in 2015, when the system call returns, you can be sure that the data is written. Until that longed-for day, bear in mind that your filesystem is permitted to temporarily lie to you about what has been written, and call fsync() if you are paranoid.

--
-- Ed Avis ed@membled.com
Re:Not a bug by Qzukk · 2009-03-11 12:49 · Score: 4, Informative

A file system should take my data buffer, and after saying "Ok, I got it"
There's your problem, you didn't even bother to ask if it got it, you just threw a ton of data into the file descriptor and closed it, now didn't you. And you want me on thedailywtf?
But lets back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it".
All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.
The filesystem does exactly what you tell it to do. If you don't want it to make a zero byte file, then DON'T USE O_TRUNC OR *truncate() TO EMPTY YOUR FILE. Make a new file, fill it up, rename it over the other file. Don't assume that in just a few instructions, you're going to be filling it back up with new data, because those instructions may never arrive.
You don't like it? Try and convince people that (open file, erase all the data in it, do some stuff, write some data, do some more stuff, write some more data, write data to disk, close file) should be an uninterruptable atomic operation. You want a versioning filesystem? Take your pick.

--
If I have been able to see further than others, it is because I bought a pair of binoculars.
Re:Not a bug by shutdown+-p+now · 2009-03-11 13:07 · Score: 2, Informative

The problem with that is that you have to use fsync() for each and every file descriptor you have, and for lots of small files, this is very slow (because if you're syncing after every 10-byte write, you might as well have no caching). What's needed is a way to write those files in a batch, close them all, and then say "now sync all of that".
To the best of my knowledge, though, Windows has the same problem - its fsync analog, FlushFileBuffers, also applies to a single file handle only (you can flush all writes for the volume, but only if you're an admin).
Re:Not a bug by dbIII · 2009-03-11 13:13 · Score: 4, Insightful

Linux reinvents windows registry?
It's called "gconf", and it's worse than that. It's no longer abandonware lurking at the heart of gnome but it's still a nightmare.
Re:Not a bug by GXTi · 2009-03-11 13:47 · Score: 2, Informative

and after saying "Ok, I got it", *guarantee*, that I can turn off the system in that very moment, without losing data or corrupting the file system in any way.

Which is precisely what fsync does, and is precisely what these developers didn't use. The filesystem knows better than you do how to get all the data it has to write onto the platters as fast as possible so if you need something specific like "it's important that this data get written now, so I'll wait for you to finish", you have to ask. Otherwise your apps would run a great deal slower since every little write (even a single byte!) would have to wait for the OS to say "OK, it's on disk". And if you really want that, there are flags you can use, e.g. O_SYNC. But you don't.
Re:Not a bug by shutdown+-p+now · 2009-03-11 14:00 · Score: 3, Insightful

Close, but no cigar. The data we need safe is the one already on the disk: if you don't flush, you get to keep the old version already on the disk.
That's an interesting interpretation of fsync(), but, unfortunately, one that's not supported by the POSIX spec. Nowhere does it say that the system cannot flush the data that you've already written so far without an explicit fsync() call. If you're unlucky enough that this happened after you've truncated the file, but before you wrote anything into it - well, too bad. As I understand, ext3 could also exhibit this behavior, it was simply harder to reproduce because the implicit flushes were much more frequent.
Anyway, this post seems to explain what's actually going on there in the (very specific) case of KDE.
Re:Not a bug by EsbenMoseHansen · 2009-03-11 21:02 · Score: 2, Informative

No. Writing software properly means calling fsync() if you need a data guarantee.
But neither Gnome nor KDE needs this. What they need is that the file in question is either left in the old state or in the new state. The problem is that ext4 rushes in to complete the truncation, but lazily after 1-2 minutes (!) writes the actual data. That is quite broken, in my opinion. The obvious solution would be to bundle the truncation with the writing out the data.

Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.
And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.
In KDE's case, it would be as simple as reverting a patch. The fsync() calls were removed because of the bugs associated with them, including killing laptop batteries. Dig through kde-core-devel for the gory details. The code in question is posted elsewhere.
The bug is in ext4, like it was in XFS --- where it was finally fixed. And it looks like ext4 has introduced a hack to sort of fix this problem there, too.

--
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
Re:Not a bug by mzs · 2009-03-12 03:37 · Score: 2, Insightful

Unfortunately that is case #2 as described here:
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
rename(2) is not guaranteed to be atomic. There are now some patches that get ext4 to perform what most people expect #2 to do. I got bitten by #2 not working correctly on MacOS X some time back, I just googled and found this:
http://www.weirdnet.nl/apple/rename.html
Ever since that time I have been using fsync in my code when I needed it. You just get into a world of hurt when you expect #2 to work right under every OS and fs and set of mount options because it doesn't.

Don't worry by sakdoctor · 2009-03-11 09:06 · Score: 5, Funny

Don't worry guys, I read the summary this time, and it only affects the German version of ext4.

Re:Don't worry by Daimanta · 2009-03-11 09:08 · Score: 3, Funny

Makes perfect sense: Germans are ridiculously punctual, if the allocation is delayed you just KNOW something is terribly wrong.

--
Knowledge is power. Knowledge shared is power lost.
Re:Don't worry by microbee · 2009-03-11 09:30 · Score: 2, Funny

OMG, you expect me to RTFA??!! In a BUGzilla?

pr0n by Quintilian · 2009-03-11 09:11 · Score: 5, Funny

Real reason for the bug report: Someone's angry and wants his porn back.

Bull by Jane+Q.+Public · 2009-03-11 09:16 · Score: 4, Insightful

Blaming it on the applications is a cop-out. The filesystem is flawed, plain and simple. The journal should not be written so far in advance of the records actually being stored. That is a recipe for disaster, no matter how much you try to explain it away.

Re:Bull by Lord+Ender · 2009-03-11 09:34 · Score: 5, Funny

In fact, there is no such thing as an OS bug! All good programmers should re-implement essential and basic operating system features in their user applications whenever they run into so-called "OS bugs." If you question this, you must be a bad programmer, obviously.

--
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
Re:Bull by wild_berry · 2009-03-11 09:36 · Score: 5, Insightful

The journal isn't being written before the data. Nothing is written for periods between 45-120 seconds so as to batch up the writing to efficient lumps. The journal is there to make sure that the data on disk makes sense if a crash occurs.
If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.
Re:Bull by Anonymous Coward · 2009-03-11 09:36 · Score: 5, Informative

This is NOT a bug. Read the POSIX documents.
Filesystem metadata and file contents are NOT required to be synchronous, and a sync is needed to ensure they are synchronised.
It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).
RTFPS (Read The Fine POSIX Spec).
Re:Bull by Eugenia+Loli · 2009-03-11 09:44 · Score: 4, Insightful

Rewriting the same file over and over is known for being risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync. In other words, app developers must be more careful about what they do, and not put all the blame on the filesystems. There's only so much that an fs can do to avoid such brouhahas. Many other filesystems have similar behavior to ext4, btw.
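A minimal Python sketch of that sequence, assuming a POSIX system. The function name atomic_write is invented for illustration, and the optional final sync is left out:

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` so a reader sees either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: create the new file next to the old one, since rename()
    # cannot move a file across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # step 2: sync the new file's data
        os.replace(tmp_path, path)   # step 3: rename on top of the old file
    except BaseException:
        # On any failure the old file is left untouched; drop the temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that mkstemp creates the file with owner-only permissions, so a real implementation would probably want to copy the old file's mode as well.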
Re:Bull by Jane+Q.+Public · 2009-03-11 09:48 · Score: 2, Interesting

That does not make it any less of a filesystem limitation. While it is true that a well-written app should be aware of potential timing issues, all the application itself should ever suffer is delays in the I/O. Anything else is a flaw. Other FSs may share the flaw, but it is still a flaw.
Re:Bull by pc486 · 2009-03-11 09:50 · Score: 5, Informative

Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.
All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).
POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.
Re:Bull by gweihir · 2009-03-11 09:51 · Score: 2, Insightful

The problem is KDE not doing syncs and not keeping backups on updates of critical files. Any competent implementor will try to keep these to a minimum with critical files and if they have to be done, do them carefully. Seems to me the KDE folks have to learn a basic lesson in robustness now.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Bull by Anonymous Coward · 2009-03-11 10:21 · Score: 5, Insightful

Bullshit. It is not a filesystem limitation. POSIX tells you what you can expect from file system calls. Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect. (And this is true of every OS I've used in the last 20 years.)
A great many crap programmers think APIs ought to do what they'd like them to. But APIs don't. At best they do what they are specified to do.
Re:Bull by Waffle+Iron · 2009-03-11 10:33 · Score: 4, Insightful

The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is.
If that's the behavior you expect, then you need to be running your apps under an OS like DOS, not POSIX or Windows (which both clearly specify that this is *not* how they function).
Re:Bull by Anonymous Coward · 2009-03-11 10:42 · Score: 5, Insightful

Does anyone else think that 150 seconds is a bit over the top in terms of writing to disk?
I could understand one or two seconds as you speculate more data might come that needs to be written.
5 seconds is a bit iffy, as with ext3.
150 seconds? That's surely a bug.
Re:Bull by billcopc · 2009-03-11 10:44 · Score: 2, Insightful

Why should synchronous writes be the default ? Programmers are already too lazy and/or stupid to add a simple fsync() where needed, why should we all drop what we're doing, make the slowest option the default, and then have to jump through hoops to make things workable again ?
If asynchronous writes are the biggest of your problems, you need to find yourself a new career. One that hopefully doesn't require meticulous attention to detail.

--
-Billco, Fnarg.com
Re:Bull by NotPenny'sBoat · 2009-03-11 10:56 · Score: 2, Funny

The more distant the target, the more you have to lead, and the greater chance there is of something happening between the time you pull the trigger and the time the bullet reaches its target zone: the wind may shift, the target may change speed, or direction...
Or your mother-in-law may step between the barrel and the target. Darn.

--
What's #FFFFFF and #000000 and #FF0000 all over?
Re:Bull by frieko · 2009-03-11 10:58 · Score: 2, Insightful

Except that NTFS does exactly the same thing. Perhaps GP meant it's not a filesystem bug.
Re:Bull by DigiShaman · 2009-03-11 11:04 · Score: 4, Insightful

Wish I had mod points for you AC as I agree with you. 150 seconds is 2.5 minutes! I don't know of any file system, let alone a RAID controller, that waits that long to commit the data.
If this is a feature and not a bug, better be sure your computer is connected to a UPS. Damn!

--
Life is not for the lazy.
Re:Bull by Dahamma · 2009-03-11 11:08 · Score: 2, Insightful

Oh great... basing ext4 performance gains on caching writes in the OS for 2 minutes just means they will focus their optimizations in ways that will suck even worse than ext3 does for applications that can't afford the risk of enabling write caching...
Re:Bull by LWATCDR · 2009-03-11 11:10 · Score: 4, Informative

It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific command called fsync to flush the buffers to prevent the problem.
In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html
Yes, if we had a perfect world we would have instant IO, but we do not. The flaw is in the application, plain and simple.
They didn't use the api properly and it really is just that simple.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Bull by Anonymous Coward · 2009-03-11 11:11 · Score: 2, Informative

Right... that way a single error can brick the whole system at once.
Re:Bull by icebike · 2009-03-11 11:20 · Score: 4, Insightful

It's not a KDE issue. It's not a Gnome issue.
It's a file system risk issue, and it affects everything running on the box.
The EXT4 developers have decided it's OK to increase the risk window by 3000% and
risk a crash for a minute and 20 seconds in an attempt to gain a little
performance. (Damn little performance).
With EXT3 the risk window was 5 seconds. Now it's 150 seconds.
It's ridiculous to move what should be a low-level data integrity function
out of the File System and inflict it on user-land code.

--
Sig Battery depleted. Reverting to safe mode.
Re:Bull by BikeHelmet · 2009-03-11 11:24 · Score: 5, Insightful

Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide.
And then an antivirus program updates while Firefox is starting and a video is transcoding, and your program either slows to a crawl or crashes after 30 seconds of not receiving or being able to write any data.
Recently I was playing Left4Dead when one of my HDDs in my RAID array died in a very audible way. All the drives spun down, then 3 of them came back online. IOPS went to zero for over 60 seconds. No data in or out to those devices!
Interestingly, Ventrilo kept running fine. Left4Dead completely froze, but a minute or so after the 3 drives came back online, it unfroze. (CPU catching up?) All the while I was freaking out on Ventrilo, much to my friends' amusement.
Pretty much everything else crashed, except for Portable Firefox... uTorrent crashed, but first it left corrupted files all over - appearing as undeletable folders, which require a format to remove.
Time for a disk wipe. Thank you, shitty developers! Next time, use the API properly, and if you must have it written to disk, sync it immediately after you write!
Re:Bull by gweihir · 2009-03-11 11:37 · Score: 2, Insightful

Back when 10MB HDDs with 100ms access times were prevalent and floppies were all the rage, buffered I/O was a good idea. If I find that an application is somehow overwhelming my 3.0GB/s SATA bus and 10,000 RPM hard drive, I'll be sure to turn this "feature" on.
Use the "sync" option on a mount some day and be surprised. Synchronous I/O is dog-slow.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Bull by vadim_t · 2009-03-11 11:54 · Score: 5, Insightful

It's not going to happen immediately in any case. Some optimizations can only be done if you introduce a delay, and once introduced you have to deal with the fact that there's a delay. Just because it's one second instead of a minute doesn't mean your computer can't crash at precisely the wrong moment.
While I'm not an expert in filesystems, I'd expect writing a single file to be at least 4 writes: inode, data, update the directory the file is in, and a bitmap to show space allocation. If there's a journal add a write for the journal. Each of those will require a seek due to all of these things being in different places on the disk in most filesystems.
So your 40 small files just turned into 200-250 seeks, which at 8ms each will take 1.6 to 2 seconds to complete.
Now let's suppose we can batch things up. We need to write the inode and data for each file, and can do just one seek for the directory (the same for all), and the bitmap and journal can be updated in one operation. Now we're down to 2 writes per file, giving 80 seeks, plus 3 for metadata, giving 83 seeks, which can be done in 0.6 seconds.
But what if we do delayed allocation and create all the inodes and write all the data as one large contiguous area? We're now down to 5 writes total, with a seek time of 40ms. The time needed to write the data can probably be disregarded, since modern disks easily write at 50MB/s, and those 40 files with metadata probably amount to less than 32K.
And with some optimization, we just reduced the time it takes to write your 40 files to just 2% of the unoptimized time.
You're not going to get this sort of improvement without some sort of delay. If you insist on a per-file write you'll get really, really awful performance on the sort of workload you're using as an example. And you can even see it in practice, just boot a DOS box, and do benchmarks with and without smartdrv. Running something like a virus scanner should show a huge difference in the presence of a cache.
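The back-of-the-envelope numbers above can be redone in a few lines of Python. This is only a sketch of the same estimate, taking five writes per file for the unbatched case (inode, data, directory, bitmap, journal) and the same assumed 8 ms per seek; the variable names are made up here.

```python
SEEK_MS = 8    # assumed average seek time, as in the estimate above
FILES = 40     # number of small files being written

def total_seconds(seeks, seek_ms=SEEK_MS):
    """Time spent seeking, ignoring the actual data transfer."""
    return seeks * seek_ms / 1000.0

unbatched = FILES * 5      # inode, data, directory, bitmap and journal per file
batched = FILES * 2 + 3    # inode + data per file, plus 3 shared metadata writes
delayed = 5                # everything allocated and written as one contiguous run

for label, seeks in [("unbatched", unbatched),
                     ("batched", batched),
                     ("delayed allocation", delayed)]:
    print(f"{label}: {seeks} seeks, {total_seconds(seeks):.3f} s")

print(f"delayed / unbatched = {total_seconds(delayed) / total_seconds(unbatched):.1%}")
```

With these inputs the three cases come to 1.6 s, about 0.66 s and 0.04 s, so the delayed-allocation case is in the region of 2-3% of the unbatched time.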
Re:Bull by Eskarel · 2009-03-11 11:55 · Score: 2, Insightful

That application developers don't always get to choose what filesystem their application is being run on would be my guess.
Disk caching is a good thing (well, at the moment; if/when SSDs become large enough and cheap enough to replace regular old spinning disks for speed-dependent applications, it probably won't be all that useful): it makes everything faster and more efficient. That said, 2.5 seconds is an absolutely huge amount of time in computer terms; even on a really slow PC these days that's thousands of operations being executed before any attempt is even made to write the data to disk. It's a huge and unnecessary risk. Average latency on normal hard drives now is easily below 5 ms; queueing up for 30 times that to try and make things more efficient is just stupid.
Re:Bull by gweihir · 2009-03-11 12:06 · Score: 3, Insightful

It is a KDE issue. Only userland knows which data is critical. Only userland knows whether data can be backed up or not. The OS cannot ensure full data integrity without massive negative performance impact, however much you may wish for it. So what the OS does is give you a way to tell it which data needs to be on disk and which data should be on disk in a while if nothing goes wrong.
There really is no other way of doing it. Unless you think fundamentally defective code is acceptable if the risk of getting hit is a bit smaller?

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Bull by LWATCDR · 2009-03-11 12:12 · Score: 2, Informative

Just use fsync()
Problem solved. Read the Posix docs, or the clib docs and you will never run into this problem.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Bull by mr_walrus · 2009-03-11 12:17 · Score: 2, Insightful

only userland knows WHICH data is critical.
dude, ALL data is critical.
no, this is a serious implementation stoopidity in ext4, et.al.
blame the victim. eeesh. data rape is still rape.
and saying programs should be calling fsync is absurd.
i'm old enough to remember when programmers were admonished
to NOT call fsync, or it would "slow down the system."
sync/flushing data already written by userland standard i/o calls
should never be a userland responsibility.
[shaking head...]
Re:Bull by LWATCDR · 2009-03-11 12:18 · Score: 4, Insightful

No. That is why we have fsync().
No file system will promise you data integrity with a power failure. That is why you should run with a UPS.
You can not depend on the write delay time. What happens if you get a really fast processor and say a really slow drive? Unless you are building software that only runs on ONE set of hardware you just can not do that.
This is a bug that was always in KDE and they got lucky up till now.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Bull by gweihir · 2009-03-11 12:24 · Score: 4, Insightful

dude, ALL data is critical.
If you really think that, then you should leave the era of modern disk access and mount all your partitions with the "sync" option. Then none of your software will have to think about syncing. Of course all file access will be so slow that nobody will want to work with that system either.
Hmm. I wonder why "sync" is not a default mount option?

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Bull by amirulbahr · 2009-03-11 12:58 · Score: 3, Informative

Who modded this up? Jane Q. Public is completely clueless on this topic, but she manages to sound like she has an idea to fellow clueless moderators. She should be called out for the karma whoring ignoramus she is.
Some choice quotes from her on this thread.

Delayed allocation is like leading a moving target when shooting.
BadAnalogyGuy would be proud. Probably also worth mentioning that without delayed allocation, the system would be unbearably slow.

The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything really -- going wrong
A kernel crash or power outage is certainly something that could go wrong. Modern journalling file-systems handle this gracefully by making sure the file-system is in a consistent state when it comes back up.

The filesystem is flawed, plain and simple.
You'll realize why that one is a gem when you read her next quote. As the discussion continues, she begins to realize how far off the mark she is and begins to correct...

It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault.
Still off the mark, but perhaps she is beginning to figure out what a file system should offer and what the issue being discussed is.

If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples.
Go ahead and do that. But if you want to make sure your data is written, in case of a kernel crash or power outage, then you had better understand what is going on at the FS level.

As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter.
No, but you should understand the API of the language you are dealing with. Since when does a compiler handle disk I/O anyway? As for your interpreter, it is free to call fsync whenever it wants, but what has that got to do with the FS again?

Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.
The purpose of this quote is to demonstrate that she both has no regard for TFA and also has no idea what the issue being discussed is. I encourage anyone looking to give her mod points to actually RTFA and also do a bit of background reading on file systems and in particular delayed writes.

My point was and still is: if the data is not flushed to disk yet, it should either be accessible from the buffer, or not at all.
This sentence alone deserves a -1 Huh? If you do a write, and it is successful, then you can do a read on the same file and it will return what you wrote, whether or not it had been flushed to disk. This is the way it is supposed to work. Think about it for like 10 seconds and you'll begin to get it.

not supposed to have to worry about OS-specific details
WE ARE TALKING ABOUT UNEXPECTED KERNEL CRASHES AND POWER OUTAGES. If you care about that situation then you should get a clue before you start coding. If not, then what is the problem, or was it fault... er, sorry limitation?

One should not have to know about syncing to do something like a few simple file writes
And one doesn't need to if she is not concerned with the rare possibility that the system CRASHES OR LOSES POWER in the next few minutes.
Anyway, I've never called out another poster like this before and now I feel dirty.
Re:Bull by phantomlord · 2009-03-11 13:27 · Score: 2, Interesting

I just bought a new laptop that, unfortunately, came pre-installed with Vista. I spent the better part of the day creating settings by hand, tweaking this and that, to get things setup how I wanted them to be. I don't know of any handy way to copy my XP registry over from my old laptop to Vista on the new laptop(I could be wrong, I don't use windows for anything of importance so I haven't taken the time to learn all the power user tricks). That's to say nothing of all my application settings that were lost since they were written to the registry in my old laptop.

I installed Linux on it as well. You know what it took to copy over all of my settings and data?
cd /home
cp -a /mnt/nfs/home/user .

<sarcasm>That registry sure does make everything so much easier...</sarcasm> and that cp works even across different architectures, Linux distributions, etc.

--
Don't leave your mind so open that your brain falls out. Don't close it so much that you cut off the blood.
Re:Bull by vadim_t · 2009-03-11 13:54 · Score: 4, Interesting

That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.
Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.
There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.
Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.
Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
Even if the FS does like you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in fs-sized blocks. So your 10K sized file can be interrupted in the middle and get cut off at 4K in size after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.

Also, with a modern SATA disk supporting Native Command Queuing, the OS should immediately write the data to the disk's buffer, and the disk's firmware gets to decide about re-ordering.
NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:
1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update ondisk data first, journal second, leading to disk corruption. The hard disk can't magically figure out what's the right way to write the data so that it remains consistent, only the OS and the application can ensure that.
2. NCQ is limited to 32 commands anyway, the OS has to do handling on its own anyhow.

As for the argument about using sqlite - why have yet another abstraction? After all, the filesystem is already a sort of database!
Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and checking the return code of every system call, you can use something like sqlite that does it internally and saves you the effort, and returns one unique value that tells you whether the whole update worked or not.
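To illustrate, a minimal sketch using Python's standard sqlite3 module. The table layout and the function names save_settings and load_settings are assumptions made up for this example, not anything GNOME or KDE actually use. The whole update either commits or raises, in which case it is rolled back.

```python
import sqlite3

def save_settings(db_path, settings):
    """Store all key/value pairs in one transaction: all of them land, or none."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)"
        )
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                list(settings.items()),
            )
    finally:
        conn.close()

def load_settings(db_path):
    """Read every stored setting back into a plain dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT key, value FROM settings"))
    finally:
        conn.close()
```

The caller only has to check one thing: did save_settings return normally or raise. The fsync and journal handling happens inside the library.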
Re:Bull by slamb · 2009-03-11 14:25 · Score: 2, Interesting
RTFPS (Read The Fine POSIX Spec).
I've RTFPS (well, not quite - the Single Unix Specification; where do I find the Fine POSIX Spec free online?).
I am...dissatisfied with this answer because POSIX appears to provide so few guarantees that applications basically have to assume more than it promises to get anything done. The Linux documentation doesn't appear to promise anything more. For instance,
- If I create a new file and fsync it, am I guaranteed that it hit disk? (Hint: on Linux this isn't true according to the #ifdef linux block of this file. It says I must fsync the directory, and nothing in Posix even says it's possible to open() or fsync() a directory; you have to use opendir().)
- If I overwrite or append just a few bytes of an existing file and lose power before calling fdatasync(), what is guaranteed about the contents of the file? If you say "nothing", the only safe approach to updating anything is to write a complete replacement for the file, fsync() it (but pay attention to the special Linux case described above), and rename() it into place. Of course, that's a pretty significant performance hit and basically screws over any reasonable way of implementing shadow paging or write-ahead logging.
So...where is the specification that describes the filesystem's behavior in a useful way?
Re:Bull by slamb · 2009-03-11 14:32 · Score: 2, Insightful

Rewriting the same file over and over is known for being risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync.
Except on Linux you must sync the parent directory as well. None of this behavior is usefully documented anywhere, so it's upsetting when kernel developers tell application developers they're doing it wrong.
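For what it's worth, a minimal Python sketch of that extra step, assuming Linux (opening a directory like this does not work everywhere). The names fsync_directory and durable_rename are invented for illustration.

```python
import os

def fsync_directory(dir_path):
    """fsync a directory so entries created or renamed in it reach the disk."""
    fd = os.open(dir_path, os.O_RDONLY)   # a directory can be opened read-only
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def durable_rename(tmp_path, final_path):
    """Rename an already-synced file into place, then sync the parent directory."""
    os.replace(tmp_path, final_path)
    fsync_directory(os.path.dirname(os.path.abspath(final_path)))
```

The file's own data still has to be fsynced before the rename; this only covers the directory entry.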
Re:Bull by slamb · 2009-03-11 14:36 · Score: 2, Interesting

To clarify my own question:

If I overwrite or append just a few bytes of an existing file and lose power before calling fdatasync(), what is guaranteed about the contents of the file?

I'd like to know which of the unmodified bytes are guaranteed to be preserved. None of them? All of them? Ones not in the same block as new bytes? (And what's a block? Is it st_blksize, or is it possible that block size varies within the file or changes over time?)
Re:Bull by icebike · 2009-03-11 15:14 · Score: 2, Insightful

One way to test if your argument makes sense is to extend it to absurdity.
What if the FS NEVER wrote anything until a fsync was called?
All applications would then have to add these calls.
The net effect would be uncontrolled write management at the application level with no hope of IO management or optimization at the FS/OS level.
Is this what you propose? Is this technically correct? Be careful what you wish for.
If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.
So you see, I believe you are recommending something that is not in the best interests of the OS or the users in the long run. (However technically correct it might be at the moment). This functionality really does belong at the OS/FS level. I could go further and say it would be nice if it could be done at the hardware level. If disk drives could manage this by themselves it would be great. A write would get immediately sent to the disk, and it would cache as needed but never more than it could write with stored power after feed power fails.

--
Sig Battery depleted. Reverting to safe mode.
Re:Bull by poliopteragriseoapte · 2009-03-11 16:03 · Score: 2, Insightful

I think it is a brilliant idea to write less frequently to disk, and even 10 mins would not be bad. Much easier on power consumption, the drive can spin down, less wear on flash memory, etc. Coders who forget to positively flush critical data are just asking for problems. After all, what is the difference between 5 secs and 120 secs? Just 24 times. And if disaster can strike with high probability p, then p/24 is not much better.
Re:Bull by russotto · 2009-03-11 16:25 · Score: 3, Insightful

It is a KDE issue. Only userland knows which data is critical.

Data that userland applications WRITE TO DISK is critical. If the filesystem takes its sweet time about actually doing the write, it's not the application's fault. And no, calling fsync() or fdatasync() constantly is no good, because that really does make your performance poor.
Re:Bull by amirulbahr · 2009-03-11 16:30 · Score: 2, Informative

They are referring to the case when the system isn't shut down cleanly. This means a kernel crash or a power outage. What is your point exactly? Seriously, and I really am doing my best to hold back on the personal insults (even when you say something as annoying as "And calm down !!"), what is so difficult that you fail to comprehend what the real issue being discussed here is?
Re:Bull by DigiShaman · 2009-03-11 16:33 · Score: 4, Insightful

Apparently, Microsoft and Intel don't think so. You can enable write-caching in both the device manager (volume) and Intel's Matrix Storage Manager (RAID), but they will both provide respective warnings about doing so when not connected to a UPS.
Granted. Write-back caching is independent of the file system in use. However, both are based on the idea of "writing" the data, just not committing it until a later time. It's a trade-off that can be put on a sliding scale. The more often you commit the data, the less chance of data loss, at the expense of performance. The less often you commit the data, the greater your chances of data loss, but your performance improves. The key is finding the optimum balance that suits your needs.

--
Life is not for the lazy.
Re:Bull by mysidia · 2009-03-11 17:31 · Score: 2, Informative

5 seconds might reduce the probability of problems, but it doesn't make the assumption a non-bug.
That's like saying if my code has a buffer overflow in it, but if it's only by 5 bytes, everything's ok, whereas if it's by 150 bytes, I should panic...
One way to test if your argument makes sense is to extend it to absurdity.
And the result has absolutely no bearing on the issue. Extending 5 seconds to infinity is nothing like extending 5 seconds to 150 seconds.
If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.
On some systems you may actually find this to be the case. On certain kernels, certain hard drives had write cache, and sync() would not force the drive itself to flush its own cache, data could be in there for minutes, to be lost in the event of an untimely power failure..
Most applications handle this reasonably; maintain transactional integrity, and sync() when it is critical that a write finish on a timely basis, and in event of a crash, revert to the last 'good' state.
Transactional database software like PostgreSQL is exceptional at this, and it does use sync.
If you have a lot of critical data, the right place to put it is in a DBM, that will handle and manage syncing correctly and optimally for the OS.
If you have small amounts of critical data, then you write them to flatfiles, and sync. The small size of the files, and the small number of writes you do to them will make performance a non-issue.
Maintaining integrity of critical data requires a lot more than a good filesystem, and the ability to ensure data is sync'ed to disk.
Because even 5 seconds is non-zero, which is all the time in the world, if you leave the files on disk such that they would be corrupt or inconsistent (should the system crash at that moment)
Filesystems don't and never did totally relieve application developers of having to worry about what might (or might not) be written to disk by the OS.
Certainly it's unreasonable they make particular assumptions about the exact nature of the duration it takes, since there are so many filesystems available, including some unusual ones like NFS.
(void)sleep(5); after a write is not, and never was a substitute for fsync(); for assuring data is written before writing more.
Re:Bull by 7+digits · 2009-03-11 22:16 · Score: 2, Insightful

> I agree. This is not way to treat startup-critical configuration files.
This is bull. Most files are critical to someone. This would mean that most processes that write data must use fsync.
Are you arguing that cp should use fsync for every file it copies? In that case, you'd better tell the maintainers of coreutils-7.1, because copy_internal (used by cp.c) does not. (And you'll be laughed at.)
So, right, now, on ext4, the sequence:
> cp /disk1/file1.data /disk2/file1.data
wait a few seconds
> rm /disk1/file1.data
crash
will probably cause the file to be lost. That you choose to blame it on cp is funny, but most of the rest of the world will blame it on ext4.

Works as expected... by gweihir · 2009-03-11 09:16 · Score: 5, Insightful

The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.

Anyways, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody that develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power failure.

Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:Works as expected... by girlintraining · 2009-03-11 09:27 · Score: 5, Insightful

Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
You're right, there really is nothing to see here. Or rather, there's nothing left. As the article says, a large number of configuration files are opened and written to as KDE starts up. If KDE crashes and takes the OS with it (as it apparently does), those configuration files may be truncated or deleted entirely -- the commands to re-create and write them having never been sync'd to disk. As the startup of KDE takes longer than the write delay, it's entirely possible for this to seriously screw with the user.
The two problems are:
1. Bad application development. Don't delete and then re-create the same file. Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System, whatever the four color glossy told you.
2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (drivers and other code operating in privileged space are obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.

--
#fuckbeta #iamslashdot #dicemustdie
Re:Works as expected... by gweihir · 2009-03-11 09:45 · Score: 3, Insightful

I agree on both counts. Some comments
1) The right sequence of events is this: Rename old file to backup name (atomic). Write new file, sync new file and then delete the backup file. It is however better for anything critical to keep the backup. In any case an application should offer to recover from the backup if the main file is missing or broken. To this end, add a clear end-mark that makes it possible to check whether the file was written completely. Nothing new or exciting, just stuff any good software developer knows.
2) Yes, a kernel should not crash. Occasionally it happens nonetheless. It is important to notice that ext4 is blameless in the whole mess (unless it causes the crash).
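
The sequence in point 1 could be sketched roughly as follows in Python. This is only an illustration under assumptions: the ".bak" suffix, the END_MARK trailer and all three function names are made up here, and the backup is kept rather than deleted, as suggested for critical data.

```python
import os

END_MARK = b"\n#END\n"  # hypothetical end-mark; any unambiguous trailer would do


def save_with_backup(path, payload):
    """Rename the old file to a backup name, then write and sync the new file."""
    backup = path + ".bak"              # hypothetical backup naming scheme
    if os.path.exists(path):
        os.replace(path, backup)        # atomic rename of the old file
    with open(path, "wb") as f:
        f.write(payload + END_MARK)
        f.flush()                       # move Python's buffer into the kernel
        os.fsync(f.fileno())            # ask the kernel to push it to the device
    # The backup is deliberately left in place for later recovery.


def is_complete(path):
    """True if the file exists and ends with the end-mark."""
    try:
        with open(path, "rb") as f:
            return f.read().endswith(END_MARK)
    except FileNotFoundError:
        return False


def load_with_recovery(path):
    """Return the payload from the main file, or from the backup if the
    main file is missing or was not written completely."""
    for candidate in (path, path + ".bak"):
        if is_complete(candidate):
            with open(candidate, "rb") as f:
                return f.read()[: -len(END_MARK)]
    raise FileNotFoundError("no complete copy of " + path)
```

The end-mark check is what lets the reader tell a half-written main file from a good one without trusting the file system's timing.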

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Works as expected... by kasperd · 2009-03-11 11:44 · Score: 4, Insightful

Write new file into a temp file, sync, whatever you need to do. When you're done, delete original and rename the temp to the original's name.
That's an improvement, but it can be made even safer by skipping the delete step. Once the new file is created just rename it on top of the original. The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
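
For illustration, the rename-on-top approach might look like this in Python. It is a sketch, not anyone's actual implementation: atomic_write is a hypothetical helper name, the sync step is kept because the replies in this thread report that ext4 could commit the rename before the data, and the directory fsync at the end assumes a POSIX system.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` so that the name refers to either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system for rename to be atomic.
    # Note: mkstemp creates it with mode 0600, which may differ from the original.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())        # get the data out before the rename
        os.replace(tmp, path)           # rename() over the original, no delete step
    except BaseException:
        os.unlink(tmp)                  # clean up the temp file on any failure
        raise
    # Sync the directory as well so the rename itself is more likely to survive a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether the fsync before the rename should be needed is exactly what this subthread argues about; the sketch simply takes the conservative route.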

--

Do you care about the security of your wireless mouse?
Re:Works as expected... by moonbender · 2009-03-11 13:57 · Score: 2, Interesting

The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
As I understand it, that is EXACTLY what happens. The move/relinking is committed, but the data isn't. If true, a real case of WTF. The relinking should only be executed AFTER the data has been committed to the drive.

--
Switch back to Slashdot's D1 system.

Re:If in other "modern" filesystems.... by internerdj · 2009-03-11 09:19 · Score: 3, Insightful

It is a trade-off between reliability and performance. In this case, older != better either. A lot of OS design decisions are trade-offs.

Classic tradeoff by Otterley · 2009-03-11 09:26 · Score: 5, Insightful

It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.

Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.

Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.
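
As a rough illustration of the O_SYNC suggestion in this comment, a Python sketch on a Unix-like system could look like the following. The function name write_synchronously is hypothetical, and as the replies point out, the drive's own write cache can still hold the data after the call returns.

```python
import os


def write_synchronously(path, data):
    """Write `data` with O_SYNC so each write() returns only after the kernel
    has handed the data to the storage device."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

This trades speed for the guarantee on every single write, which is why it tends to be reserved for data that really cannot be lost.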

Re:Classic tradeoff by imsabbel · 2009-03-11 09:36 · Score: 3, Informative

It's even WORSE than just being asynchronous:
EXT4 reproducibly delays write ops, but commits journal updates concerning this write.

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Re:Classic tradeoff by slashdotmsiriv · 2009-03-11 09:44 · Score: 2, Interesting

Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.
You also have to disable HDD caching, e.g., using
hdparm -W0 /dev/hda1
Re:Classic tradeoff by gweihir · 2009-03-11 10:00 · Score: 2, Insightful

Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.
You also have to disable HDD caching, e.g., using
hdparm -W0 /dev/hda1
Well, yes, but unless you have an extreme write pattern, the disk will not take long to flush to platter. And this will only result in data loss on power failure. If that is really a concern, get a UPS.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Classic tradeoff by legirons · 2009-03-11 10:08 · Score: 2, Funny

It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.
Backups redirected to /dev/null, run much faster... ;)

Re:Exactly by TerranFury · 2009-03-11 09:26 · Score: 5, Insightful

Meh, this is crap that happens only when the system crashes, and is pretty much unavoidable if you're doing a lot of caching in memory -- which, coincidentally, is what you need to do to maximize performance. This doesn't sound like the filesystem's "fault" or the application's "fault;" it's just the way things are. Everybody knows that if you don't cleanly unmount, most bets are off.

Theory doesn't matter; practice does by microbee · 2009-03-11 09:28 · Score: 3, Interesting

So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.

But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.

EXT4 aims to replace EXT3 and become the next-gen de facto filesystem on the Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.

It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.

Re:Theory doesn't matter; practice does by caerwyn · 2009-03-11 10:02 · Score: 5, Insightful

This is the attitude that has the web stuck with IE.
There's a standard out there called POSIX. It's just like an HTML or CSS standard. If everyone pays attention to it, everything works better. If you fail to pay attention to it for your bit (writing files or writing web pages), it's not *my* fault if my conforming implementation (implementing the writing or the rendering) doesn't magically fix your bugs.

--
The ringing of the division bell has begun... -PF
Re:Theory doesn't matter; practice does by somenickname · 2009-03-11 10:18 · Score: 2, Insightful

"The machine crashed" isn't a common situation. In fact, it's a very, very rare situation.
Re:Theory doesn't matter; practice does by microbee · 2009-03-11 10:50 · Score: 3, Insightful

Apparently, you don't know real life.
Does POSIX tell you what happens if your OS crashes? That's right, it says "undefined". Oops, sorry, it's too hard a problem and we'll just leave it to you OS implementers.
Asking everyone to use fsync() to ensure their data is not lost is insane. Nobody wants to pay that kind of performance penalty unless the data is very critical.
Normal applications have a reasonable expectation that the OS doesn't crash, or doesn't crash too often for this to be a big problem. However, shit happens, and people scream loud if their data is lost BEYOND reasonable expectations.
Forget POSIX. It's irrelevant in the real world. It's exactly this pragmatic attitude that brought Linux to its current state.
Re:Theory doesn't matter; practice does by caerwyn · 2009-03-11 11:52 · Score: 3, Insightful

Apparently, you don't know how to *deal* with real life.
POSIX *does* tell you what happens if your OS crashes. It says "as an application developer, you cannot rely on things in this instance." It also provides mechanisms for successfully dealing with this scenario.
As for fsync() being a performance issue, you can't have your cake and eat it too. If you don't want to pay a performance penalty, you can lose data. Ext4 simply only imparts that penalty to those applications that say they need it, and thereby gives a performance boost to others who are, due to their code, effectively saying "I don't particularly care about this data" - or more specifically, "I can accept a loss risk with this data."
Normal applications have a reasonable expectation that the OS doesn't crash, yes. And usually it doesn't. Out of all the installs out there... how often is this happening? Not very. They've made a performance-reliability tradeoff, and as with any risk... sometimes it's the bad outcome that occurs. If they don't want that to happen, they need to take steps to reduce that risk- and the correct way to do that has always been available in the API.
As for forgetting POSIX... it's the basis of all unix cross-platform code. It's what allows code to run on linux, BSD, Solaris, MacOS X, embedded platforms, etc, without (mostly) caring which one they're on. It's *highly* relevant to the real world because it's the API that most programs not written for windows are written to. Pull up a man page for system calls and you'll see the POSIX standard referenced- that's where they all came from.
Saying "Forget POSIX. It's irrelevant in the real world." is like people saying a few years ago "Forget CSS standards. It's irrelevant in the real world." And you know what? That's the attitude that's dying out in the web as everything moves toward standards compliance. So it is in this case with the filesystem.

--
The ringing of the division bell has begun... -PF

Excuses are false. This is a severe flaw. by rpp3po · 2009-03-11 09:28 · Score: 3, Interesting

There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!

Re:Excuses are false. This is a severe flaw. by Anonymous Coward · 2009-03-11 09:57 · Score: 2, Informative

> Delayed writes should lose at most any data between commit and actual write to disk.
And that's exactly what ext4 does.
Application decides to update some file:
1) Reads the file
2) Modifies the buffer as needed
3) Truncates the file
4) Writes the buffer to the file
Now, if the filesystem commit happens right between 3 and 4, the truncation hits the disk, but the new content does not (yet). If a crash happens before the next commit, all that remains is the truncated file.
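
The four steps above can be reproduced in a few lines of Python to make the window between truncation and write visible. This is a sketch only: risky_update and its observe callback are hypothetical names, and the callback exists purely to report the file size at the point where a crash would hurt.

```python
import os


def risky_update(path, transform, observe=None):
    """Read, modify, truncate, write: the update pattern described above."""
    with open(path, "rb") as f:              # 1) read the file
        buf = f.read()
    buf = transform(buf)                     # 2) modify the buffer
    f = open(path, "wb")                     # 3) truncate (open with O_TRUNC)
    try:
        if observe is not None:
            # Size as seen between steps 3 and 4: the old content is already gone.
            observe(os.path.getsize(path))
        f.write(buf)                         # 4) write the buffer
    finally:
        f.close()
```

Calling it with observe=print on a non-empty file prints 0, which is the state a commit followed by a crash would leave behind.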
Re:Excuses are false. This is a severe flaw. by Anonymous Coward · 2009-03-11 10:11 · Score: 4, Informative

Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
You seem to misunderstand that's *exactly* what is happening.
KDE is *DELETING* all of its config files, then writing them back out again in two operations.
Three states now exist, the 'old old' state, where the original file existed, the 'old' state, where it is empty, and the 'new' state where it is full again.
The problem is getting caught between step #2 and step #3, which on ext3 was mostly mitigated by the write delay being only 5 seconds.
KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.
Re:Excuses are false. This is a severe flaw. by rpp3po · 2009-03-11 10:29 · Score: 2, Insightful

That's not true. KDE is not "*DELETING*" any of its files. It's just opening them with the O_TRUNC flag (expressing an intent to overwrite their contents). That's perfectly safe for copy-on-write filesystems (such as ZFS) but not for ext4. So calling all "modern" filesystems at risk is pure ignorance. Ext4 could delay content deletion of open files until write time and write both within a single transaction.
Re:Excuses are false. This is a severe flaw. by macshit · 2009-03-11 10:36 · Score: 2, Informative

ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
I read the FA, and it actually really does look like the applications are simply using stupidly risky practices:
These applications are truncating the file before writing (i.e., opening with O_TRUNC), and then assuming that the truncation and any following write are atomic. That's obviously not true -- what happens if your system is very busy (not surprising in the startup flurry which is apparently where this stuff happens), the process doesn't get scheduled for a while after the truncate (but before the write), and the system happens to crash in that interval?
I'm as lazy as they get, but even I know enough not to do that kind of crap...
There's probably some way the FS could finesse this issue -- e.g., don't actually schedule truncation until you see the first write or close -- but it would be a workaround for buggy applications, not a FS bugfix.

--
We live, as we dream -- alone....
Re:Excuses are false. This is a severe flaw. by Tadu · 2009-03-11 11:22 · Score: 5, Informative

KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.
Nope, it writes a new file and then renames it over the old file, as rename() is documented to be an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.

Re:If in other "modern" filesystems.... by CannonballHead · 2009-03-11 09:31 · Score: 3, Insightful

I'll take "I didn't lose my data" over "ext4 runs 1.5x faster than ext3," thank you. What use is performance to me if I have to be absolutely certain that it won't crash, or I lose my (in my very high performance filesystem) data?

Also, ext4 is touted as having additional reliability checks to keep up with scalability, etc... not as being less reliable for the sake of performance.

Reliability

As file systems scale to the massive sizes possible with ext4, greater reliability concerns will certainly follow. Ext4 includes numerous self-protection and self-healing mechanisms to address this.

(from Anatomy of ext4)

I can only imagine the response if tests were done on Windows 7 beta that showed a crash after this or that resulted in loss of data. :)

Re:Exactly by gweihir · 2009-03-11 09:34 · Score: 5, Insightful

The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.

True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.

What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:If in other "modern" filesystems.... by internerdj · 2009-03-11 09:37 · Score: 2, Insightful

Thing is that ext3 is using the same strategy on a smaller scale. The same argument could be made to say that 3 seconds is far too long to be out of date. How many instructions are you going to run in 3 seconds? Defects run at 5-8 per kloc on average. Certainly not all are fatal, but how long of a delay is too long to avoid a potentially fatal defect? Obviously the delay they have chosen is too long, but is the performance hit that ext3 takes for having a 3 second delay rather than a 5 or 10 or 15 second delay worth it?

not mounted sync,dirsync? by dltaylor · 2009-03-11 09:40 · Score: 4, Interesting

When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion of the overall file system activity, or you might as well run on a ramdisk.

While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.

BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.

Re:not mounted sync,dirsync? by Dog-Cow · 2009-03-11 23:27 · Score: 2

Right. It's the rant of a fucked-up asshole. If the developer does not use a mechanism that GUARANTEEs writes to disk, how the fuck is it anyone else's fault? It isn't, you brain-damaged idiot.

Translation by microbee · 2009-03-11 09:50 · Score: 3, Insightful

We use techniques that show great performance so people can see we beat ext3 and other filesystems.

Oh shit, as a tradeoff we lose more data in case of a crash. But it's not our fault.

Honestly, you cannot eat your cake and have it too.

Actually, no. by Jane+Q.+Public · 2009-03-11 10:01 · Score: 2

As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.

Re:Actually, no. by muridae · 2009-03-11 10:18 · Score: 3, Insightful

As a user of high-level languages, do not directly access the I/O API without knowing what it does. Use a higher level wrapper that properly interacts with the low level functions, and does all of the fsync and similar calls for you.

If those high level wrappers do not exist, then do not blame the API developers for you not knowing how they work.
Re:Actually, no. by TheRaven64 · 2009-03-11 10:32 · Score: 2, Interesting

As a user of a framework that doesn't suck, I don't have to worry about this problem. When I need to write a file in such a way that the entire operation either succeeds, or the entire operation fails (a common requirement), the framework I use provides a flag that I can set on the write operation to do all of the write/rename juggling that needs to happen, according to POSIX, to make it work. As such, my code will work happily on any filesystem that doesn't break the spec.
If you are using a high-level language with a low-level framework, you might want to reconsider your approach.

--
I am TheRaven on Soylent News
Re:Actually, no. by msuarezalvarez · 2009-03-11 11:17 · Score: 2, Insightful

If a programmer is using a file API with POSIX semantics in any non-trivial way, and is not aware of the fact that POSIX does not specify any assurances that data will be written to the device unless fsync is called or another similar action is taken, then that programmer is *not* competent.

Alarmist and ignorant article - not a "problem" by ivoras · 2009-03-11 10:07 · Score: 4, Insightful

*No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out, except if the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason"; most actually guarantee only that the file system will recover itself, not the data). It is universally true: for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.

The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.

This is an issue of great sensitivity for databases. See for example:

That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.

--
-- Sig down

Re:Why SHOULD applications have to assume bad FSs? by caerwyn · 2009-03-11 10:11 · Score: 2, Informative

Nothing- except that it's not in the spec.

POSIX is like a contract. KDE is breaking the contract and then whining about it to ext4- which isn't breaking the contract. Just as in a court, KDE here doesn't have much of a leg to stand on.

--
The ringing of the division bell has begun... -PF

Re:Why SHOULD applications have to assume bad FSs? by gweihir · 2009-03-11 10:12 · Score: 3, Informative

What's wrong with "After a file is closed, it's synced to disk"?!?

What, you want people to have to delay/stagger/coordinate their file closes in order to avoid overloading the filesystem? That is the wrong approach. close() just means that the application is done with the file. The sync calls are not a joke, they are there precisely for the reason that close() already has entirely sensible but different semantics. Anybody that wants close also to sync can code it that way without problem. Anybody else probably does not want this behaviour in the first place.

This is not hidden in any way. A simple "man close" not only warns of this, it also refers the reader to the fsync call. Anybody getting bitten by this did not do their homework.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

man 2 fsync by Nicolas+MONNET · 2009-03-11 10:23 · Score: 5, Informative

The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.

Re:man 2 fsync by setagllib · 2009-03-11 13:46 · Score: 2, Interesting

No, disk caching is now considered the default. Nothing is written until the disk decides it is time, and this is completely up to them. It doesn't even have to occur in the same order the writes were issued in, especially with TCQ.

--
Sam ty sig.

Re:Exactly by gweihir · 2009-03-11 10:42 · Score: 4, Insightful

"And let's face it: fsync/fdatasync are not really a secret to any competent developer."

I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.

And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.

I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:To Anonymous Coward: by Bronster · 2009-03-11 10:44 · Score: 4, Informative

mount -o sync. Enjoy your slow returns and strictly ordered writes.

Things that should be improved ... by hattig · 2009-03-11 10:47 · Score: 2, Interesting

Bah. Maybe all computers should come with a single-cell battery, for a couple of minutes of backup power.

As soon as power fails to the system and it resorts to battery, all calls to write() should also call fsync(), even if that slows the system down.

Never mind an option that implicitly calls fsync() if it hasn't been called in the past 3 seconds, for a minimal performance hit. If you have a specific application that doesn't want fsync() then you can disable that feature, but clearly on a consumer box, no UPS, potentially dodgy hardware and drivers, it makes sense. 150 seconds without a sync, just dumping into a buffer for writing ... sheesh.

Re:A Windows-like registry can not be the answer. by billcopc · 2009-03-11 10:50 · Score: 2

Unix philosophy is to make configuration files user- and script-editable. NOT to create hundreds of files per app making it utterly unmanageable.

--
-Billco, Fnarg.com

Re:Exactly by Dog-Cow · 2009-03-11 11:17 · Score: 2, Insightful

You are an idiot. The design of the POSIX API dictates that fsync (or equivalent) is required to ensure data is flushed to disk. This has been true forever. If an abstraction in an i/o library is not using the API correctly, it is the fault of the library.

You are correct that the user of the abstraction should not care, but you are putting the blame in the wrong place. The whole point of using an abstraction is to hide details such as this. If the library author is too stupid to learn the API he is abstracting that is HIS fault.

Re:Bull... by Anonymous Coward · 2009-03-11 11:26 · Score: 2, Insightful

Optimize the reads all you want, but those writes better damn well happen before the calls that say data is written return.

And this is where most of the confusion comes from. There is a difference between a logical write and a physical write. When your write call completes, it says the logical write has completed. It says nothing about the physical write. Depending on file system semantics, your physical write may have already completed too - or may complete shortly after. If you must ensure the physical write is complete then you must ensure it explicitly via code - otherwise the physical write can only be assumed. And this is where the less informed seem confused by their own poor expectations and ignorance. Unless they are actually following their write with some sort of file system synchronization call, they have no right whatsoever to assume the data will still be there in the face of a system crash. It's a very poor coder who falls into that trap.

Good programmers know this and have known it for tens of years. Good database programmers know this. Good file system developers know this. Those that are outraged by their own ignorance are either not programmers or are not good programmers.

And lastly, I'll point out, which is exactly why Tso pointed it out - use a solution whose foundation is built by coders who already understand the proper way to ensure data is safe on the file system - for example, use a database. While I don't consider the use of a database to be an ideal solution here, it does a wonderful job of highlighting the crappy design both KDE and GNOME have used to store configuration data - and how unconcerned they are about data loss and data corruption. If the developers of KDE and GNOME don't give a crap about your configuration data then how on earth can you possibly be upset at the file system for doing what it's supposed to do?

In short, both KDE and GNOME need to give a crap about how, when, and why they write configuration data. Since they don't care about data integrity, you now know who you should be angry at. Here's a hint, and it doesn't have anything to do with the file system.
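
The logical versus physical write distinction drawn in this comment has one more layer in a buffered I/O library. A small Python sketch, with staged_write as a hypothetical name, shows the stages; note that the reported sizes only reveal what the kernel knows, not what has reached the platter, which is the whole point of the fsync() call.

```python
import os


def staged_write(path, data):
    """Return the file size the OS reports after each stage of a write."""
    sizes = {}
    with open(path, "wb") as f:
        f.write(data)                         # data sits in the library's buffer
        # For a payload smaller than the buffer this is typically still 0.
        sizes["after write"] = os.path.getsize(path)
        f.flush()                             # logical write: kernel cache has it
        sizes["after flush"] = os.path.getsize(path)
        os.fsync(f.fileno())                  # request the physical write
        sizes["after fsync"] = os.path.getsize(path)
    return sizes
```

Only the last stage says anything about surviving a crash, and even that is subject to the drive cache caveats raised elsewhere in the thread.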

Mod parent up by betterunixthanunix · 2009-03-11 11:36 · Score: 2, Informative

Much as I love to fall back on the "POSIX says that this could be the case so it is OK that it is the case" excuse, it really does not fly in this case. POSIX doesn't allow this sort of behavior because it is a "good" thing to do, it allows it because there are systems where this is an OK thing to do -- systems intended to manage databases, systems that are heavily verified and have backup power supplies, etc. This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth. EXT4 should not be used in a desktop system if it can cause data loss when the unexpected happens, regardless of the technical merits of writing to small configuration files.

--
Palm trees and 8

Re:To Anonymous Coward: by LWATCDR · 2009-03-11 11:44 · Score: 2, Insightful

It isn't a file system limitation. And here is why.
1. The POSIX standard specifies that writes may be delayed. Every modern file system may delay writes.
2. The POSIX standard then gives you a way to flush the buffer at the time of the program's choosing. It is called fsync(). If the programmer had called that well-documented function then all would have been well.
You have the best performance possible and you can ensure that the file is flushed before you do something else.
The file system didn't cause this bug. The POSIX spec didn't cause this bug. The programmer that didn't use the tools as documented caused his own bug.
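
To make the fsync() point concrete, here is a minimal sketch in Python, whose os.fsync() wraps the POSIX call on Unix-like systems. The function name write_and_flush and the sample file contents are made up for illustration; they come from no application discussed in this thread.

```python
import os
import tempfile


def write_and_flush(path, data):
    """Write bytes to path and ask the kernel to commit them to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to push the file to the device
    # Leaving the with-block only closes the file; close() alone does not
    # force anything to disk.
```

Calling write_and_flush(some_path, b"theme=dark\n") returns only after fsync() has reported success. What that guarantees still depends on the drive and controller honouring the flush.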

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

Re:Top down reliability? by Qzukk · 2009-03-11 11:52 · Score: 2, Informative

change their applications because a new version of the file system breaks their stuff is madness

Their applications were already broken, committing everything every 5 seconds* regardless of what the applications had wanted was the workaround in ext3, but I guess it's only madness when street-makers demand that you drive with round wheels, not when you demand that street-makers accommodate your square ones.

* Unless you increased the commit time to reduce power usage (eg laptop_mode)

--
If I have been able to see further than others, it is because I bought a pair of binoculars.

Learned something today by drolli · 2009-03-11 12:08 · Score: 2, Informative

Citing from the message Ts'o posted:

----
So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.
----

And indeed, the NOTES section of "man -S2 close" explicitly notes what is not mentioned in the other sections. Up to this day I also lived under the assumption that a close implies an fsync. Now I have to change my programs where it matters. All the Idiots who scream here that the OS is doing something wrong: no, it's not. AFAIU it's following the defined behaviour, which is what I expect an OS to do. It should NOT try to magically guess where I forgot to fsync my files.

Exactly. by aussersterne · 2009-03-11 12:40 · Score: 4, Insightful

People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.

Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:

Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.

It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.

It's the bad outcomes that we're trying to mitigate by having a spec in the first place!

So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.

The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.

--
STOP . AMERICA . NOW

Re:Exactly. by RAMMS+EIN · 2009-03-11 22:35 · Score: 3, Insightful

``It's not enough just to be true to spec;''
Yes, it is. That way, you get what the spec says you get.
It can even be argued that doing better than the spec is dangerous. After all, that is what got us this riot: things doing more than the spec said, people relying on that, and then getting angry when another implementation of the spec didn't have the same additional features.
You can only assume that you get what the spec says you get. If you assume more, it's your problem if your assumptions are wrong. If you want more than the spec gives you, you either need to implement it yourself or get a new spec implemented.
``the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.''
I don't think anyone jumped through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes. I think they jumped through hoops to get the best possible performance, while still being conformant to the spec. If this breaks applications that rely on behavior that isn't in the spec, it's because those applications are buggy.
``It's the bad outcomes that we're trying to mitigate by having a spec in the first place!''
I agree completely. But we seem to differ in how this is supposed to work.
I say that specifications can be used to avoid bad results by specifying exactly what can be relied on. Everything that is not in the specification is unspecified and thus cannot be relied on. Knowing this helps you write better software, because you know what you can assume, and what you have to write code for.
You seem to be saying that having a specification means we want to avoid bad results, so whomever implements the specification must do their best to avoid bad results, no matter what it says in the specification. I find that completely unreasonable.

--
Please correct me if I got my facts wrong.

Re:Exactly. by cheater512 · 2009-03-11 23:11 · Score: 2, Insightful

And is that the woosh from what actually went wrong going over your head?

Re:Exactly. by TemporalBeing · 2009-03-12 08:03 · Score: 2, Insightful

The spec is a set of necessary conditions but for many people it will be a sufficient set, they expect a filesystem to be as bulletproof as possible in every situation.
The spec in any design is the final authority.

For example, if the spec for a bridge crossing a river says that the bridge ought to hold 20 tons of weight, then it must do at least that. If the bridge collapses because you put 20 tons on and a speck of dust landed on top of it, then it doesn't matter - it still held to spec. If you were able to get 40 tons on before it collapsed all the better, but you were only ever guaranteed by the spec (and thus the designers) 20 tons.

If the spec for an engine said it could handle 8k RPM and it blew up at 8001 RPM, it was in spec. If you managed to get it to 9k RPM great, but you were only guaranteed 8k RPM.

That doesn't mean you don't build tolerance into the spec - e.g. 8k RPM +/- 5% - or try to exceed it where it makes sense, e.g. delivering 25 tons to ensure you have 20 tons and some leeway for safety. (After all, stupid is as stupid does.)

However, you can't fault the designers or engineers when the device lives up to spec and breaks because you (as the user) tried to exceed the spec and it failed.

Same goes for software. If the software spec says "provides A at rate B" then you better expect that and nothing more. If you need something different, then find a device (or API or file system, etc) that meets your requirements.

Pushing something beyond spec is not the problem of the spec designers - but of the users of the spec that expect it to exceed the spec.

And, btw, specs that supposedly are "minimum" standard specs are still specs just the same. They allow a certain minimum that (with software) allows portability; if you want to do better you still need to find another spec that supports what you want to do. For example: POSIX guarantees a portability between Unix and Unix-like OS's; but if you want to do better than POSIX then you use the Linux POSIX spec or the Solaris POSIX spec (or BSD POSIX spec, etc.). You get what you want, but at the cost of some portability. Failing to do that is the failure of the user of the spec, not the writers of the spec.

And just to be clear - by "user of the spec" I do not mean the people implementing the spec but the people using the software (or device) that implements the spec. In this case, not the implementors of ext3 or ext4, but the implementors of the software going above the ext3/4 spec to do something else.

Furthermore, the spec exists as a measurement to be able to tell when you've completed your job. If the spec says 10 tons and you get 11 tons you've finished the job; if you're only getting 9.99 tons you're not done. If you get 10.00001 tons you've got it. If it says 10 tons +/- 5%, then you might be done at 9.99 tons, but you really should go for the 10 tons + 5% just to be safe. Either way - once you've met the spec you're done. That doesn't mean you don't try to improve the spec and then make a better product; but there's no guarantee that will happen - the spec is the spec, and that's all you have to do - it's all you agreed to do to start with. (Think of it like a contract.)

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)

Re:Bad defaults by 0123456 · 2009-03-11 12:55 · Score: 2, Interesting

The old defaults were: 5 seconds in ext3, in NTFS metadata is always and data flushed asap with but no guarantees. In practice, people don't lose huge amount of work.

Actually, I've lost multi-gigabyte files on NTFS; in one particular case I left IE downloading a game installer overnight, heard it beep around 8am to tell me it had completed, and then the power went out a couple of hours later before I got up. The file system was magically 'consistent' after the power came back and it rebooted, but it achieved that by deleting over two gigabytes of my data.

Modern file systems may be a bit faster than FAT32, but they're shit when it comes to reliably storing data.

In this case, yes, the KDE developers are retarded, but if the ext4 developers want ext4 to become the default filesystem for Linux, they need to make it work with retarded developers. 'But POSIX says we can do this' is worthless if it loses large amounts of user data; heck, you can easily guarantee 'file system consistency' by simply reformatting the disk on every reboot, but your users would be pretty damn pissed.

Re:amirulbahr: by amirulbahr · 2009-03-11 13:32 · Score: 2, Informative

I assure you it is you who has mis-understood the situation. From the bug report referenced in the summary:

Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes. For example Plasma and some of the KDE core config files were reset. Also some of my MySQL databases were killed...

My EXT4 partitions all use the default settings with no performance tweaks. Barriers on, extents on, ordered data mode..

I used Ext3 for 2 years and I never had any problems after power losses or system crashes.

The crash was not caused by ext4 but by something else. The file system was in a consistent state because of the journal. Some data had not yet been written to disk, because of the delayed write and was thus lost.

Maybe you need to take a break, or have a coffee, or get some sleep or something. But you really are way off and posting way too much on this topic that you are not well informed of.

This is not a bug, not a flaw, not a limitation. You can write and then read regardless of whether or not actual disk commits take place. The file system takes care of that for you. If you're doing file I/O, and you want to call yourself half-way competent, then you should have some clue about the possibility that the underlying file system will be doing delayed writes. If you are writing critical applications for which this may cause issues then you might decide to throw in some fsync calls (or their equivalent in whatever platform you are using).
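
The "write and then read" claim can be sketched as follows. This is an illustrative Python snippet, and write_then_read is a hypothetical helper name. It deliberately never calls fsync(), yet on a normally functioning system the read sees the new bytes, because both calls go through the kernel's cache. It says nothing about what survives a crash.

```python
import os
import tempfile


def write_then_read(path, data):
    """Write without fsync, then read the same path back via a new handle."""
    with open(path, "wb") as f:
        f.write(data)        # closing hands the bytes to the kernel, not to disk
    with open(path, "rb") as f:
        return f.read()      # read back by the running system; no disk commit is required for this
```

Visibility to other readers and durability across a power loss are separate properties. Delayed writes only affect the second.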

I know you have learnt something today. Glad to help out.

Re:To Anonymous Coward: by Jane+Q.+Public · 2009-03-11 13:33 · Score: 2, Insightful

As it turns out, the point is probably moot. As someone else has pointed out, the bug report itself (not TFA) makes it clear that the trashed data was, in fact, caused by a system crash and not by filesystem access per se. TFA and the headline both strongly implied otherwise, but as it turns out, this is a non-issue.

Re:To Anonymous Coward: by swillden · 2009-03-11 13:35 · Score: 2, Informative

It most definitely is a filesystem limitation.

No, it's not. The file system is perfectly capable of making sure all your writes hit the disk as soon as possible.

Just mount it with the 'sync' option.

If you want the significant performance benefits of delayed writes, however, you should not use 'sync' and accept that, with Ext4, write() works the way the documentation says it does.
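
For a single file rather than a whole mount, a rough analogue is to open it with the O_SYNC flag. To be clear, this is not the 'sync' mount option described above but a related per-file technique, sketched here in Python on the assumption of a Unix-like system where os.O_SYNC is available. The helper name write_synchronously is invented for the example.

```python
import os
import tempfile


def write_synchronously(path, data):
    """Write data through a descriptor opened with O_SYNC (Unix-only flag)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # With O_SYNC each write() returns only after the kernel reports
            # the data as committed, which is why this is slow.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

The trade-off is the one described above: every write pays the cost of a commit, so this only makes sense for small, rarely written files.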

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.

But can you do it without looking? by clarkn0va · 2009-03-11 15:20 · Score: 2, Funny

My core2quad machine with 3 SATA disk RAID runs for about 20 minutes on a tiny APC UPS I bought from newegg for less than $100.

Sure, but that's assuming you can save your work in all open applications without power to your display. Me, I like a UPS with a little more juice so I can reap the fullness of my 52" plasma while cleaning up and shutting down.

--
I am literally 3000 tokens away from the chaotic crossbow --Stephen

rename and fsync by DragonHawk · 2009-03-11 15:34 · Score: 3, Insightful

"Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. "

Two things are happening:
(1) KDE is writing a new inode.
(2) KDE is renaming the directory entry for the inode, replacing an existing inode in the process.

KDE never calls fsync(2), so the data from step one is not committed to be on disk. Thus, KDE is atomically replacing the old file with an uncommitted file. If the system crashes before it gets around to writing the data, too bad.
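
The two steps, plus the missing fsync(2), can be sketched like this. It is an illustrative Python version of the write-temp-then-rename pattern, not KDE's actual code. The name atomic_replace and the ".tmp-" prefix are assumptions made for the example, and the directory fsync at the end assumes a Unix-like system.

```python
import contextlib
import os
import tempfile


def atomic_replace(path, data):
    """Replace path so that a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: write a new inode in the same directory as the target.
    # Note: mkstemp creates the file with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # commit the data BEFORE the rename
        # Step 2: atomically point the directory entry at the new inode.
        os.replace(tmp_path, path)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)     # do not leave a stray temp file behind
        raise
    # Also sync the directory so the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Dropping the os.fsync() call on the temp file gives the sequence described above: the rename can reach the disk before the data does.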

EXT4 isn't "broken" for doing this, as endless people have pointed out. The spec says if you don't call fsync(2) you're taking your chances. In this case, you gambled and lost.

KDE isn't "broken" for doing this unless KDE promised never to leave the disk in an inconsistent state during a crash. That's a hard promise to keep, so I doubt KDE ever made it.

A system crash means loss of data not committed to disk. A system crash frequently means loss of lots of other things, too. Unsaved application data in memory which never even made it to write(2). Process state. Service availability. Jobs. Money. System crashes are bad; this should not be news.

The database suggestion some are making comes from the fact that if you want on-disk consistency *and* good performance, you have to do a lot of implementation work, and do things like batching your updates into calls to write(2) and fsync(2). Otherwise, performance will stink. This is a big part of what databases do.
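
A toy illustration of the batching idea, in Python: rather than one fsync per record, collect the records and pay for a single commit. The function name append_batch and the newline-delimited record format are assumptions for the sketch. A real database does far more (journaling, recovery, space management).

```python
import os
import tempfile


def append_batch(path, records):
    """Append all records with one buffered write and a single fsync."""
    payload = b"".join(record + b"\n" for record in records)
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # one commit covers the whole batch
    return len(records)
```

With N records this issues one fsync() instead of N, which is where most of the performance difference comes from.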

As someone else suggested, it's perfectly easy to make writes atomic in most filesystems. Mount with the "sync" option. Write performance will absolutely suck, but you get a never-loses-uncommitted-data filesystem.

--

dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.

Re:rename and fsync by QuoteMstr · 2009-03-11 17:36 · Score: 2, Informative

Telling application developers to use a database is bullshit. The filesystem is a database, albeit not a relational one. An open-write-close-rename sequence merely asks for atomicity without durability, something that's perfectly reasonable. As other posters have mentioned in vain, all the application wants is for either the old version of a file or the entire new version to appear on a reboot. It doesn't care at the instant of the rename whether that replacement has been recorded on disk, just that eventually, when the filesystem does record that replacement, it's recorded atomically.
You might want the open-write-fsync-close-rename behavior for a mailserver, in which you must acknowledge receipt (i.e., you need durability), but asking for that same durability in a multi-file configuration setup is just stupidly degrading performance.
open-write-close-rename is saying something fundamentally different from open-write-fsync-close-rename, and it's perfectly reasonable for a filesystem to act sanely in response to both kinds of request.

Re:raid controllers don't fake it by greg1104 · 2009-03-11 17:23 · Score: 2, Informative

If your battery-backed RAID controller ever fakes a fsync it is fundamentally broken or misconfigured. When the cache is filled with a write backlog and you try to write something else, that write will block until there is free space. Same as any other write cache that fills up.

When cache space is available to cache the write again, the data goes into there, and then a fsync request after it can then return success.

Hiding behind POSIX by antientropic · 2009-03-11 20:10 · Score: 2, Informative

All the Idiots who scream here that the OS is doing something wrong: no, it's not.

This is called "hiding behind the standard" (a disease very common among kernel developers). Just because the standard doesn't specify behaviour in a certain situation doesn't mean that any behaviour is equally okay. In this case, ext4's behaviour very much hurts the robustness of the system, which is rather important in unreliable environments like laptops.

In this case, what KDE does is certainly not unreasonable (and its developers are certainly not "idiots"). It doesn't overwrite configuration files in place, which would be bad even in the absence of system crashes, as doing it that way is not atomic. Instead it creates a new temporary file, writes the new contents, then renames the temporary file to the old one. This is an atomic operation on Unix: you either see the old contents or the new contents, but nothing in between. Now, the problem is that in case of a crash, ext4 gives you the worst possible outcome by reordering the operations: it will "recover" the rename for you, but not the actual write of the new data. So you end up with a 0-byte file - far from atomic. POSIX of course allows this, but POSIX allows just about anything: that doesn't mean it's reasonable. The only guaranteed solution - use an fsync/fdatasync - is something that almost nobody does because the performance is horrible (ext3 in fact will write the entire journal, IIRC, when doing an fsync() on a single file - this really hurt Firefox 3 performance). So the KDE developers can be excused for not doing that.

It's the job of a modern filesystem to ensure robustness and performance. If you don't use an fsync, you should expect that there is a time window during which transactions might become undone (not the end of the world for configuration files), but they should never be reordered. For instance, this is how Berkeley DB works if you disable fsync: it guarantees ACI but not ACID. For many desktop applications, that's good enough. Destroying every file that has been updated since the last fsync isn't. And your users aren't going to be impressed by the argument that POSIX allows it.

Re:Why SHOULD applications have to assume bad FSs? by Eunuchswear · 2009-03-11 22:56 · Score: 2, Informative

People don't fsync() all the time because it's SLOW. Not just a little slow, but RTFS's bug report for the link to the Firefox 3 bug due to performing 8 syncs per page load: if there's any IO going on, firefox ground to a halt to wait its turn to ensure that your bookmarks and history and cookies and everything else were really, really written to disk.

Well, it has to be said that fsync() on ext3 is slow because of an ext3 bug - fsync() is the same as sync() on ext3.

--
Watch this Heartland Institute video

This is definitely an FS problem by Uzik2 · 2009-03-12 00:28 · Score: 2, Insightful

If the guys writing the FS can't figure out how to properly write a cache that's not the problem of the application writers.
If I save a file via an OS call and the OS tells me it didn't fail then if I can't immediately reread it then the OS is broken.

Data loss from write caching is not a new problem either. Guess this year's crop of programmers can't figure out how to use google to find out about past problems or they just figure they're smarter than everyone else that came before them.

--
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it

ZFS - copy-on-write & checksums - today by toby · 2009-03-12 02:39 · Score: 2, Informative

Re: "backup old file and write a new one" - A transactional copy-on-write filesystem such as Sun's ZFS is doing almost the same job, transparently.

I have little doubt that copy-on-write will eventually supersede overwrite-and-pray filesystems. The wins are numerous, including cheap snapshotting, etc, etc. Install OpenSolaris and give ZFS a try today!

--
you had me at #!
