Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4

Posted by timothy on Wednesday March 11, 2009 @09:04AM from the heavy-trade-off dept.

cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.

Not a bug by casualsax3 · 2009-03-11 09:06 · Score: 5, Informative

It's a consequence of not writing software properly. Relevant links later in the same comment thread for those who might otherwise miss them:
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
1. Re:Not a bug by Anonymous Coward · 2009-03-11 09:37 · Score: 5, Informative
  
  Quoting Ts'o:
  "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, ...
  Linux reinvents windows registry?
  Who knows what they will come up with next.
2. Re:Not a bug by davecb · 2009-03-11 09:49 · Score: 4, Informative
  
  Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))
  --dave
  
  --
  davecb@spamcop.net
3. Re:Not a bug by OeLeWaPpErKe · 2009-03-11 09:51 · Score: 5, Informative
  
  Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.
  Oh dear God, someone inform the president ! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.
  You should not write your application to depend on filesystem-specific behavior. You should write it to the standard, and that means fsync(). No call to fsync, no guarantee; look it up in the documentation (man 2 write).
  The rest of what Ted Ts'o is saying is optimization, speeding up the boot time for gnome/kde; it is not necessary for correct operation.
  Please don't FUD.
  You know I'll look up the docs for you :
  (quote from man 2 write)
  
  NOTES
  A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee
  that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
  If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has
  been written, the call succeeds, and returns the number of bytes written.
  That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)
  So the normal case for a "reliable write" would be this code :
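  size_t written = 0;
  ssize_t r = write(fd, &data, sizeof(data));
  while (r >= 0 && written + (size_t)r < sizeof(data)) {
      written += (size_t)r; // partial write: remember how far we got
      r = write(fd, (const char *)&data + written, sizeof(data) - written); // resume after the bytes already written
  }
  if (r < 0) { // error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets
  }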
  and *NOT*
  write(fd, data, sizeof(data)); // will probably work
  Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)
  Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.
4. Re:Not a bug by Jurily · 2009-03-11 10:04 · Score: 4, Informative
  
  It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk.
  No, that's the bug. It loses ALL data. You get 0 byte files on reboot.
5. Re:Not a bug by caerwyn · 2009-03-11 10:16 · Score: 4, Informative
  
  You're right. The correct thing to do is to *always* call fsync() when you need a data guarantee, *regardless* of which FS you're on. The fact that not doing it in the past hasn't caused problems isn't the problem- those calls are the correct way of handling things.
  
  --
  The ringing of the division bell has begun... -PF
6. Re:Not a bug by dmiller · 2009-03-11 10:21 · Score: 4, Informative
  
  You are doing it wrong; permanently failing on recoverable EINTR and EAGAIN errors. See here for how to do it right.
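  The linked example did not survive in this copy of the thread. As a rough sketch of the point, assuming a blocking descriptor, a full-write loop that retries on EINTR might look like the following. The helper name write_all is made up for illustration, and EAGAIN on a non-blocking descriptor would additionally need a poll() or select() wait that is left out here.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): write exactly len bytes
 * from buf to fd. Returns 0 on success, -1 on a non-recoverable error
 * with errno set by write(). Assumes a blocking descriptor. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t r = write(fd, p, len);
        if (r < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written: retry */
            return -1;      /* EIO, ENOSPC, EPIPE, ...: the caller must handle it */
        }
        p += r;             /* partial write: skip what was already accepted */
        len -= (size_t)r;
    }
    return 0;
}
```

  Note that this only guarantees the kernel accepted every byte; it says nothing about the data having reached the disk, which still takes fsync().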
7. Re:Not a bug by Ed+Avis · 2009-03-11 12:01 · Score: 3, Informative
  
  YES!! That is EXACTLY what I expect every modern file system to do.
  Your expectation is quite reasonable. When the application writes something to disk, it should be there on disk, right? The way the article is presented makes it sound like a horrible bug in ext4 that it doesn't do this. But believe it or not, almost no filesystem provides this guarantee by default. ext3 doesn't (in the default mode), nor does ext2, nor a typical implementation of FAT or NTFS or the Minix filesystem or whatever.
  For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later, giving better disk performance at the expense of losing data if there is a crash. Losing file data is bad but losing metadata is even worse, since corrupt filesystem metadata can trash the contents of many files and requires a lengthy fsck on startup. So journalling filesystems, as typically configured, keep a journal for metadata so it's not corrupted even if the power gets cut at the most inconvenient moment. But they don't extend the same care to file contents, because it would be too slow. You can enable it by setting the data=journal parameter in ext3 (and I guess ext4 too) but this isn't the default.
  It is certainly a bit unfair that the filesystem takes such pains with its own bookkeeping information but doesn't bother to be so careful about user data. But as I said, it's a known tradeoff to get better performance. If you want to be sure your file has reached disk you need to fsync(). This sucks, but it's the Unix way, and has been so for like, forever. So it's not a bug in ext4 - just bad luck and perhaps a misunderstanding between kernel and userspace about what guarantees the filesystem provides.
  As SSDs replace rotating storage, there is less need to buffer writes (certainly the need to minimize seek time goes away, and that's the biggest reason), so we might see this whole situation resolved within a few years. Perhaps in 2015, when the system call returns, you can be sure that the data is written. Until that longed-for day, bear in mind that your filesystem is permitted to temporarily lie to you about what has been written, and call fsync() if you are paranoid.
  
  --
  -- Ed Avis ed@membled.com
8. Re:Not a bug by Qzukk · 2009-03-11 12:49 · Score: 4, Informative
  
  A file system should take my data buffer, and after saying "Ok, I got it"
  There's your problem, you didn't even bother to ask if it got it, you just threw a ton of data into the file descriptor and closed it, now didn't you. And you want me on thedailywtf?
  But let's back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it".
  All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.
  The filesystem does exactly what you tell it to do. If you don't want it to make a zero byte file, then DON'T USE O_TRUNC OR *truncate() TO EMPTY YOUR FILE. Make a new file, fill it up, rename it over the other file. Don't assume that in just a few instructions, you're going to be filling it back up with new data, because those instructions may never arrive.
  You don't like it? Try and convince people that (open file, erase all the data in it, do some stuff, write some data, do some more stuff, write some more data, write data to disk, close file) should be an uninterruptible atomic operation. You want a versioning filesystem? Take your pick.
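  A minimal sketch of the "make a new file, fill it up, rename it over the other file" approach, under a few assumptions: the helper name replace_file and the fixed ".tmp" suffix are invented for illustration (real code would more likely use mkstemp()), an interrupted write is treated as a failure, and the fsync() before the rename is added here in line with the fsync() advice elsewhere in the thread.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace the file at `path` with `len` bytes from
 * `buf` by writing a temporary file and renaming it into place.
 * Returns 0 on success, -1 on failure (the old file is left untouched). */
int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;                       /* path too long for the buffer */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t r = write(fd, p, len);
        if (r < 0)
            goto fail;                   /* any error aborts the replacement */
        p += r;
        len -= (size_t)r;
    }
    if (fsync(fd) != 0)                  /* new data on disk before the rename */
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {        /* swaps the old file for the new one */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

  The original file is never truncated, so a crash at any point should leave either the old contents or the new ones. Making the rename itself durable would also take an fsync() on the containing directory, which this sketch skips.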
  
  --
  If I have been able to see further than others, it is because I bought a pair of binoculars.
Re:Bull by Anonymous Coward · 2009-03-11 09:36 · Score: 5, Informative

This is NOT a bug. Read the POSIX documents.
Filesystem metadata and file contents are NOT required to be synchronous and a sync is needed to ensure they are synchronised.
It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).
RTFPS (Read The Fine POSIX Spec).
Re:Classic tradeoff by imsabbel · 2009-03-11 09:36 · Score: 3, Informative

It's even WORSE than just being asynchronous:
EXT4 reproducibly delays write ops, but commits journal updates concerning this write.

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Re:Bull by pc486 · 2009-03-11 09:50 · Score: 5, Informative

Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.
All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).
POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.
Re:Excuses are false. This is a severe flaw. by Anonymous Coward · 2009-03-11 10:11 · Score: 4, Informative

Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
You seem to misunderstand that's *exactly* what is happening.
KDE is *DELETING* all of its config files, then writing them back out again in two operations.
Three states now exist, the 'old old' state, where the original file existed, the 'old' state, where it is empty, and the 'new' state where it is full again.
The problem is getting caught between step #2 and step #3, which on ext3 was mostly mitigated by the write delay being only 5 seconds.
KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.
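The three states described above can be observed from inside a single process. The sketch below is only an illustration: the function name size_during_rewrite is invented, and it does not simulate a crash. It rewrites a file in place with O_TRUNC and reports the size seen between the truncate and the write, which is the "empty" state a crash could leave behind.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: overwrite `path` in place the risky way and return
 * the file size that exists between the truncate and the write.
 * Returns that intermediate size, or -1 on error. */
off_t size_during_rewrite(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    off_t gap = st.st_size;     /* the old contents are already gone here */

    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return gap;
}
```

Calling it on a file that already holds data still returns 0: the old contents vanish at open() time, well before the replacement bytes are even handed to the kernel.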
Re:Why SHOULD applications have to assume bad FSs? by gweihir · 2009-03-11 10:12 · Score: 3, Informative

What's wrong with "After a file is closed, it's synced to disk"?!?
What, you want people to have to delay/stagger/coordinate their file closes in order to avoid overloading the filesystem? That is the wrong approach. close() just means that the application is done with the file. The sync calls are not a joke, they are there precisely because close() already has entirely sensible but different semantics. Anybody that wants close to also sync can code it that way without a problem. Anybody else probably does not want this behaviour in the first place.
This is not hidden in any way. A simple "man close" not only warns of this, it also refers the reader to the fsync call. Anybody getting bitten by this did not do their homework.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
man 2 fsync by Nicolas+MONNET · 2009-03-11 10:23 · Score: 5, Informative

The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.
Re:To Anonymous Coward: by Bronster · 2009-03-11 10:44 · Score: 4, Informative

mount -o sync. Enjoy your slow returns and strictly ordered writes.
Re:Bull by LWATCDR · 2009-03-11 11:10 · Score: 4, Informative

It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific command called fsync to flush the buffers to prevent the problem.
In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html
Yes, if we had a perfect world we would have instant IO, but we do not. The flaw is in the application, plain and simple.
They didn't use the api properly and it really is just that simple.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Excuses are false. This is a severe flaw. by Tadu · 2009-03-11 11:22 · Score: 5, Informative

KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.
Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.
Re:Bull by amirulbahr · 2009-03-11 12:58 · Score: 3, Informative

Who modded this up? Jane Q. Public is completely clueless on this topic, but she manages to sound like she has an idea to fellow clueless moderators. She should be called out for the karma whoring ignoramus she is.
Some choice quotes from her on this thread.

Delayed allocation is like leading a moving target when shooting.
BadAnalogyGuy would be proud. Probably also worth mentioning that without delayed allocation, the system would be unbearably slow.

The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything really -- going wrong
A kernel crash or power outage is certainly something that could go wrong. Modern journalling file-systems handle this gracefully by making sure the file-system is in a consistent state when it comes back up.

The filesystem is flawed, plain and simple.
You'll realize why that one is a gem when you read her next quote. As the discussion continues, she begins to realize how far off the mark she is and begins to correct...

It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault.
Still off the mark, but perhaps she is beginning to figure out what a file system should offer and what the issue being discussed is.

If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples.
Go ahead and do that. But if you want to make sure your data is written, in case of a kernel crash or power outage, then you had better understand what is going on at the FS level.

As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter.
No, but you should understand the API of the language you are dealing with. Since when does a compiler handle disk I/O anyway? As for your interpreter, it is free to call fsync whenever it wants, but what has that got to do with the FS again?

Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.
The purpose of this quote is to demonstrate that she both has no regard for TFA and also has no idea what the issue being discussed is. I encourage anyone looking to give her mod points to actually RTFA and also do a bit of background reading on file systems and in particular delayed writes.

My point was and still is: if the data is not flushed to disk yet, it should either be accessible from the buffer, or not at all.
This sentence alone deserves a -1 Huh? If you do a write, and it is successful, then you can do a read on the same file and it will return what you wrote, whether or not it had been flushed to disk. This is the way it is supposed to work. Think about it for like 10 seconds and you'll begin to get it.
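The read-after-write behaviour described here is easy to check. A small sketch, assuming an ordinary local file and a function name made up for this example: it writes without ever calling fsync(), then reads the same file back through a second descriptor.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical check: write `msg` to `path` without fsync(), then read
 * it back through a second descriptor. Returns 1 if the bytes match,
 * 0 if they do not, -1 on error. */
int read_sees_unsynced_write(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    char buf[256];
    if (len > sizeof buf)
        return -1;

    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (wfd < 0)
        return -1;
    if (write(wfd, msg, len) != (ssize_t)len) {
        close(wfd);
        return -1;
    }

    int rfd = open(path, O_RDONLY);      /* no fsync() has been issued */
    if (rfd < 0) {
        close(wfd);
        return -1;
    }
    ssize_t n = read(rfd, buf, sizeof buf);
    close(rfd);
    close(wfd);

    if (n < 0)
        return -1;
    return ((size_t)n == len && memcmp(buf, msg, len) == 0) ? 1 : 0;
}
```

The read is served from the kernel's cache, so the data is visible to every process on the running system whether or not it has reached the disk. Only a crash or power loss exposes the difference.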

not supposed to have to worry about OS-specific details
WE ARE TALKING ABOUT UNEXPECTED KERNEL CRASHES AND POWER OUTAGES. If you care about that situation then you should get a clue before you start coding. If not, then what is the problem, or was it fault... er, sorry limitation?

One should not have to know about syncing to do something like a few simple file writes
And one doesn't need to if she is not concerned with the rare possibility that the system CRASHES OR LOSES POWER in the next few minutes.
Anyway, I've never called out another poster like this before and now I feel dirty.