Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4


Posted by timothy on Wednesday March 11, 2009 @09:04AM from the heavy-trade-off dept.

cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.

Theory doesn't matter; practice does by microbee · 2009-03-11 09:28 · Score: 3, Interesting

So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.
But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.
EXT4 aims to replace EXT3 and become the next-gen de facto filesystem on the Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.
It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.
Excuses are false. This is a severe flaw. by rpp3po · 2009-03-11 09:28 · Score: 3, Interesting

There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
Re:Not a bug by Qzukk · 2009-03-11 09:29 · Score: 5, Interesting

I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that its data is safe until the drive catches up.
Fortunately his patches will include an option to turn the magic computer fairy off.

--
If I have been able to see further than others, it is because I bought a pair of binoculars.
not mounted sync,dirsync? by dltaylor · 2009-03-11 09:40 · Score: 4, Interesting

When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion to the overall file system activity, or you might as well run on a ramdisk.
While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.
BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.
Re:Bull by vadim_t · 2009-03-11 13:54 · Score: 4, Interesting

That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.
Mount your filesystem with the "sync" option; that should do what you want, I guess. Performance will be bad though.
There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.
Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.
Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
Even if the FS does like you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in fs-sized blocks. So your 10K sized file can be interrupted in the middle and get cut off at 4K in size after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.
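The temp-file-and-rename approach described above can be sketched roughly as follows. This is only an illustration in Python on a POSIX system, not code from any of the applications discussed; the function name atomic_write and the ".tmp-" prefix are made up for the example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes).

    After a crash, `path` should hold either the complete old content or
    the complete new content, assuming the filesystem honours fsync and
    atomic rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the disk
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        # Something failed: remove the temporary file if it still exists
        # and leave the old file untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling atomic_write("settings.conf", b"...") never truncates the existing file in place, which is the pattern that loses data when a crash lands between the truncate and the write. Note that mkstemp creates the new file readable only by its owner, so an application that cares about the old file's permissions would have to copy them over before the rename.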

Also, with a modern SATA disk supporting Native Command Queuing, the OS should immediately write the data to the disk's buffer, and the disk's firmware gets to decide about re-ordering.
NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:
1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update on-disk data first, journal second, leading to disk corruption. The hard disk can't magically figure out what's the right way to write the data so that it remains consistent; only the OS and the application can ensure that.
2. NCQ is limited to 32 commands anyway, so the OS has to do its own handling regardless.

As for the argument about using sqlite - why have yet another abstraction? After all, the filesystem is already a sort of database!
Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and checking the return code of every system call, you can use something like sqlite that does it internally and saves you the effort, and returns a single value that tells you whether the whole update worked or not.
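As a rough illustration of that last point, here is a small sketch using Python's standard sqlite3 module. The settings table and the save_settings function are invented for the example; the idea is only that a batch of updates either commits as a whole or is rolled back, and the caller gets a single yes/no answer.

```python
import sqlite3


def save_settings(db_path, settings):
    """Store all key/value pairs in one transaction.

    Returns True only if every pair was committed; on any database
    error nothing from this call is kept and False is returned.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "CREATE TABLE IF NOT EXISTS settings "
                "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                settings.items(),
            )
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

If one of the values is invalid (None, for instance, which violates the NOT NULL constraint), the whole batch is rolled back and the previously stored settings stay as they were, so the application never has to reason about a half-applied update.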