EXT4 Data Corruption Bug Hits Linux Kernel
An anonymous reader writes "An EXT4 file-system data corruption issue has reached the stable Linux kernel. The latest Linux 3.4, 3.5, 3.6 stable kernels have an EXT4 file-system bug described as an apparent serious progressive ext4 data corruption bug. Kernel developers have found and bisected the kernel issue but are still working on a proper fix for the stable Linux kernel. The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often."
They split it in half? I suspect you mean dissected.
It's a good thing most stable releases are on 3.2 or 3.0 with commercial systems on even earlier versions.
I know he'd never do anything to harm me or my data.
It's a pity they can't use ZFS instead of re-inventing the wheel. The other pity is that the newest distros seem to force you to use EXT4 at installation (on your desktop).
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.
We're talking about Linux users here...move along.
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often."
They're trying to boost the average uptime of all installations by making people keep their machines turned on. It's just a continuation of the uptime war waged with the BSD folks!
Ezekiel 23:20
Brilliant. Well, it certainly worries this Linux developer -- although I mostly rely on pre-3.0 kernels. Wasn't there a rule on Slashdot about mirroring articles before posting links to them ?
In Soviet Russia, our new overlords are belong to all your base.
From Ted Ts'o's commentary, it's an optimization ("jbd2: don't write superblock when if its empty") gone awry:
Basically, this optimization has the side effect of not updating the transaction log in this rare case. You can end up replaying old transactions after new ones, which will scramble metadata blocks. Given the rather unique conditions needed to hit this one, I'm not going to lose any sleep over any servers running without Ted's fix (though I'll certainly apply it once RedHat releases the patch).
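To make the "replaying old transactions after new ones" failure mode concrete, here is a toy sketch in Python. It is not how jbd2 works internally: the dictionary-based disk, the block names, the version strings and the `replay` helper are all hypothetical, and the only thing it models is the claim above that stale journal entries applied over newer on-disk metadata will scramble it.

```python
def replay(disk, journal):
    """Apply journaled transactions to a copy of the disk, in journal order."""
    result = dict(disk)
    for txn in journal:
        # Each transaction overwrites the blocks it recorded.
        result.update(txn["blocks"])
    return result


# Hypothetical on-disk metadata after newer changes were written in place.
disk = {"bitmap": "v3", "inode_table": "v3"}

# Hypothetical transactions that were applied long ago but, in this toy
# model, were never marked as finished, so recovery still sees them.
stale_journal = [
    {"seq": 1, "blocks": {"bitmap": "v1"}},
    {"seq": 2, "blocks": {"inode_table": "v2"}},
]

# A journal correctly marked empty leaves the disk alone.
clean_recovery = replay(disk, [])

# A journal that still lists old transactions rolls metadata backwards.
after_recovery = replay(disk, stale_journal)

print("clean:", clean_recovery)
print("stale:", after_recovery)
```

In the stale case both metadata blocks end up at versions older than what was on disk before recovery, which is the kind of scrambling the comment describes.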
Please define "too often" .... ?!?!
...and too deep. It awoke a being of segfaults and kernel panics.
At first I had mixed feelings of slight disappointment and concern, especially because it is the default filesystem in several distros, (including Android). Although, after some second thoughts, I have come to the following conclusions:
...please, guys, don't do it again!
1) It is part of the game: continuous development toward improvement (most of the time) and new features imply some pitfalls. So far, the benefits are much larger than the costs.
2) Although the developers are still working on a fix, I wouldn't be surprised if one were found soon.
3)
This is why I don't use file systems less than 10 years old.
What term do we get to use for ext4 now? It's unfortunate that Theodore Tso is actually a pretty decent guy instead of being a murderer (and a jerk). So there aren't any obviously negative terms that come to mind.
But clearly, something needs to be done along these lines, as well as a legion of people who forever more claim that ext4 corrupts your data and you should never use it and stick with ext3 instead.
Need a Python, C++, Unix, Linux develop
After recently discovering pm-suspend on my desktop, I have found I never need to turn off my computer again! Use "sudo visudo" to get rid of the annoying pw prompt.
grammar nazi's
grammar Nazis
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.
This is wrong. The problem occurs when the fs is unmounted too *soon*. Twice in a row. The bug only appears if the journal buffer does not wrap. You only get catastrophic results if this happens twice in a row.
We don't see the world as it is, we see it as we are.
-- Anais Nin
I have used BSD. I found it .... quite striking. There's a hell of a lot of performance enhancement in Linux, and it really shows when you try to boot BSD and find it's ass-slow from the get-go. I even tried slapping down Debian-kfreebsd to compare something roughly the same and ... yeah it's just slow as shit. Solaris (both Sun Solaris and Nexenta = Ubuntu/Solaris) wasn't that slow.
Support my political activism on Patreon.
This is what you get when you use a filesystem that wasn't developed by a real company.
Because if they had to worry about losing money, they would make damned sure that problem didn't exist. Or at least make it go away. I thought this "problem" existed with ext4 for years.
Yeah, Micro$oft is evil, but their FS works. And file corruption isn't a serious issue except when hard drives fail, and, well, in that case...DERP!
BSD died about 10 years ago.
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often."
"You're just rebooting it wrong."
-Loonix filesystem developer
... can we get the words "stable", "linux", and "kernel" into a single summary? I like this game.
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
They're mounting it wrong!
When you mount your disks, you need to be sure of proper head alignment. Make sure she's spun up properly as well, otherwise the disks could be surprised and jump away causing a crash. Lastly, my geek friends, mounting too often can cause burning friction which can destroy data and cause irritation and discomfort.
I said no... but I missed and it came out yes.
...I believe that it had problems with large files (I don't know all of the details) at one point, too.
This may still be an open issue.
I stick with EXT3, but it has the "forever to perform a mkdir" issue after your filesystem crosses
some file count threshold. But I've not had anything go sour with EXT3 even when the box has
gone down hard from a power failure.
Also, we're running Win 2008 server and this is the second time we've seen this where a whole
partition becomes unusable. We have to restore the entire image from backup; it can't be repaired.
CAPTCHA = sour grapes they're not!
And Netcraft confirmed it. I know. Everybody knows. You don't need to keep repeating it.
But, of course, zombies are known to be slow... You may be on to something.
Rethinking email
... going to the Uptime War Battle Royale?
s/behind/ahead/
s/Fusion/Fussy/
People reboot linux?
Considering how those who manage the curve are rude, obstructive and just downright mean - I think Linux does a great job in keeping up.
FreeBSD has never been slow for me. (At least I have never noticed it - possibly because with ports you don't build in stuff you don't need).
(Other than when MySQL was only properly usable with linuxthreads, and the linuxulator didn't yet support them, but that really was a long time ago.)
There are no good video card drivers other than Nvidia's, though. (Same with Solaris x86.)
FreeBSD is not getting bloated to a ridiculous level either. (Stuff like DTRACE is worth it)
If you build your main ports with the latest gcc (Unfair to use gcc 4.2 in the base system) / use an Nvidia Video Card / AHCI / 64 bit arch.
Stuff that is supported works generally pretty well. (The usbaudio / envy24 just works unlike the combination of alsa and / or pulseaudio that always messes stuff up.)
Save yourself the extra write and the extra opportunity for something to go wrong: disable the journal. Worth considering in any case: http://pentabular.wordpress.com/ext4-on-laptop-ssd/
I have a Google+ post where I've posted my latest updates to this still-developing story:
https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7
Also, I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.
What does this mean for the various versions of Android using ext4? I think I just flashed my tablet to use ext4 (ugh)... I really don't want corruption on my tablet...
Nobody smart reads gag+. Or failbook. lern2internet
I have many thumb drives formatted in ext4. I guess it would not be a good idea to use them on my 3.5 kernel based distro, then?
The more recent patch at http://marc.info/?l=linux-kernel&m=135105626207228&w=2 fixes stuff.
Yes, it just won't receive the benefits of an Apple Fusion drive, but it does run fine.
Change is certain; progress is not obligatory.
After recently installing Slackware 14, I was a bit miffed that the distribution release had reverted at the last minute to kernel version 3.2.29. Now I am so grateful that Patrick played it safe.
Maybe it is a good idea to refuse patches with grammatical errors in the comments or descriptions.
When the submitter has not made the effort to make at least the comments grammatically correct,
probably the correctness of the code is questionable.
Um, if you're referring to the fairly recent Ubuntu/Fedora enhancements of speeding up boots by loading the daemons in parallel...
The BSDs probably will never include that feature, as they are conservative and value having a simple, debuggable design.
(But, well, your rant has no value if you give no concrete examples. I wonder who decided to moderate you upwards.)
"have you tried turning it off and on again" anymore?!?!?
I currently work on a product that uses fuse on top of xfs on top of LVM on top of RAID1. There are good solid reasons for the existence of each of those layers.
No filesystem is the best for all uses, and when ZFS tries to do everything it means that it doesn't play nice with the rest of the stack.
According to Ted Ts'o's latest update (https://plus.google.com/117091380454742934025/posts) this actually involved a combination of "umount -l" and shutting down while the filesystem was still mounted, and the user also had "nobarrier" set on the filesystem as well as "journal_async_commit".
So it sure looks like the user was playing fast and loose...this is not something that's going to hit your average person.
Perhaps the author of this summary could have been more precise. The bug is very unlikely to be triggered; here are some examples: https://lkml.org/lkml/2012/10/24/535 and http://phoronix.com/forums/showthread.php?74697-EXT4-Data-Corruption-Bug-Hits-Stable-Linux-Kernels&p=293446#post293446 . Still, it is a good precaution to downgrade to a safe version and wait for a patch. I have been using 3.6.2 on my two Gentoo boxes for a couple of days and nothing has happened. As a precaution I will downgrade until they release a fix.
And this is why I wait before switching fs types. I waited almost 2 years after ext3 was considered stable before I switched from ext2. I just rebuilt my machine 2 days ago, and I almost, almost went with ext4. But that little voice of caution (read: paranoid subconscious :P) told me to hold off, and then someone pointed out this thread to me.
With that said, after reading the posts in the mailing lists, I am once again proud of the kernel developers and the hardcore linux geeks, for so quickly jumping on this problem, as well as the calm of the "victims". If a similar problem occurred in windows, hoo-boy, there would be an uprising.
--- Amateur musician: http://josh.morine.net/headbanger/
I guess it has come time to tell the truth.
First of all, the bug has never been bisected, and the whole story that hit Slashdot and some other news sites was based solely on Ted's speculation, which was never confirmed. In fact, at the end of the same day, Ted admitted that his hypothesis was wrong.
After a few days of investigation, the problem was traced to an experimental mounting option, which is not turned on by default and was intended for developers only. Accidentally, this option was not marked as "experimental", so it is available to users. https://lkml.org/lkml/2012/10/26/570
Nah. To get the case I found you need not one experimental option, but *three*.
Specifically, you need nobarrier,journal_async_commit -- and the latter option implies journal_checksum, so it's really three options.
If you do all that, reboots / blockdev disconnections while an unmount is proceeding will not merely give you filesystem corruption on second mount (regardless of options the second time), but *silent* filesystem corruption on remount (journal_checksum and any other options will give you a journal abort and read-only remount, which is a pretty big clue that something is wrong, though the filesystem is still corrupted).
Fun stuff.
-- N.
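For anyone wondering whether their own machines use the option combination described above (nobarrier plus journal_async_commit, which implies journal_checksum), here is a rough Python sketch that scans mount-table text for ext4 entries carrying any of those options. The function name, the `RISKY_OPTIONS` set and the sample table are made up for illustration, and it assumes the options show up in /proc/mounts under the same names used on the mount command line.

```python
# Options named in the comment above; treated here as "worth a second look".
RISKY_OPTIONS = {"nobarrier", "journal_async_commit", "journal_checksum"}


def risky_ext4_mounts(mounts_text):
    """Return {mount_point: sorted risky options} for ext4 entries."""
    findings = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        options = set(fields[3].split(","))
        hits = sorted(options & RISKY_OPTIONS)
        if hits:
            findings[fields[1]] = hits
    return findings


# Hypothetical mount table used only to demonstrate the function.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /scratch ext4 rw,nobarrier,journal_async_commit,journal_checksum 0 0
tmpfs /tmp tmpfs rw,nosuid 0 0
"""

if __name__ == "__main__":
    print(risky_ext4_mounts(SAMPLE))
```

To check a live system, pass it the contents of /proc/mounts instead of `SAMPLE`, for example `risky_ext4_mounts(open("/proc/mounts").read())`. An empty result only means none of these three options were listed; it says nothing about which kernel version is running.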