EXT4 Data Corruption Bug Hits Linux Kernel
An anonymous reader writes "An EXT4 file-system data corruption issue has reached the stable Linux kernel. The latest Linux 3.4, 3.5, 3.6 stable kernels have an EXT4 file-system bug described as an apparent serious progressive ext4 data corruption bug. Kernel developers have found and bisected the kernel issue but are still working on a proper fix for the stable Linux kernel. The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often."
Nope - bisection is a common technique for tracking down the cause of a bug by doing a binary search through the code history.
https://en.wikipedia.org/wiki/Code_Bisection
No, this means the kernel has bug-like tendencies from time to time, but is not exclusively buggy. For instance when it's in college, or if it's at a bar, and has had a few drinks, well then it might be buggy, but normally at work and at home and to all its friends it acts stable.
I want to delete my account but Slashdot doesn't allow it.
I know he'd never do anything to harm me or my data.
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.
We're talking about Linux users here...move along.
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.
They're trying to boost the average uptime of all installations by making people keep their machines turned on. It's just a continuation of the uptime war waged with the BSD folks!
Ezekiel 23:20
From Ted Ts'o's commentary, it's an optimization ("jbd2: don't write superblock when if its empty") gone awry:
Basically, this optimization has the side effect of not updating the transaction log in this rare case. You can end up replaying old transactions after new ones, which will scramble metadata blocks. Given the rather unique conditions needed to hit this one, I'm not going to lose any sleep over any servers running without Ted's fix (though I'll certainly apply it once RedHat releases the patch).
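A toy model of the failure mode as this comment describes it, not jbd2's real on-disk format or code. The function `replay`, the `start_seq` parameter, and the block names are all made up for illustration; the only assumption is that some recorded start point tells recovery which journaled transactions still need replaying, and that skipping the update of that record leaves it pointing at old transactions.

```python
def replay(disk, journal, start_seq):
    """Apply every journaled transaction with seq >= start_seq, in log order.

    disk     -- dict mapping a (hypothetical) block name to its contents
    journal  -- list of (seq, {block: contents}) tuples
    start_seq -- first transaction the recovery code believes is still pending
    """
    recovered = dict(disk)
    for seq, writes in journal:
        if seq >= start_seq:
            recovered.update(writes)
    return recovered


# The disk already holds newer metadata than anything left in the log.
disk = {"inode_table": "v2", "bitmap": "v2"}

# Transaction 1 is stale: it was already written back and later superseded.
journal = [(1, {"inode_table": "v1"})]

# Start point was advanced past the stale transaction: nothing is replayed.
clean = replay(disk, journal, start_seq=2)

# Start point was never updated: the stale transaction is replayed on top
# of newer data, leaving a mix of old and new metadata.
stale = replay(disk, journal, start_seq=1)

print(clean)
print(stale)
```

In the second case the inode table goes back to its old contents while the bitmap keeps the new ones, which is the kind of inconsistent metadata the comment calls "scrambled".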
...and too deep. It awoke a being of segfaults and kernel panics.
At first I had mixed feelings of slight disappointment and concern, especially because it is the default filesystem in several distros (including Android). After some second thoughts, though, I have come to the following conclusions:
...please, guys, don't do it again!
1) It is part of the game: continuous development toward improvement (most of the time) and new features implies some pitfalls. So far, the benefits are much larger than the costs.
2) Although the developers are still working on a fix, I wouldn't be surprised if one were found soon.
3)
What they actually split in half is a sequence of changesets (also known as commits).
The idea is you have a sequence of changesets that take you from the last known good revision to the first known bad revision. By splitting that sequence in half and determining whether the revision in the middle is good or bad, you can in principle halve the number of revisions between last known good and first known bad until you find the revision that introduced the bug. Reality is messier because of nonlinear history, because some revisions may be "broken" such that it is not possible to determine if they are "good" or "bad", and because some bugs may be difficult to test for. Still, bisection is a useful tool for finding problem revisions among a long history relatively easily.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
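The halving step described above can be sketched in a few lines. This is a minimal illustration that assumes a strictly linear history and a test that always gives a clear good/bad answer, which is exactly the idealized case and not the messy one. The names `first_bad` and `is_bad`, and the integer revision numbers, are hypothetical.

```python
def first_bad(revisions, is_bad):
    """Return the first bad revision in a linear history.

    Assumes revisions[0] is known good and revisions[-1] is known bad,
    and that every revision after the culprit is also bad.
    """
    lo, hi = 0, len(revisions) - 1   # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid                 # culprit is at mid or earlier
        else:
            lo = mid                 # culprit is after mid
    return revisions[hi]


# Hypothetical history: 100 revisions, bug introduced in revision 37.
calls = []

def is_bad(rev):
    calls.append(rev)                # record each build-and-test step
    return rev >= 37

culprit = first_bad(list(range(100)), is_bad)
print(culprit, len(calls))
```

Each test halves the remaining range, so roughly log2(n) builds are needed instead of n.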
So clearly the answer is General Tso's FS. Delicious, but you'll lose your data an hour later.
grammar nazi's
grammar Nazis
The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.
This is wrong. The problem occurs when the fs is unmounted too *soon*. Twice in a row. The bug only appears if the journal buffer does not wrap. You only get catastrophic results if this happens twice in a row.
We don't see the world as it is, we see it as we are.
-- Anais Nin
I have to agree with you. This is one of the best demos of ZFS around :)
http://www.youtube.com/watch?v=QGIwg6ye1gE
ZFS solves 3 problems by taking a holistic approach:
* Volume Management
* File System
* Data Integrity
Instead of fragmenting the problem into 3 layers, each with only limited access and knowledge, a unified layer has more meta-information available to make smarter decisions.
Some interesting essays:
https://blogs.oracle.com/bonwick/entry/raid_z
https://blogs.oracle.com/bonwick/en_US/entry/rampant_layering_violation
> Windows has never had anything as serious as a file system corruption bug.
That you know of...
Since the Windows development process isn't open, there's no way for you to tell. You don't get to see Microsoft's development versions and you don't get to see Microsoft's bug database.
A Pirate and a Puritan look the same on a balance sheet.
http://answers.microsoft.com/en-us/windows/forum/windows_cp-files/bug-report-serious-filesystem-corruption-and-data/17f69e19-92ca-4e1e-b9d5-f78f1ac4e963
Bugs happen. The difference here is that Linux development is done in the open so people find out about them.
I have used BSD. I found it .... quite striking. There's a hell of a lot of performance enhancement in Linux, and it really shows when you try to boot BSD and find it's ass-slow from the get-go. I even tried slapping down Debian-kfreebsd to compare something roughly the same and ... yeah it's just slow as shit. Solaris (both Sun Solaris and Nexenta = Ubuntu/Solaris) wasn't that slow.
Support my political activism on Patreon.
Source?
Cuz I'm looking:
http://en.wikipedia.org/wiki/Ntfs#Microsoft_Windows
http://www.tomshardware.com/forum/1249-63-ntfs-win7-windows
http://en.wikipedia.org/wiki/Ntfs#Versions
And just not seeing "XP is incompatible with the newest version of NTFS"
On the Oregon Coast born and raised, On the beach is where I spent most of my days
This one occurred in October, so it's pretty doubtful, since none of the major distros are that up to date.
Perhaps, if disect is a real word, but dissect means "cut up/apart", not specifically into two parts.
If God forks the Universe every time you roll a die, he'd better have a damned good memory.
They're mounting it wrong!
When you mount your disks, you need to be sure of proper head alignment. Make sure she's spun up properly as well, otherwise the disks could be surprised and jump away causing a crash. Lastly, my geek friends, mounting too often can cause burning friction which can destroy data and cause irritation and discomfort.
I said no... but I missed and it came out yes.
> Blame SUN, they choose a license for ZFS to ensure it never had proper in kernel linux support.
That's a myth / blatant lie.
Fork Yeah! The Rise and Development of illumos
http://www.youtube.com/watch?feature=player_detailpage&v=-zRN7XLCRhc#t=1460s
Why You Need ZFS
http://www.youtube.com/watch?v=6F9bscdqRpo
@5:40 I just want to clarify your comment "It would be illegal to ship"
@5:45 I think there is a perception issue that we need to tackle.
@5:55 One point that I would like to make, because I think it was said earlier, is that we have much more in common than separates us.
@5:58 One of the most important things we all have in common is we are all open source systems.
@6:02 And we need to end this self inflicted madness of open source licensing compatibility.
@6:12 I think that it is a boogeyman and we are letting it hold us back.
@6:19 You say it would be illegal to ship. I say no one has standing
@6:24 The GPL was never ever designed to counter-act other open source licenses.
@6:33 That is a complete rewrite of history to believe the GPL was designed to be at war with BSD or with CDDL.
@6:39 The GPL was at war with proprietary software. And thanks to the GPL and Stallman, open source won.
@6:45 That is the whole point. Open source won.
@6:49 We are pissing on our own victory parade by not allowing these technologies to flow between systems.
They split it in half?
I know it's wrong but I just got this mental image of someone moving all the 0's to one side of a page and all the 1's to the other side...
You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
If God forks the Universe every time you roll a die, he'd better have a damned good memory.
Nah, He only needs the latest SHA1 for each roll outcome commit as that'll point up the GIT tree :-D
I think YOU are the one who didn't get the joke...
"That's right...I said it."
That isn't a file system bug, that is progress. Would you consider it a bug if a Linux system from 1998 caused corruption on an ext4 volume?
Hell yeah.
If it'd tell me it doesn't know the file system and has no idea what to do with it,
that would be perfectly fine.
But corrupting a file system just because it is unknown to/unsupported by the
system trying to read it would be a huge bug.
Still, for all of the shit that Linux users talk about Windows, WINDOWS has NEVER had anything as serious as a FILE system CORRUPTION bug.
Finally, someone talking sense ... oh wait.
http://www.computerworld.com/s/article/9054178/Microsoft_s_Windows_Home_Server_corrupts_files
"Microsoft's Windows Home Server CORRUPTS FILES"
"'Don't edit' list includes photos, as well as Quicken and QuickBooks files, warns Microsoft; no word on patch"
Never mind ...
People reboot linux?
Nah!
Your'e wrong!!
The 0's go to the top of the page, and the 1's to the bottom!!!
(As the 0's have air bubbles that make them float...)
[An irrelevant irrelevancy?]
The summary should say "bisected and found" not "found and bisected". Bisecting is a way of finding bugs.
No. They found the bug, then bisected the commits between "last known working" and HEAD to discover what patch caused it.
Dewey, what part of this looks like authorities should be involved?
It was the 486DX that brought the FPU on chip. The 386DX had a 32-bit wide data bus and the 386SX had a 16-bit wide data bus, as well as only 24 bits of the address bus hooked up externally.
>> Windows has never had anything as serious as a file system corruption bug.
>That you know of...
So what were all those chkdsk errors after BSODs?
Also FatPhil on SoylentNews, id 863
Nice try, but fail. That wasn't a bug in Windows, it was a bug in applications.
Really? Not according to Microsoft.
http://support.microsoft.com/kb/946676
"A BUG has been discovered in the way that the initial release of Windows Home SERVER manages FILE transfer and balancing across multiple hard drives. In certain cases, depending on application use patterns, timing, and the workload that is placed on the Windows Home Server-based computer, certain FILES could become CORRUPTED."
"... For distributing data across the different hard drives that are MANAGED by WINDOWS Home Server, the WINDOWS Home Server mini-filter driver REDIRECTS I/O ... A BUG has been discovered in the REDIRECTION mechanism which, in certain cases, depending on application use patterns, timing, and workload, may cause interactions between NTFS, the Memory Manager, and the Cache Manager to get out of sync. This causes CORRUPTED data to be written to FILES."
If it'd tell me it doesn't know the file system and has no idea what to do with it, that would be perfectly fine.
But corrupting a file system just because it is unknown to/unsupported by the system trying to read it would be a huge bug.
Windows did have this behaviour, by the way. In 2007 I had a Dell Inspiron laptop with two power buttons: one for Normal Windows and one for Media Center Windows. I had wiped the hard drive and installed Fedora on it. Powering with the normal button worked fine, but if by accident one were to power it on with the Media Center button then I would get the initial Media Center screen (I have no idea where that code was hiding, possibly in a hidden partition) and it would wipe all my ext3 filesystems.
It is dangerous to be right when the government is wrong.
Ah I see, we have ambiguity about what "find a bug" means. From the user's perspective, "finding a bug" means producing the buggy behavior. But from the developer's perspective, "finding a bug" means finding the erroneous code. And we are talking about developers here. From my perspective, until the bug was "found" by bisecting it was only "known to exist", not found. See?
By the way, I've actually bisected bugs, have you? No? OK.
When all you have is a hammer, every problem starts to look like a thumb.
I have a Google+ post where I've posted my latest updates to this still-developing story:
https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7
Also, I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.
I've had whole NTFS partitions get corrupted, twice. In both instances, the partitions were formatted under Linux, specifically Ubuntu.
Lesson learnt is, never format an NTFS partition under Linux. Personally, I think this functionality should be disabled. It's way too dangerous.
Dropbox drops it like it's hot.
> Windows has never had anything as serious as a file system corruption bug.
I'm going to assume that either you are joking, or you have only been using Windows for about 5 minutes.
On the off chance that you are actually serious, Geoff Chappell documented a case some years ago in which Windows would occasionally toggle a byte (might have been a word; can't remember now) on the hard drive. Just one byte in a random sector somewhere on the drive. Happy flower sunshine.
You should also Google "Windows disk corruption" and look at all the complaints and cries for help.
One reason why I tried Linux, then switched to it and have stuck with it, was because I was sick and tired of having to run scandisk and/or chkdsk at least once a week on my Windows systems just to keep them running. At the time, I was a contract programmer doing a ton of development, and believe me, if you were constantly working the hard drive (as I was), you WOULD have corruption issues. At random, no explanation. You learned to do constant backups and to be prepared for anything.
The only thing I've experienced even close to that under Linux is that the installer typically does a quick format instead of a full format. As a result, if you have a drive that's iffy and with bad sectors, the install will appear to complete successfully, but it won't work. The answer to that one is, "buy a new hard drive." :)
(I had to learn that one the hard way. If you get ANY errors on a hard drive, just replace the blasted thing. Don't wait, either. Do it now.)
Windows 7 seems to be fairly stable, but XP (just to name one) is notorious for just blowing things up at random. It might be a registry entry; it might be a corrupted executable image on disk. Who knows? But the standard cure is just to back up and reinstall.
Cogito, igitur comedam pizza.
OK, and now I'm probably off topic, but I'm an older guy and as we get older, we like to reminisce. (Between bellowed exhortations to remove one's feet from the lawn, of course.)
I remember a million years ago, when I was developing VxDs for Windows 95. I rigged up the debugger to go active early in the boot ... and had to disable it.
Windows 95 generated SO MANY faults during the boot, it took forever otherwise. I mean, it constantly klonged. Bang, bang, bang, one exception after another. They (mostly) went away when Windows 95 OSR2 appeared. :)
Ah, memories ... Blue Screens of Death ... random disk corruption ... it was a beautiful thing.
Cogito, igitur comedam pizza.
Windows 7 should not have automounted the partition once it detected it wasn't forward compatible with the partition formatting. Forced mounting and formatting would be possible user choices. The bug is in the detection (there may not be any) or the action after the detection.
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.
Android is unaffected: the bug was introduced after Linux 3.6 and no Android kernel is anywhere near that recent.
The more recent patch at http://marc.info/?l=linux-kernel&m=135105626207228&w=2 fixes stuff.
I got bit by this one: http://support.microsoft.com/kb/925308 on volumes with hundreds of thousands of small files. Every file whose size was a multiple of 4 KB was corrupted.
They split it in half? I suspect you mean disected.
Actually, "Di" means two just as "Bi" does. Therefore, Bisected and Disected both mean "Cut into two pieces."
/endrant
I have bisected bugs, horizontally.
When I was in college the place we lived in had an infestation of 2 inch cockroaches.
Used to kill them with wax bullets.
Shoot at the floor at a low angle a few inches in front of the bug and the spray of wax would cut them in half.
Often the bottom half would run off and leave the top half.
End MGM. Get prospective parents of boys to Google: Men do complain
I guess it has come time to tell the truth.
First of all, the bug has never been bisected, and the whole story that hit Slashdot and some other news sites was based solely on Ted's speculation, which was never confirmed. In fact, at the end of the same day, Ted admitted that his hypothesis was wrong.
After a few days of investigation, the problem was traced to an experimental mounting option, which is not turned on by default and was intended for developers only. Accidentally, this option was not marked as "experimental", so it is available to users. https://lkml.org/lkml/2012/10/26/570