Kernel Hackers On Ext3/4 After 2.6.29 Release
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints upon what seems to be a long-time performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Keen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"
Mmmh, must be a big problem
We are the people our parents warned us about.
Mirror for the thread:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699
this is what I get from http://lkml.org/lkml/2009/3/24/460:
"The server is taking too long to respond; please wait a minute or 2 and try again."
Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.
Like me... :-)
When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
The server is running linux.
Quote from Linus:
"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.
How about ASKING them rather than calling the Morons?
(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)
TDz.
FTA: "if you write your data _first_, you're never going to see corruption at all"
Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.
Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!
Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.
If I were to setup a new home spare-part-server using software RAID-5 and LVM today, using kernel 2.6.28 or 2.6.29 and I really care about not losing important data in case of a power outage or system crash but still want reasonable performance (not run with -o sync), what would be my best choice of filesystem (EXT4 or XFS), mkfs and mount options?
My other account has a 3-digit UID.
Tell us what you really think there Linus.
~I went home today knowing I made someone cry!~
I think he's sad because he never got that job at Microsoft he always wanted.
Maybe only a hug from Bill Gates would solve his problem.
Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.
from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files will
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.
So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
Deleted
Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots and a fast in-kernel CIFS server fully integrated with ZFS ACLS (and with support for NTFS-style mixed case naming) is jut icing on the cake. Now if only I could have an Opensolaris desktop with all the nice linux userland apps available. Oh wait, I can!
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.
Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keeps tracks of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).
Here's an excerpt:
We refer to this requirement as an update dependency, because safely writing the direc- tory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mecha- nisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.
There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
-- Sig down
Yeah, I have to second this... all the journalling filesystems in the world can't compete with a bog-standard, home-based UPS. You just need to make ABSOLUTELY sure that the system shuts down when the battery STARTS going (don't try and be fancy about getting it to run until the battery lifetime) and that the system WILL shut down, no questions asked.
A UPS costs, what, £50 for a cheap, home-based one? Batteries might cost you £20 a year or so on average (and probably a lot less if you just need "shutdown safely" rather than "carry on running"). You don't need it to give a lot of power (run ONLY the base unit off it... anything else and you could hit overloads, etc... you *won't* be operating the PC when it's on battery, you just want it to shut down and, optionally, give you a beep or two when it has shut down successfully), or for very long at all. You just need a fail-safe way of detecting when the power is out so that you can safely shutdown. You also want to check that your cabling is good (nothing more embarassing than having a UPS and then pulling the wrong cable out).
Above and beyond that, filesystem and/or data corruption is one of those things that are almost guaranteed to happen unless you put a lot of effort into it (battery-backed RAID controllers, filesystems with slow-but-sure settings, integrity checking etc.). Make it easy on yourself - use a UPS to stop the problem happening ever, rather than try to have something *might* clean up nicely if it does happen. Even Google don't bother with journalling - if a PC loses power, it's rebuilt from an image. It's not worth faffing about to see if/when/how a filesystem can be repaired, just ensure you have adequate backups and try to stop it happening in the first place.
Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?
Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.
You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.
I think Ted Tso etc are probably perfectly aware of how it works.
Except that ext4 loses data in ordered mode for exactly the same reason, and we had a big fuss about that the last few weeks, because *someone* (cough) said that it's the application developers fault for not fsync()-ing.
> We need a gradual level of tiers ranging from a database that does its own journaling
> and needs to know that data is fully written to disk to an application swapfile that if
> it never hits the disk isn't a big deal (granted, such an app should just use kernel swap,
> but that is another issue).
Actually there already is a syscall for telling the kernel how the file will be used.
posix_fadvise (int fd, off_t offset, off_t len, int advice)
POSIX_FADV_DONTNEED sounds like what you would use for your swapfile case.
I don't know if the kernel actually does anything with this information, but it looks like
this would be a good place to implement any new interfaces for what you are suggesting.
UPS are nice, and I use one too. It won't protect you from kernel crashes or direct hardware failures. It would still result in corrupted discs if some filesystem decided it did not yet have to write that 2 GB of cached data. Ext3 in ordered mode is still much preferred.
Comment removed based on user account deletion
What we need is more, and not less, of such an aggressive attitude. A real man can take it.
That depends if you're trying to construct a team of "real men" or a team of skilled developers.
People sometimes confuse the idea or the act with the person that is associated with. If I propose a stupid idea or commit a stupid act, then by all means call me out and tell me that it's stupid and why. But save the ad hominem attacks. Calling somebody a moron accomplishes no good thing, and doing it in public is an extremely quick and effective way of destroying team morale.
I am literally 3000 tokens away from the chaotic crossbow --Stephen
I think it's more a matter of dealing with divas all day. It's pretty clear that the two sides of this issue are the side with technical people convinced that the correctness of the journaling system overcomes any difficulties with integrity, and people who think that integrity should be paramount. For most users, disk integrity IS the number one priority. It seems to me that this is a case of some people not being able to see that they're wrong.
In a corporation, it's as simple as saying, "do it our way or hit the street." With Linux development the leaders don't have that power, so they may replace it with forcefulness. Besides, the honesty is kind of refreshing. Linus lays out a clear argument and only then starts insulting the other person. He's being brutal, but he's giving them more information than a more polite person might.
Maybe Linus should just fixit instead of whining about it. It's open source, dammit.
Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle
Linus is not clueless in this case. I think it is a case of you misinterpreting the issue he was discussing.
Journaling is, as you say NOT about data integrity/prevention of data loss. That is what RAID and UPSes are for. However, it IS about data CONSISTENCY. Even if a file is overwritten, truncated or otherwise corrupted in a system failure (i.e. loss of data integrity) the journal is supposed to accurately describe things like "file X is Y bytes in length and resides in blocks 1,2,3...." (data/metadata consistency). Why would you update that information before you are sure the data was actually changed? A consistent journal is the WHOLE REASON why you can "alleviate the delay caused by fscking".
Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.
ZFS is production ready my ass. ZFS will be production ready when I can take a disk out the filesystem, when I can set quota's when it supports HSM and when it supports clustering.
Finally it will be production ready when it has a decade of hardening in the real world.
In the meantime both JFS and XFS offer better alternatives, and for me only GPFS (which admittedly is closed source but does run under Linux) ticks all the boxes.
The crazy thing is that ext4 offers nothing that we don't get with XFS or JFS, and if RedHat would stop pussy footing about, and support either one (and I don't care which) the whole ext? could die.
The ext2/3 line had a place and a time, and that place and time has long gone. It needs to die...