Kernel Hackers On Ext3/4 After 2.6.29 Release
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"
Quote from Linus:
"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.
How about ASKING them rather than calling them morons?
(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)
TDz.
FTA: "if you write your data _first_, you're never going to see corruption at all"
Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.
Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!
Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.
Well this is just my meta comment. I'll be writing my real comment later...
You forgot to include a link to the comment you'll be writing later.
Solaris 10 with ZFS, if you actually care about your data.
Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.
from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.
So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
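To see which of the two modes quoted above a filesystem was mounted with, one option is to look for the data= option in its mount options. This is only a minimal sketch: the function name and the sample line are made up for illustration, and it assumes the usual whitespace-separated /proc/mounts layout (device, mount point, type, options). A missing data= option is reported as None rather than guessed, since the default mode may not be listed at all.

```python
def ext3_data_mode(mounts_line):
    """Return the data= journaling mode named in one /proc/mounts-style line.

    Returns None when no data= option is listed (the default may be omitted).
    """
    fields = mounts_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 whitespace-separated fields")
    # The fourth field is the comma-separated list of mount options.
    for option in fields[3].split(","):
        if option.startswith("data="):
            return option[len("data="):]
    return None


# Hypothetical sample line, not taken from a real system.
sample = "/dev/sda1 / ext3 rw,relatime,errors=remount-ro,data=writeback 0 0"
print(ext3_data_mode(sample))
```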
JFS
Yeah, I have to second this... all the journalling filesystems in the world can't compete with a bog-standard, home-based UPS. You just need to make ABSOLUTELY sure that the system shuts down when the battery STARTS going (don't try and be fancy about getting it to run until the battery lifetime) and that the system WILL shut down, no questions asked.
A UPS costs, what, £50 for a cheap, home-based one? Batteries might cost you £20 a year or so on average (and probably a lot less if you just need "shutdown safely" rather than "carry on running"). You don't need it to give a lot of power (run ONLY the base unit off it... anything else and you could hit overloads, etc... you *won't* be operating the PC when it's on battery, you just want it to shut down and, optionally, give you a beep or two when it has shut down successfully), or for very long at all. You just need a fail-safe way of detecting when the power is out so that you can safely shut down. You also want to check that your cabling is good (nothing more embarrassing than having a UPS and then pulling the wrong cable out).
Above and beyond that, filesystem and/or data corruption is one of those things that are almost guaranteed to happen unless you put a lot of effort into it (battery-backed RAID controllers, filesystems with slow-but-sure settings, integrity checking etc.). Make it easy on yourself - use a UPS to stop the problem happening ever, rather than try to have something that *might* clean up nicely if it does happen. Even Google don't bother with journalling - if a PC loses power, it's rebuilt from an image. It's not worth faffing about to see if/when/how a filesystem can be repaired, just ensure you have adequate backups and try to stop it happening in the first place.
Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.
vi ~/.emacs # I'm probably going to Hell for this.
Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm.
Linus, perhaps, is a taskmaster and perfectionist. The Linux OS is his baby and any major difficulties will ultimately be a bad reflection on him alone.
It is not inappropriate to sometimes rudely castigate one's associates. It is a kind of shaming game that is intended to inspire better performance. I recall that during the Intel ethernet fiasco involving the e1000e driver, Torvalds was equally brusque toward the Intel developers for their "stupid" oversights.
What we need is more, and not less, of such an aggressive attitude. A real man can take it. Indeed, real men will welcome it, because the end result, in spite of any hurt feelings, is an overall higher quality of craftsmanship.
(1) Never point to a structure before it has been initialized
Which surely includes writing data before meta-data (and write the data someplace other than where the old meta-data is pointing), which is what Linus was saying.
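At the application level, the same "data first, pointer second" rule is commonly approximated by writing to a temporary file, flushing it, and only then renaming it over the old name. The sketch below is one hedged way to do that on a POSIX system; the function name write_then_publish is made up for illustration, and it is not a description of how ext3 or ext4 order their own writes.

```python
import os
import tempfile


def write_then_publish(path, data):
    """Write data to a temp file, flush it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data reaches the disk first
        os.replace(tmp_path, path)  # only now does the name point at it
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory entry as well (assumes a POSIX system).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The old contents stay reachable under the original name until the rename, so a crash partway through should leave either the old file or the new one, not a name pointing at unwritten blocks.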
What we need is more, and not less, of such an aggressive attitude. A real man can take it.
That depends if you're trying to construct a team of "real men" or a team of skilled developers.
People sometimes confuse the idea or the act with the person associated with it. If I propose a stupid idea or commit a stupid act, then by all means call me out and tell me that it's stupid and why. But save the ad hominem attacks. Calling somebody a moron accomplishes no good thing, and doing it in public is an extremely quick and effective way of destroying team morale.
I am literally 3000 tokens away from the chaotic crossbow --Stephen
I think it's more a matter of dealing with divas all day. It's pretty clear that the two sides of this issue are the side with technical people convinced that the correctness of the journaling system overcomes any difficulties with integrity, and people who think that integrity should be paramount. For most users, disk integrity IS the number one priority. It seems to me that this is a case of some people not being able to see that they're wrong.
In a corporation, it's as simple as saying, "do it our way or hit the street." With Linux development the leaders don't have that power, so they may replace it with forcefulness. Besides, the honesty is kind of refreshing. Linus lays out a clear argument and only then starts insulting the other person. He's being brutal, but he's giving them more information than a more polite person might.
Some of us have discovered the 'shutdown' command. [...]Anyhow, I suggest you use it occasionally. Then perhaps you can only fsck when something bad has happened.
Don't be too smug - a "shutdown" doesn't always guarantee a clean startup. I remember a bug (hopefully fixed now) where "shutdown" was completing so quickly that it powered off the computer while data was still sitting in the hard drive's volatile write cache. Even though the OS had unmounted the filesystem, the on-disk blocks were still dirty.
p.s. If any OS/kernel developers are listening - how about implementing a standard API through which drive write-caches can be flushed+disabled whenever a system starts a shutdown procedure, gets a signal that the UPS is running on battery power, or otherwise concludes that it is in a state where a temporarily-increased risk of data loss justifies slowing down I/O?
ZFS is production ready my ass. ZFS will be production ready when I can take a disk out of the filesystem, when I can set quotas, when it supports HSM, and when it supports clustering.
Finally it will be production ready when it has a decade of hardening in the real world.
In the meantime both JFS and XFS offer better alternatives, and for me only GPFS (which admittedly is closed source but does run under Linux) ticks all the boxes.
The crazy thing is that ext4 offers nothing that we don't get with XFS or JFS, and if Red Hat would stop pussyfooting about and support either one (and I don't care which), the whole ext? line could die.
The ext2/3 line had a place and a time, and that place and time has long gone. It needs to die...
FreeBSD has ZFS. My understanding is that while ZFS is a good filesystem, it isn't without issues. It doesn't work well on 32-bit architectures because of the memory requirements, isn't reliable enough to host a swap partition, and can't be used as a boot partition when part of a pool. Here's FreeBSD's rundown of known problems: http://wiki.freebsd.org/ZFSKnownProblems.
On the other hand, the new filesystems in the Linux kernel - ext4 and btrfs - are taking the lessons learned from ZFS. I'm excited about next-generation filesystems, and I don't think ZFS is the only way to go.
Yes, but in this case ext3 and ext4 keep (convenient, fast) consistency of the filesystem at the cost of worse behavior regarding the user experience (and user data).