Kernel Hackers On Ext3/4 After 2.6.29 Release
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints upon what seems to be a long-time performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Keen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"
Mmmh, must be a big problem
We are the people our parents warned us about.
Mirror for the thread:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699
this is what I get from http://lkml.org/lkml/2009/3/24/460:
"The server is taking too long to respond; please wait a minute or 2 and try again."
Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.
Like me... :-)
When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
The server is running linux.
Quote from Linus:
"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.
How about ASKING them rather than calling them morons?
(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)
TDz.
FTA: "if you write your data _first_, you're never going to see corruption at all"
Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.
Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!
Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.
If I were to set up a new home spare-part server using software RAID-5 and LVM today, using kernel 2.6.28 or 2.6.29, and I really care about not losing important data in case of a power outage or system crash but still want reasonable performance (not run with -o sync), what would be my best choice of filesystem (ext4 or XFS), mkfs and mount options?
My other account has a 3-digit UID.
Tell us what you really think there Linus.
~I went home today knowing I made someone cry!~
Andi Kleen, the l is missing.
I think he's sad because he never got that job at Microsoft he always wanted.
Maybe only a hug from Bill Gates would solve his problem.
Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.
from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.
So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
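To check which of these modes a mounted filesystem is actually using, one option is to read the options column of /proc/mounts. What follows is only a minimal sketch in Python, assuming the usual /proc/mounts layout (device, mount point, type, options, ...). The helper name `data_modes` is made up for illustration, and a filesystem running in its default mode may not list a `data=` option at all, which the sketch reports as `None`.

```python
def data_modes(mounts_text):
    """Map mount point -> value of the data= option for ext3/ext4 mounts.

    The value is None when the options string does not mention data=
    (the option may be omitted when the default mode is in use).
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("ext3", "ext4"):
            continue
        mode = None
        for opt in fields[3].split(","):
            if opt.startswith("data="):
                mode = opt[len("data="):]
        modes[fields[1]] = mode
    return modes
```

On a Linux machine, passing the contents of /proc/mounts to `data_modes` gives a quick view of which mounts explicitly request `ordered`, `writeback` or `journal`.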
Sometimes I get the impression that Linus says things the way he does because the other 'powerful' guys who are really important and active in the Linux community either say nothing or even agree with him when he talks like that. I remember a similar episode some time ago when a guy wanted to port Git to C++ or something like that. I think he cried.
I can't imagine a reason to be this rude.
Any life is made up of a single moment, the moment in which a man finds out, once and for all, who he is.
Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.
I think Ted Tso etc are probably perfectly aware of how it works.
Frankly I think Linus is trolling.
Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots and a fast in-kernel CIFS server fully integrated with ZFS ACLs (and with support for NTFS-style mixed case naming) is just icing on the cake. Now if only I could have an OpenSolaris desktop with all the nice linux userland apps available. Oh wait, I can!
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.
Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).
Here's an excerpt:
We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.
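The ordering rules in that excerpt can be pictured as a dependency graph over pending writes. The following is only a toy sketch in Python, not how soft updates is implemented (the real thing tracks dependencies at a much finer granularity and has to cope with cycles). The write names and the helper `safe_write_order` are invented for illustration. It models rule (1) and rule (3) for a rename.

```python
from graphlib import TopologicalSorter

# Each pending write maps to the set of writes that must reach the disk
# before it. The names are illustrative, not from any real filesystem.
pending_writes = {
    "inode 12 (initialized)": set(),
    # Rule 1: never point to a structure before it has been initialized.
    "dir entry 'new.txt' -> inode 12": {"inode 12 (initialized)"},
    # Rule 3: do not remove the old name until the new name is written.
    "dir entry 'old.txt' removed": {"dir entry 'new.txt' -> inode 12"},
}


def safe_write_order(deps):
    """Return one order of writes in which every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


order = safe_write_order(pending_writes)
for step in order:
    print(step)
```

Any order produced this way writes the inode before the directory entry that points to it, and the new name before the old one is dropped.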
There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
-- Sig down
Yeah, I have to second this... all the journalling filesystems in the world can't compete with a bog-standard, home-based UPS. You just need to make ABSOLUTELY sure that the system shuts down when the battery STARTS going (don't try and be fancy about getting it to run until the battery lifetime) and that the system WILL shut down, no questions asked.
A UPS costs, what, £50 for a cheap, home-based one? Batteries might cost you £20 a year or so on average (and probably a lot less if you just need "shutdown safely" rather than "carry on running"). You don't need it to give a lot of power (run ONLY the base unit off it... anything else and you could hit overloads, etc... you *won't* be operating the PC when it's on battery, you just want it to shut down and, optionally, give you a beep or two when it has shut down successfully), or for very long at all. You just need a fail-safe way of detecting when the power is out so that you can safely shut down. You also want to check that your cabling is good (nothing more embarrassing than having a UPS and then pulling the wrong cable out).
Above and beyond that, filesystem and/or data corruption is one of those things that are almost guaranteed to happen unless you put a lot of effort into it (battery-backed RAID controllers, filesystems with slow-but-sure settings, integrity checking etc.). Make it easy on yourself - use a UPS to stop the problem happening ever, rather than try to have something that *might* clean up nicely if it does happen. Even Google don't bother with journalling - if a PC loses power, it's rebuilt from an image. It's not worth faffing about to see if/when/how a filesystem can be repaired, just ensure you have adequate backups and try to stop it happening in the first place.
Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?
Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.
You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.
I think Ted Tso etc are probably perfectly aware of how it works.
Except that ext4 loses data in ordered mode for exactly the same reason, and we had a big fuss about that the last few weeks, because *someone* (cough) said that it's the application developers fault for not fsync()-ing.
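For context, the application-side pattern that the fsync() argument revolves around is "write a temporary file, fsync it, then rename it over the original". A minimal sketch in Python follows, assuming a POSIX system. The function name `durable_replace` is made up for illustration, and whether this much ceremony should be required of applications is exactly what the thread is arguing about.

```python
import os
import tempfile


def durable_replace(path, data):
    """Replace the file at `path` with `data` (bytes) so that, after a
    crash, the name refers to the complete old or complete new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem. Note mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # Python's buffer -> kernel
            os.fsync(tmp.fileno())  # kernel -> disk, before the rename
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is on disk too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The key point is the order: the data is forced out before the rename makes the new file visible under the old name.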
Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm.
Linus, perhaps, is a taskmaster and perfectionist. The Linux OS is his baby and any major difficulties will ultimately be a bad reflection on him alone.
It is not inappropriate to sometimes rudely castigate one's associates. It is a kind of shaming game that is intended to inspire better performance. I recall that during the Intel ethernet fiasco involving the e1000e driver, Torvalds was equally brusque toward the Intel developers for their "stupid" oversights.
What we need is more, and not less, of such an aggressive attitude. A real man can take it. Indeed, real men will welcome it, because the end result, in spite of any hurt feelings, is an overall higher quality of craftsmanship.
Of course, the truth must be somewhere in the middle :)
Rethinking email
ext4 by default had the equivalent of ext3 writeback mode on.
> We need a gradual level of tiers ranging from a database that does its own journaling
> and needs to know that data is fully written to disk to an application swapfile that if
> it never hits the disk isn't a big deal (granted, such an app should just use kernel swap,
> but that is another issue).
Actually there already is a syscall for telling the kernel how the file will be used.
posix_fadvise (int fd, off_t offset, off_t len, int advice)
POSIX_FADV_DONTNEED sounds like what you would use for your swapfile case.
I don't know if the kernel actually does anything with this information, but it looks like
this would be a good place to implement any new interfaces for what you are suggesting.
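For anyone who wants to try the call, Python exposes it as `os.posix_fadvise` on Unix. The sketch below only shows the mechanics, and the helper name `hint_dontneed` is invented here. Note that `POSIX_FADV_DONTNEED` is an advisory caching hint (the application does not expect to touch the data again soon); it makes no promise about whether or when the data reaches the disk, so it is at best a loose fit for the swapfile case.

```python
import os
import tempfile


def hint_dontneed(fd):
    """Tell the kernel we do not expect to access this file's data soon.

    Returns True if the hint was issued, False if this platform's Python
    does not expose posix_fadvise.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    # offset=0 and len=0 together mean "the whole file".
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    return True


with tempfile.TemporaryFile() as scratch:
    scratch.write(b"x" * 4096)
    scratch.flush()  # make sure the data has left Python's buffer
    issued = hint_dontneed(scratch.fileno())
```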
That writing to a hard disk is slower than writing to RAM?
UPS are nice, and I use one too. It won't protect you from kernel crashes or direct hardware failures. It would still result in corrupted discs if some filesystem decided it did not yet have to write that 2 GB of cached data. Ext3 in ordered mode is still much preferred.
Instead of giving apps the ability to tag "critical" data, give them the ability to inspect the write status of data. This can be done by adding another fd_set to select() (which currently has readfds, writefds, and exceptfds). Add one called "flushedfds" that will return when all data for that file descriptor has been flushed to disk. The kernel can prioritize flushes for all files that have an active select(...flushedfds...) call pending, but otherwise it can still do writes in the optimal order. And the app can have its guarantee that critical data has been written.
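No such "flushedfds" set exists in select() today. A rough userspace approximation is to run fsync() in a background thread and wait on its completion, sketched here in Python. The name `notify_when_flushed` is made up for illustration. Unlike the proposal, this forces the flush rather than passively observing it, so the kernel loses some of its freedom to order writes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Small pool of threads whose only job is to sit in fsync().
_flusher = ThreadPoolExecutor(max_workers=2)


def notify_when_flushed(fd):
    """Start fsync(fd) in the background. The returned Future completes
    once fsync has returned, or carries the OSError if it failed."""
    return _flusher.submit(os.fsync, fd)


with tempfile.NamedTemporaryFile() as f:
    f.write(b"critical record\n")
    f.flush()                     # Python's buffer -> kernel
    pending = notify_when_flushed(f.fileno())
    # ... the application can keep doing other work here ...
    pending.result()              # blocks until fsync has returned
    flushed = pending.done()
```

The Future plays the role of the proposed readiness notification: the application keeps running and only blocks at the point where it needs the guarantee.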
As long as ZFS licensing is incompatible with the GPL it's never going in. The person from that blog you linked understood something you clearly did not.
"The only way I'm seeing ZFS on the Linux kernel is to convince Sun to dual-license ZFS under the GPL and the CDDL."
You might not like the GPL but suggesting Linux developers should ignore it is not informative, it's completely retarded.
What we need is more, and not less, of such an aggressive attitude. A real man can take it.
That depends if you're trying to construct a team of "real men" or a team of skilled developers.
People sometimes confuse the idea or the act with the person that is associated with. If I propose a stupid idea or commit a stupid act, then by all means call me out and tell me that it's stupid and why. But save the ad hominem attacks. Calling somebody a moron accomplishes no good thing, and doing it in public is an extremely quick and effective way of destroying team morale.
I am literally 3000 tokens away from the chaotic crossbow --Stephen
...if you want the state of the art in data integrity. (Checksumming, transactional copy on write, self healing, simple pool management, snapshots, filesystems, etc.) Read more: Solaris 10, OpenSolaris.
you had me at #!
So far Linux has nothing even close.
Must be a nice world where the only cause for an unclean shutdown is power interruption. And where the power supply itself never goes tits up.
I think it's more a matter of dealing with divas all day. It's pretty clear that the two sides of this issue are the side with technical people convinced that the correctness of the journaling system overcomes any difficulties with integrity, and people who think that integrity should be paramount. For most users, disk integrity IS the number one priority. It seems to me that this is a case of some people not being able to see that they're wrong.
In a corporation, it's as simple as saying, "do it our way or hit the street." With Linux development the leaders don't have that power, so they may replace it with forcefulness. Besides, the honesty is kind of refreshing. Linus lays out a clear argument and only then starts insulting the other person. He's being brutal, but he's giving them more information than a more polite person might.
ZFS, on the other hand, is production ready today.
Sorry, but no, it isn't. You will hear them screaming utter murder, when their OS needs half an hour to boot, and a file copy only goes with a few kB/s.
Users want integrity AND speed. Most won't even know there's a difference. So it's always a trade off between safety and speed. At least til we get copy-on-write filesystems and fast, big SSDs on a large scale.
Maybe Linus should just fix it instead of whining about it. It's open source, dammit.
Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle
Linus is not clueless in this case. I think it is a case of you misinterpreting the issue he was discussing.
Journaling is, as you say NOT about data integrity/prevention of data loss. That is what RAID and UPSes are for. However, it IS about data CONSISTENCY. Even if a file is overwritten, truncated or otherwise corrupted in a system failure (i.e. loss of data integrity) the journal is supposed to accurately describe things like "file X is Y bytes in length and resides in blocks 1,2,3...." (data/metadata consistency). Why would you update that information before you are sure the data was actually changed? A consistent journal is the WHOLE REASON why you can "alleviate the delay caused by fscking".
Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.
I don't know much about linux file systems, but now I know more than I want to. What idiot writes pointers to data that's not there yet?
The last non-trivial file system I worked on was on the Sigma 7, circa 1969, and its update sequence carefully avoided doing that; it's not like this is a new discovery. It's a basic engineering principle: "Make before Break."
And these guys have the effrontery to call themselves "software engineers."
On the other hand, they're working for free, so gift-horse and all that.
I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
Does Slashdot have ato
If you don't like the way disks work in a power outage, just switch to drum storage. Its angular momentum means that it would keep turning long enough to dump the entire core (OK, this is a bit ancient) to the drum. Sometimes, the "UPS" was a generator attached to the drum, so it powered the cpu. The drum was spun by separate motors, and had a read/write head on each track: no seek time, read & write in parallel to all tracks, great for virtual memory. They were noisy power-hogs, however.
http://en.wikipedia.org/wiki/Drum_memory
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
someone speaks some sense. POSIX simply currently lacks fbarrier(...).
HAND.
If Linus et al don't like the way ext3 works, they shouldn't complain about the developer, they should change it. After all, they have the source code.
Ah, that felt good!
It doesn't go into the kernel. The database sits as part of the app layer. The kernel itself just contains very basic file systems which boot up the database + other stuff.
"Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to the present cultural norm."
Where is this gentle programming territory in the US? Remember, the Daily WTF was started in the USA - not exactly a font of tenderness.
He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking.
Well I was unaware of it, too.
And when I did a journaling system back in the mid '80s the whole POINT of it was to maintain a consistent ("though not necessarily current") filesystem on the disk at all times. ("Not necessarily current" means transactions that haven't yet hit the disk get lost in a crash. So if you want to build a reliable transaction processor on top of it you have a bit more to do.)
The idea behind it: Servers are intended to run continuously. So the commonest mode of shutdown will be system crash. Thus the server needs to:
1) Always be able to recover from a crash.
2) Do it very quickly.
(Once you have that you don't even need a shutdown mechanism. Just kill it. Kick off the clients first if you're really concerned about not reversing transactions.)
I had THOUGHT that the journaling file systems we've come to know and deploy were also based on this set of ideas. If they AREN'T, it's time to build one that IS.
(And if I'd known earlier that they weren't I might have gone and done it. B-( )
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
I have had 1 power outage in 8 years (in Switzerland) but various kernel oopses over the years. So at least in my case, the chance of a sudden lockup due to a kernel bug is much much higher than a power problem. Therefore a UPS won't really help.
I have a wonderful data point here - by sheer coincidence, I have a computer that's been running for 8 years, with plenty of power outages and not a single kernel oops ever (it's on its fifth or sixth kernel upgrade, at least) - in fact, it would worry me if I saw a kernel oops on a machine I was relying on to store my data, as suggested by the OP, and I would probably want to integrity-check the whole damn computer. Similarly for power-supply failures, or anything else. Once you get those sorts of problems, you have bigger problems than "was the fs journalled?". I *have* seen a journalled fs that quite happily passed fsync after a power failure and had lost data - it's much easier than you think, and you can't trust it.
And what makes you think that the journalling in the case of kernel oops would help you escape a corrupt filesystem? Almost by definition, if the kernel oopses, it has messed up and done something it should NEVER have done (like trawled data across memory etc.), and that might well be in the filesystem code. Maybe I could expand my suggestion and say "UPS + a backup", but that much is obvious if you care about your data. And I do use journalling FS, but I don't *rely* on them, precisely because of things like the recent fsync() discussion... even if you THINK it's working, it doesn't mean it is. A UPS is worth MORE than a journalling fs, because it negates the need for one to a certain extent in a much simpler fashion. However, if you care about your data, the only way to be sure is to have UPS + journalling + backups + integrity check.
I certainly wouldn't use LVM, RAID-5, ext4, XFS, or Linux. I'd use Solaris 10 or OpenSolaris and ZFS.