Kernel Hackers On Ext3/4 After 2.6.29 Release
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints upon what seems to be a long-time performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Keen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"
Mmmh, must be a big problem
We are the people our parents warned us about.
Mirror for the thread:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699
this is what I get from http://lkml.org/lkml/2009/3/24/460:
"The server is taking too long to respond; please wait a minute or 2 and try again."
Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.
Like me... :-)
When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811228
If a developer has a difficult time justifying his choices, that could be an indication that the choice is not well thought out.
If a developer, failing to explain a choice, hunkers down and refuses to change, that could be an indication of excessive ego.
A fsck would seem to be in order,
The server is running linux.
Quote from Linus:
"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.
How about ASKING them rather than calling them morons?
(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)
TDz.
FTA: "if you write your data _first_, you're never going to see corruption at all"
Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.
Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!
Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.
If I were to set up a new home spare-part server using software RAID-5 and LVM today, using kernel 2.6.28 or 2.6.29, and I really care about not losing important data in case of a power outage or system crash but still want reasonable performance (not run with -o sync), what would be my best choice of filesystem (ext4 or XFS), mkfs, and mount options?
My other account has a 3-digit UID.
Tell us what you really think there Linus.
~I went home today knowing I made someone cry!~
after all
Epic fail, eh?
Andi Kleen, the l is missing.
Some people think it doesn't matter, some people think it does.
I think he's sad because he never got that job at Microsoft he always wanted.
Maybe only a hug from Bill Gates would solve his problem.
Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.
from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.
So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
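To see which of the two modes a given mount is asking for, you can pick the data= option out of the mount option string (the fourth field of a /proc/mounts line, for instance). A minimal Python sketch: the function name is made up for illustration, and it assumes, as the FAQ excerpt above implies, that a missing data= option means the ordered default.

```python
def ext3_data_mode(mount_options: str) -> str:
    """Return the journaling data mode named in a comma-separated option string."""
    for option in mount_options.split(","):
        key, _, value = option.strip().partition("=")
        if key == "data" and value:
            return value
    # Assumption: no explicit data= option means the ext3 default, ordered.
    return "ordered"


if __name__ == "__main__":
    for options in ("rw,noatime,data=writeback", "rw,relatime"):
        print(options, "->", ext3_data_mode(options))
```

This only reads the option string; it does not ask the kernel what it is actually doing.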
Sometimes I get the impression that Linus says things the way he does because the other 'powerful' guys who are really important and active in the Linux community don't say anything, or even agree with him, when he talks like that. I remember a similar episode some time ago when a guy wanted to port Git to C++ or something like that. I think he cried.
I can't imagine a reason to be this rude.
Any life is made up of a single moment, the moment in which a man finds out, once and for all, who he is.
Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
You specifically have to choose writeback mode in the full knowledge that the data blocks will almost certainly be written after the metadata journal.
I think Ted Ts'o etc. are probably perfectly aware of how it works.
Frankly I think Linus is trolling.
Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of Linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots and a fast in-kernel CIFS server fully integrated with ZFS ACLs (and with support for NTFS-style mixed case naming) is just icing on the cake. Now if only I could have an OpenSolaris desktop with all the nice Linux userland apps available. Oh wait, I can!
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.
Maybe, but compared to Theo De Raadt he's positively polite ...
Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf . Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).
Here's an excerpt:
We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.
There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
-- Sig down
Ummm, the piece of code Linus called idiotic, he may have written himself. While Linus is well known for not holding back his feelings with colorful language, he's also got a strange sense of humour.
well fsck you too... let me see you do a better job..
Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?
Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.
> You specifically have to choose writeback mode in the full knowledge that the data blocks will almost certainly be written after the metadata journal.
> I think Ted Ts'o etc. are probably perfectly aware of how it works.
Except that ext4 loses data in ordered mode for exactly the same reason, and we had a big fuss about that the last few weeks, because *someone* (cough) said that it's the application developers fault for not fsync()-ing.
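For anyone wondering what "fsync()-ing" looks like from the application side, the usual pattern is: write the new contents to a temporary file, fsync it, then rename it over the old one. Here is a minimal Python sketch of that pattern. The function name is hypothetical, and it assumes a POSIX system where rename within one directory is atomic and a directory can be opened and fsync'ed.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data, flushing the data before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the disk
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        # Something failed before the rename: remove the leftover temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also flush the directory so the rename itself is on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The point of the ordering is that the new name never points at data that has not been flushed yet. The cost is one or two fsync() calls per save, which is exactly the performance complaint in the thread.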
Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm.
Linus, perhaps, is a taskmaster and perfectionist. The Linux OS is his baby and any major difficulties will ultimately be a bad reflection on him alone.
It is not inappropriate to sometimes rudely castigate one's associates. It is a kind of shaming game that is intended to inspire better performance. I recall that during the Intel ethernet fiasco involving the e1000e driver, Torvalds was equally brusque toward the Intel developers for their "stupid" oversights.
What we need is more, and not less, of such an aggressive attitude. A real man can take it. Indeed, real men will welcome it, because the end result, in spite of any hurt feelings, is an overall higher quality of craftsmanship.
ext4 by default had the equivalent of ext3 writeback mode on.
> We need a gradual level of tiers ranging from a database that does its own journaling
> and needs to know that data is fully written to disk to an application swapfile that if
> it never hits the disk isn't a big deal (granted, such an app should just use kernel swap,
> but that is another issue).
Actually there already is a syscall for telling the kernel how the file will be used.
posix_fadvise (int fd, off_t offset, off_t len, int advice)
POSIX_FADV_DONTNEED sounds like what you would use for your swapfile case.
I don't know if the kernel actually does anything with this information, but it looks like
this would be a good place to implement any new interfaces for what you are suggesting.
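For reference, here is roughly what calling it looks like, as a small Python sketch (Python exposes the call as os.posix_fadvise on Unix). The helper name is made up. Keep in mind that POSIX_FADV_DONTNEED is advice about cached pages, not a promise about whether or when data reaches the disk, so it may not fit the swapfile case exactly.

```python
import os


def write_without_caching(path: str, data: bytes) -> None:
    """Write data to path, then hint that its cached pages are not needed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                      # os.write may write only part of the buffer
            written = os.write(fd, view)
            view = view[written:]
        # Flush first: pages that are still dirty generally cannot be dropped.
        os.fsync(fd)
        # Offset 0 with length 0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The kernel is free to ignore the hint entirely, which matches the "I don't know if the kernel actually does anything" caveat above.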
That writing to a hard disk is slower than writing to RAM?
Instead of giving apps the ability to tag "critical" data, give them the ability to inspect the write status of data. This can be done by adding another fd_set to select() (which currently has readfds, writefds, and exceptfds). Add one called "flushedfds" that will return when all data for that file descriptor has been flushed to disk. The kernel can prioritize flushes for all files that have an active select(...flushedfds...) call pending, but otherwise it can still do writes in the optimal order. And the app can have its guarantee that critical data has been written.
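No such "flushedfds" set exists in select() today, but the idea can be approximated in userspace: run fsync() in a background thread and signal completion through a pipe that the app selects on. A rough Python sketch, where every name is hypothetical and nothing here gives the kernel the prioritization hint the proposal describes:

```python
import os
import select
import threading


def notify_when_flushed(fd: int) -> int:
    """Start fsync(fd) in the background; return a pipe fd that becomes
    readable once the flush has finished (or failed)."""
    read_end, write_end = os.pipe()

    def worker() -> None:
        try:
            os.fsync(fd)
            os.write(write_end, b"\x01")  # flush succeeded
        except OSError:
            os.write(write_end, b"\x00")  # flush failed
        finally:
            os.close(write_end)

    threading.Thread(target=worker, daemon=True).start()
    return read_end


def wait_flushed(fd: int, timeout: float) -> bool:
    """Block in select() until fd's data is flushed; True on success."""
    flushed_fd = notify_when_flushed(fd)
    try:
        readable, _, _ = select.select([flushed_fd], [], [], timeout)
        return bool(readable) and os.read(flushed_fd, 1) == b"\x01"
    finally:
        os.close(flushed_fd)
```

In a real event loop the pipe fd would simply go into the same select() call as the app's sockets, so the app keeps working while the flush is in flight.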
ext4 doesn't really *have* an ordered mode yet.
As long as ZFS licensing is incompatible with the GPL it's never going in. The person from that blog you linked understood something you clearly did not.
"The only way I'm seeing ZFS on the Linux kernel is to convince Sun to dual-license ZFS under the GPL and the CDDL."
You might not like the GPL but suggesting Linux developers should ignore it is not informative, it's completely retarded.
> What we need is more, and not less, of such an aggressive attitude. A real man can take it.
That depends if you're trying to construct a team of "real men" or a team of skilled developers.
People sometimes confuse the idea or the act with the person associated with it. If I propose a stupid idea or commit a stupid act, then by all means call me out and tell me that it's stupid and why. But save the ad hominem attacks. Calling somebody a moron accomplishes no good thing, and doing it in public is an extremely quick and effective way of destroying team morale.
I am literally 3000 tokens away from the chaotic crossbow --Stephen
...if you want the state of the art in data integrity. (Checksumming, transactional copy on write, self healing, simple pool management, snapshots, filesystems, etc.) Read more: Solaris 10, OpenSolaris.
you had me at #!
So far Linux has nothing even close.
you had me at #!
I think it's more a matter of dealing with divas all day. It's pretty clear that the two sides of this issue are the side with technical people convinced that the correctness of the journaling system overcomes any difficulties with integrity, and people who think that integrity should be paramount. For most users, disk integrity IS the number one priority. It seems to me that this is a case of some people not being able to see that they're wrong.
In a corporation, it's as simple as saying, "do it our way or hit the street." With Linux development the leaders don't have that power, so they may replace it with forcefulness. Besides, the honesty is kind of refreshing. Linus lays out a clear argument and only then starts insulting the other person. He's being brutal, but he's giving them more information than a more polite person might.
ZFS, on the other hand, is production ready today.
you had me at #!
Sorry, but no, it isn't. You will hear them screaming utter murder, when their OS needs half an hour to boot, and a file copy only goes with a few kB/s.
Users want integrity AND speed. Most won't even know there's a difference. So it's always a trade off between safety and speed. At least til we get copy-on-write filesystems and fast, big SSDs on a large scale.
Maybe Linus should just fix it instead of whining about it. It's open source, dammit.
Yes, if only Linus was more like Theo de Raadt then Linux would see much more adoption...
Do you even know any Americans?
BTW, Theo is Canadian.
> Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle
Linus is not clueless in this case. I think it is a case of you misinterpreting the issue he was discussing.
Journaling is, as you say NOT about data integrity/prevention of data loss. That is what RAID and UPSes are for. However, it IS about data CONSISTENCY. Even if a file is overwritten, truncated or otherwise corrupted in a system failure (i.e. loss of data integrity) the journal is supposed to accurately describe things like "file X is Y bytes in length and resides in blocks 1,2,3...." (data/metadata consistency). Why would you update that information before you are sure the data was actually changed? A consistent journal is the WHOLE REASON why you can "alleviate the delay caused by fscking".
Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.
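The ordering argument is easy to see in a toy model. The Python sketch below is purely illustrative: it is not how ext3's journal is implemented, the block and field names are invented, and "metadata_first" stands for the worst case that writeback mode permits rather than something it always does. It simulates a crash after the first of two writes and reports what a reader of the new file would see afterwards.

```python
def crash_after_first_step(order: str) -> str:
    """Simulate losing power after the first of two writes.

    Returns the file contents a reader would see after reboot.
    """
    # One disk block that still holds the contents of a deleted file.
    disk = {"block_7": "stale data from a deleted file"}
    # Metadata of the new file: it does not point at any block yet.
    inode = {"blocks": []}

    def write_data() -> None:
        disk["block_7"] = "new file contents"

    def write_metadata() -> None:
        inode["blocks"] = ["block_7"]

    steps = {
        "data_first": [write_data, write_metadata],
        "metadata_first": [write_metadata, write_data],
    }[order]

    steps[0]()  # the first write reaches the disk...
    # ...and the machine dies before steps[1] runs.

    return "".join(disk[name] for name in inode["blocks"])


if __name__ == "__main__":
    for order in ("data_first", "metadata_first"):
        print(order, "->", repr(crash_after_first_step(order)))
```

With data first, the crash leaves an empty but consistent file. With metadata first, the file points at a block that still holds somebody else's old data.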
I don't know much about linux file systems, but now I know more than I want to. What idiot writes pointers to data that's not there yet?
The last non-trivial file system I worked on was on the Sigma 7, circa 1969, and its update sequence carefully avoided doing that; it's not like this is a new discovery. It's a basic engineering principle: "Make before Break."
And these guys have the effrontery to call themselves "software engineers."
On the other hand, they're working for free, so gift-horse and all that.
I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
This is actually funny to anyone who read through the threads and understands the problem.
Does Slashdot have ato
If you don't like the way disks work in a power outage, just switch to drum storage. Its angular momentum means that it would keep turning long enough to dump the entire core (OK, this is a bit ancient) to the drum. Sometimes, the "UPS" was a generator attached to the drum, so it powered the cpu. The drum was spun by separate motors, and had a read/write head on each track: no seek time, read & write in parallel to all tracks, great for virtual memory. They were noisy power-hogs, however.
http://en.wikipedia.org/wiki/Drum_memory
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Databases are insanely complex for a variety of reasons and nobody wants all that complexity in a kernel. Databases also add quite a lot of overhead for stuff which most (you see what I did there?) applications simply don't need.
someone speaks some sense. POSIX simply currently lacks fbarrier(...).
HAND.
If Linus et al don't like the way ext3 works, they shouldn't complain about the developer, they should change it. After all, they have the source code.
Ah, that felt good!
"Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm."
Where is this gentle programming territory in the US? Remember, the Daily WTF was started in the USA - not exactly a font of tenderness.
> He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking.
Well I was unaware of it, too.
And when I did a journaling system back in the mid '80s the whole POINT of it was to maintain a consistent ("though not necessarily current") filesystem on the disk at all times. ("Not necessarily current" means transactions that haven't yet hit the disk get lost in a crash. So if you want to build a reliable transaction processor on top of it you have a bit more to do.)
The idea behind it: Servers are intended to run continuously. So the commonest mode of shutdown will be system crash. Thus the server needs to:
1) Always be able to recover from a crash.
2) Do it very quickly.
(Once you have that you don't even need a shutdown mechanism. Just kill it. Kick off the clients first if you're really concerned about not reversing transactions.)
I had THOUGHT that the journaling file systems we've come to know and deploy were also based on this set of ideas. If they AREN'T, it's time to build one that IS.
(And if I'd known earlier that they weren't I might have gone and done it. B-( )
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Whatever you decide in terms of the filesystem, if you're serious about not losing data:
* RAID your disks to protect yourself when one of them goes south. It will. Linux software RAID is perfectly serviceable for this.
* Spend $50 or $100 for a monitorable UPS and run nut, so that Murphy doesn't have to prove to you that any filesystem can fail horribly if power is removed at exactly the right moment.
* Back your data up. If it's not worth backing up, it's not worth keeping. Murphy will happily prove that to you, too, sooner or later.
Ted Ts'o, as Linus knew perfectly well.
Ted's a great guy, and he can take a (metaphorical) punch in the chops, and Linus knows that too.
I call my male friends morons sometimes. We're friends, and we're guys, so it's OK.
Also, note that while Ted did the dirty deed, he was following pretty well accepted thinking in the fs theory world. (It's wrong thinking in my opinion, but I'm a dinosaur who has always run all his filesystems fully sync'ed so don't listen to me.) So you could say that Linus was calling the "common wisdom" moronic, and castigating Ted for implementing it, rather than calling Ted a moron.
Either way, Ted's a big guy, he's got major geek chops (he's the guy behind capabilities after all) so he can take it. Yay Ted! We love ya, ya big moron!!!
Reiser3 is far superior to Ext3. This point has just been proven over and over again.
With the latest results from Phoronix, where they stated they used generic builds from Ubuntu, you will note that Reiser3 tended to have a close race with Ext3. But should you build those generic kernel sources to the specifications of the hardware being used, which was a dual-core Intel (Xeon/Core2), you'll see 50% gains in performance. To show how slow Ext3 is, just copy a huge file between two partitions. Uncompress the kernel. Compress the kernel. Load up a game with a lot of resources.
I don't think there are any better alternatives out there at the moment. JFS is too slow, XFS is slow. Btrfs? Reminds me of the Magrathean, Slartibartfast. Maybe it will be the Reiser4 we all wanted, but it's alpha.
Best test is to install an OS. Time the install.
If I can get a distribution installed it's Reiserfs and nothing else.
Crash testing was something my children were good at; turning off the computer without shutting down.
Sometimes I find myself so tired I don't bother waiting for the torrents to finish or the process to wind down; flip the switch.
If you're always looking ahead you're never looking behind nor are you looking around. You're just acting like a clown.
2.6.29 is a moist wet disappointment. .26/28 versions of what I needed, more like copied into my source tree. Madwifi's release still worked great on the .24 series. But later versions changed too much.
I stand firm that 2.6.24.5 is a great kernel. I backported the
I'm still mad at Slackware for including .27 with 12.2. It's almost like none of those guys run Atheros wireless.
Which brings me to Debian: last time I installed it I couldn't find xorgconfig, xorgsetup, nothing. Why no default for Reiserfs partitioning schemes? It's always manual. apt-get install this, that, and the other. Wow, what an anal distribution.
Something else that really bothers me is why we can't get a really great distribution of Linux anymore.
You can't even Warez one these days. I mean even the commercial versions don't include DVD support, the 100s of codecs you need, nor do you find the proprietary sub-hinting enabled for freetype. Bullshit
I mean I can find all the copies of Left4Dead I want, 360 iso's, and every new piece of adobe software made. But it doesn't benefit me.
So tired of all this mess. Time to play on the one computer product Microsoft got right, The Xbox 360. ... I guess they frakked that up too.
Wait
Hope everybody enjoyed the last fraking episode of BattleShit Gynochologitca. More women had their panties down on that show than any other '70s space TV show. Even Old Series Star Trek paled in comparison. Not that I hated it, I loved it for a time. It was like watching Bleach anime. You get fraking pissed off, quit, and then have to go download all the episodes you fraking missed. So many fillers, so little time.
But Kattie Slackware was worth the next to last episode, I mean to see her on the toilet!!! Wow, and she barely washed up. Dirty bird.
I certainly wouldn't use LVM, RAID-5, ext4, XFS, or Linux. I'd use Solaris 10 or OpenSolaris and ZFS.
you had me at #!