Kernel Hackers On Ext3/4 After 2.6.29 Release

Posted by timothy on Wednesday March 25, 2009 @12:18AM from the good-things-come-from-certain-clashes dept.

microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-time performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"

Slow performance by rootnl · 2009-03-25 00:23 · Score: 4, Funny

The server is taking too long to respond; please wait a minute or 2 and try again.
Mmmh, must be a big problem

--

We are the people our parents warned us about.
1. Re:Slow performance by morgan_greywolf · 2009-03-25 00:51 · Score: 5, Funny
  
  Well, they had to switch the lkml server to ext3 because posts kept getting killed and cut into pieces with their old filesystem and the admins just kept saying "Well, they must've gone to Russia."
  
  --
  My blog
2. Re:Slow performance by markov_chain · 2009-03-25 01:56 · Score: 1
  
  The web server tried to fsync() on the logs and keeps waiting for 2+ minutes. Good luck.
  
  --
  Tsunami -- You can't bring a good wave down!
3. Re:Slow performance by ultranova · 2009-03-25 03:38 · Score: 1
  
  You sure that wasn't an ad for Viagra targeted specifically to the over-the-hill nerd community?
  
  What would a nerd need Viagra for ?^)
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
4. Re:Slow performance by SIR_Taco · 2009-03-25 05:51 · Score: 1
  
  I wasn't implying it was a well thought out marketing campaign.
  
  --
  I say don't drink and drive, you might spill your drink. Before you get behind the wheel just stop and think.
5. Re:Slow performance by mrsteveman1 · 2009-03-25 06:07 · Score: 4, Funny
  
  You sure that wasn't an ad for Viagra targeted specifically to the over-the-hill nerd community?
  What would a nerd need Viagra for ?^)
  Longer uptime of course
6. Re:Slow performance by Zero__Kelvin · 2009-03-25 07:16 · Score: 1
  
  Here is your word of the day.
  
  --
  Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Idiotic by baadger · 2009-03-25 00:27 · Score: 5, Informative

Mirror for the thread:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699
lkml.org server is slashdotted. by javilon · 2009-03-25 00:28 · Score: 4, Funny

this is what I get from http://lkml.org/lkml/2009/3/24/460:
"The server is taking too long to respond; please wait a minute or 2 and try again."
Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.
Like me... :-)

--

When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
1. Re:lkml.org server is slashdotted. by FernandoTorres · 2009-03-25 00:44 · Score: 5, Funny
  
  Well this is just my meta comment. I'll be writing my real comment later...
2. Re:lkml.org server is slashdotted. by Tei · 2009-03-25 00:44 · Score: 1
  
  I doubt it. I suppose it has been preemptively put offline. Now it is not the slashdot effect, it is the slash-leper.
  
  --
  -Woof woof woof!
3. Re:lkml.org server is slashdotted. by hesaigo999ca · 2009-03-25 01:08 · Score: 2, Funny
  
  I actually read it, and the emails from Linus are a really good read; his performance was, as usual, quite outstanding.
4. Re:lkml.org server is slashdotted. by Anonymous Coward · 2009-03-25 01:34 · Score: 5, Insightful
  
  Well this is just my meta comment. I'll be writing my real comment later...
  You forgot to include a link to the comment you'll be writing later.
5. Re:lkml.org server is slashdotted. by digitalunity · 2009-03-25 01:42 · Score: 1
  
  This would be one of those posts where a score over 5 is appropriate.
  Would have been funnier though if it was Linus saying it in lkml.
  
  --
  You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
6. Re:lkml.org server is slashdotted. by linuxrocks123 · 2009-03-25 02:30 · Score: 5, Insightful
  
  Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.
  
  --
  vi ~/.emacs # I'm probably going to Hell for this.
7. Re:lkml.org server is slashdotted. by AigariusDebian · 2009-03-25 02:45 · Score: 5, Informative
  
  On-disk state must always be consistent. That was the point of journaling, so that you do not have to do a fsck to get to a consistent state. You write to a journal what you are planning to do, then you do it, then you activate it and mark it done in the journal. At any point in time, if power is lost, the filesystem is in a consistent state: either the state before the operation or the state after the operation. You might get some half-written blocks, but that is perfectly fine, because they are not referenced in the directory structure until the final activation step is written to disk, and those half-written blocks are still considered empty by the filesystem.
8. Re:lkml.org server is slashdotted. by thomasdz · 2009-03-25 02:48 · Score: 4, Interesting
  
  You forgot to include a link to the comment you'll be writing later.
  Maybe the power failed in the middle of him writing his comment?
  Don't worry...it'll appear in some other Slashdot thread until CmdrTaco does a fsck.
  
  --
  Karma: Excellent. 15 moderator points expire sometime.
9. Re:lkml.org server is slashdotted. by jnetsurfer · 2009-03-25 03:17 · Score: 1
  
  All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.
  What about ZFS? Doesn't ZFS have a bunch of checksumming and hardware failure tolerance functionality which you "probably need"?
10. Re:lkml.org server is slashdotted. by gclef · 2009-03-25 03:37 · Score: 2, Informative
  
  Actually, he has a valid point: the user doesn't give a damn about whether their disk's metadata is consistent. They care about their actual data. If a filesystem is sacrificing user data consistency in favor of metadata consistency, then it's made the wrong tradeoff.
11. Re:lkml.org server is slashdotted. by Anonymous Coward · 2009-03-25 03:39 · Score: 4, Informative
  
  No, you're the one who's clueless.
  The issue (as Linus said) isn't that the journalling is providing data integrity, it's that doing the journalling the wrong way causes *MORE* data loss.
  Basically, you're sacrificing data integrity for speed, when you don't need to.
  Perhaps you should work on your reading comprehension.
12. Re:lkml.org server is slashdotted. by dotancohen · 2009-03-25 04:24 · Score: 1
  
  If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.
  Why _isn't_ there such a capacitor in the system? How about doing the caching in the hard drive's memory, and having a large capacitor on the drive itself. For all the system knows, once the data's gone over the SATA connection it's as good as written. And with a small redundant power supply on the drive, that may well be possible.
  
  --
  It is dangerous to be right when the government is wrong.
13. Re:lkml.org server is slashdotted. by raddan · 2009-03-25 04:25 · Score: 1
  
  Accidentally put a user's file in the wrong directory, or don't link it at all, and I can assure you that they will care about metadata consistency. A file is pretty much useless in a modern filesystem unless the directory points to its inode.
  
  Point being-- it is much faster to buffer the write, write the metadata first, unblock the process, and do the rest of the writing in the background. At this point, everyone is well aware of the tradeoffs. Journaling just makes sure you don't end up in an inconsistent state as far as the FS is concerned-- this does NOT mean that your data won't be corrupted! Along the lines of what GP says, if you need certain characteristics, like atomicity, then you probably also need the rest of the ACID stuff, too, so you're probably better off with a real database.
14. Re:lkml.org server is slashdotted. by inasity_rules · 2009-03-25 04:56 · Score: 1
  
  Why _isn't_ there such a capacitor in the system? [...]
  Cost. Btw a capacitor big enough to run your hard drive long enough to practically write your memory + disk cache is not realistic. We're talking batteries here. And guess what? Batteries wear out (how long does a laptop battery last...?). Just get a UPS and configure your system to halt/suspend to disk immediately on mains power loss. (and yes I am an Electronic engineering student.)
  
  --
  I have determined that my sig is indeterminate.
15. Re:lkml.org server is slashdotted. by Jonner · 2009-03-25 05:03 · Score: 1
  
  I think this idea has been used on some high end server hardware for some years and is called "transactional RAM." I certainly think it would make sense to apply to all non-volatile storage devices. Maybe it would be easier to use in a hybrid disk/flash design, since it might require less energy to guarantee writes to the flash would complete than to keep a platter spinning. The flash would be used by the drive as part of its write-back cache.
16. Re:lkml.org server is slashdotted. by drinkypoo · 2009-03-25 05:43 · Score: 1
  
  He cares that he doesn't have to fsck almost every boot.
  Some of us have discovered the 'shutdown' command. It's a big upgrade from the days when you had to run 'sync ; sync ; halt' because sync was non-blocking, and typing the second 'sync' gave the first one time to finish. (sync isn't non-blocking any more...) Anyhow, I suggest you use it occasionally. Then perhaps you can only fsck when something bad has happened.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
17. Re:lkml.org server is slashdotted. by ultranova · 2009-03-25 06:21 · Score: 1
  
  Point being-- it is much faster to buffer the write, write the metadata first, unblock the process, and do the rest of the writing in the background.
  
  It's even faster to buffer both data and metadata, unblock the process, and write the data and metadata in the background in that order.
  This sounds suspiciously similar to the recent ext4 bug; if it is, the fix is the same.
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
18. Re:lkml.org server is slashdotted. by mmontour · 2009-03-25 06:56 · Score: 3, Insightful
  
  Some of us have discovered the 'shutdown' command. [...]Anyhow, I suggest you use it occasionally. Then perhaps you can only fsck when something bad has happened.
  Don't be too smug - a "shutdown" doesn't always guarantee a clean startup. I remember a bug (hopefully fixed now) where "shutdown" was completing so quickly that it powered off the computer while data was still sitting in the hard drive's volatile write cache. Even though the OS had unmounted the filesystem, the on-disk blocks were still dirty.
  p.s. If any OS/kernel developers are listening - how about implementing a standard API through which drive write-caches can be flushed+disabled whenever a system starts a shutdown procedure, gets a signal that the UPS is running on battery power, or otherwise concludes that it is in a state where a temporarily-increased risk of data loss justifies slowing down I/O?
19. Re:lkml.org server is slashdotted. by Darinbob · 2009-03-25 07:23 · Score: 1
  
  There is a confusion about "data integrity" too, which is where I think the ext3 vs ext4 problem came in. There is data loss as well as data corruption. All file systems will have to deal with some form of data loss after a crash or power failure (or being thrown out a window during a stressful week). Users and applications have to accept this. However data corruption is much more severe and can be catastrophic and not detected for a very long time (months or years).
  Ext3 has a potential for data corruption as I read it; whereas initially people complained about ext4 because of data loss (and applications that were sloppy with fsyncs or made assumptions about file system behavior).
20. Re:lkml.org server is slashdotted. by Carewolf · 2009-03-25 07:39 · Score: 1
  
  As the GP said, that only applies to metadata. In ext3 you can configure it to journal data as well, but you don't want to do that; it is very slow.
21. Re:lkml.org server is slashdotted. by Ungrounded+Lightning · 2009-03-25 09:29 · Score: 1
  
  If you need more than [metadata consistency] -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes ...
  Garbage.
  There are schemes for writing the data and metadata, along with an indication of transaction boundaries, that provide consistency on the disk at all times.
  The simplest ones have the downside that they write the data twice. That can be worked around - usually at the cost of slowing crash recovery a bit.
  
  --
  Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
22. Re:lkml.org server is slashdotted. by Anonymous Coward · 2009-03-25 10:27 · Score: 1, Insightful
  
  Yes, but in this case ext3 and ext4 keep (convenient, fast) consistency of the filesystem at the cost of worse behavior regarding the user experience (and user data).
23. Re:lkml.org server is slashdotted. by Methlin · 2009-03-25 13:33 · Score: 1
  
  p.s. If any OS/kernel developers are listening - how about implementing a standard API through which drive write-caches can be flushed+disabled whenever a system starts a shutdown procedure, gets a signal that the UPS is running on battery power, or otherwise concludes that it is in a state where a temporarily-increased risk of data loss justifies slowing down I/O?
  You mean something like ATAPI and apcupsd? Welcome to 1996.
24. Re:lkml.org server is slashdotted. by dirtyhippie · 2009-03-25 14:20 · Score: 1
  
  According to whom?
  http://en.wikipedia.org/wiki/Journaling_file_system
  
  A journaling file system is a file system that logs changes to a journal (usually a circular log in a dedicated area) before committing them to the main file system. Such file systems are less likely to become corrupted in the event of power failure or system crash.
  
  Linus has been known to be a jerk, but I think I'll trust him on this over you.
25. Re:lkml.org server is slashdotted. by Hal_Porter · 2009-03-25 18:09 · Score: 1
  
  Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking.
  Most Linux users don't fsck very often though. Fscking is a very rare case where you must be correct, but only should be performant. It's an issue if the fscking fscks your data, which is what happens if the metadata is more recent than the data.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
26. Re:lkml.org server is slashdotted. by Hal_Porter · 2009-03-25 18:19 · Score: 1
  
  FAT doesn't have a journal, but it does guarantee metadata consistency. When a transaction starts on disk it sets a bit in the second FAT entry, then it starts the update, then it clears the bit. If you mount the volume and the bit is set you should run a chkdsk. Chkdsk is painfully slow in the absence of a journal because you have to scan every directory entry and check them against the FAT for consistency but at the end of the process the metadata would be consistent again.
  Of course for the sort of volume sizes FAT was designed for, even this sort of chkdsk isn't too bad. And on Windows chkdsk does a lot of consistency checks even on NTFS - it doesn't just undo any transactions which are marked as pending in the journal, it actually checks all the indexes too.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
27. Re:lkml.org server is slashdotted. by Cheapy · 2009-03-26 03:05 · Score: 1
  
  it'll appear in some other Slashdot thread until CmdrTaco does a fsck.
  So it's going to be in another thread for eternity? :)
  
  --
  Would you kindly mod me +1 insightful?
28. Re:lkml.org server is slashdotted. by dotancohen · 2009-03-26 03:41 · Score: 1
  
  Really? I figure that a few seconds would be enough to get a measly cache written to disk. Even if it is to an "emergency sector" specifically designated for the role, which could be used to write the data properly later.
  
  --
  It is dangerous to be right when the government is wrong.
29. Re:lkml.org server is slashdotted. by inasity_rules · 2009-03-26 04:27 · Score: 1
  
  Except that the cache is not "measly" (running into multiple megabytes) and a capacitor that could seriously power a drive for a few seconds is not small (in any way). It takes quite a lot of power to maintain the drive spin and move the drive head around. A laptop drive chews about 2.5 - 5W or more doing this. Remember that you can't use all the energy stored in a capacitor; even with a step up converter (which adds inefficiency) there comes a point where the voltage drops too low.
  You could probably get one or two sectors with a fairly large capacitor, but it simply isn't very practical or cost effective. Or surprisingly it would have been done and be mainstream technology by now...
  No, batteries are the only way out, except that batteries (especially the popular LiIon) tend to die after a relatively short period of time. You could argue that the drive would be obsolete by that time, but not everyone works that way. Besides, it's too expensive for too little gain. Software has shown that it can deal with this acceptably. Nothing is 100%; if you use a computer, chances are high you will lose data at some point.
  If you're really paranoid, those flash hybrid hard disks everyone was talking about some time back may be an option. But I suspect they'll have similar issues.
  
  --
  I have determined that my sig is indeterminate.
30. Re:lkml.org server is slashdotted. by Thinboy00 · 2009-03-26 13:55 · Score: 1
  
  I'd say ~34 mounts until it forces it... unless he played with tune2fs, in which case, yes it will be stuck for eternity.
  
  --
  $ make available
31. Re:lkml.org server is slashdotted. by Thinboy00 · 2009-03-26 14:11 · Score: 1
  
  ext3 "forces" a fsck every ~37 mounts, but you can override it with tune2fs.
  
  --
  $ make available
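
The write-ordering argument in this thread (journal the intent, write the data, and only then publish the metadata that points to it) can be illustrated with a toy model. The sketch below is not how ext3 or any real filesystem is implemented; the `simulate` function, the block number, and the dictionary standing in for a disk are all hypothetical, chosen only to show why a crash between the two steps is harmless in one order and harmful in the other.

```python
def simulate(order, crash_after):
    """Toy model of one file append; returns what the file appears to contain.

    order       -- sequence of step names, e.g. ("data", "metadata")
    crash_after -- number of steps that reach the disk before power is lost
    """
    # Hypothetical disk: block 7 is free but still holds bytes from an old file.
    blocks = {7: "stale garbage"}
    # Hypothetical inode: the list of block numbers the file claims to own.
    inode = []

    def write_data():
        blocks[7] = "new contents"

    def write_metadata():
        inode.append(7)

    steps = {"data": write_data, "metadata": write_metadata}

    # Only the first `crash_after` steps are applied before the crash.
    for name in order[:crash_after]:
        steps[name]()

    # After reboot, the file's contents are whatever its inode points to.
    return [blocks[b] for b in inode]


if __name__ == "__main__":
    # Metadata first, crash before the data arrives: the file exposes stale bytes.
    print(simulate(("metadata", "data"), crash_after=1))
    # Data first, crash before the metadata arrives: the file is simply unchanged.
    print(simulate(("data", "metadata"), crash_after=1))
```

In this model the metadata-first order leaves the file pointing at `"stale garbage"` after the crash, while the data-first order leaves an empty but valid file. When both steps complete, the two orders give the same result.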
Let me guess... by Puls4r · 2009-03-25 00:43 · Score: 5, Funny

The server is running linux.
1. Re:Let me guess... by UnRDJ · 2009-03-25 00:46 · Score: 5, Funny
  
  too much karma for your tastes?
2. Re:Let me guess... by Anonymous Coward · 2009-03-25 00:53 · Score: 4, Informative
  
  According to Netcraft, yes. Ubuntu.
  
  Wait, this is Slashdot... I need a cliche... uh...
  
  Netcraft confirms it: that server is dying?
OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 00:47 · Score: 5, Insightful

Quote from Linus:

"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.
How about ASKING them rather than calling them Morons?
(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)
TDz.
1. Re:OK, then... *WHO* is the official ext3 "moron"? by morgan_greywolf · 2009-03-25 01:03 · Score: 3, Informative
  
  Most likely Ted Ts'o, based on the git commit logs. I say most likely because someone more familiar with the kernel git repo than myself should probably confirm or deny this statement.
  
  --
  My blog
2. Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 01:09 · Score: 5, Insightful
  
  Torvalds knows exactly who it is, and most people following the discussion will probably know it, too.
  Also, there has been a fairly public discussion including a statement by the responsible person in question.
  Not saying the name is Torvalds's attempt at saving grace. Similar to a parent of two children saying, "I don't know who made the mess, but when I come back, it had better be cleaned up."
  Yes, Mr. Torvalds is fairly outspoken.
3. Re:OK, then... *WHO* is the official ext3 "moron"? by morgan_greywolf · 2009-03-25 01:16 · Score: 2, Informative
  
  I can see you've never written any filesystem drivers ;). It's not quite that simple, but more or less that's the type of change you'd make.
  
  --
  My blog
4. Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · 2009-03-25 01:17 · Score: 4, Interesting
  
  Well, some Linux filesystem developers (and some fanboys) have been chastising other (higher-performance) filesystems for not providing the guarantees that ext3 ordered mode provides.
  Application developers hence were indirectly educated to not use fsync(), because apparently a filesystem giving anything other than the ext3 ordered mode guarantees is just unreasonable, and ext3 fsync() performance really sucks. (The reason why you don't actually *want* what fsync implies has been explained in the previous ext4 data-loss posts).
  Some of those developers are now complaining that their "new" filesystem (designed to do away with the bad performance of the old one) is disliked by users who are losing data due to applications being encouraged to be written in a bad way, and telling the developers that they now should add fsync() anyway (instead of fixing the actual problem with the filesystem).
  Moreover, they are complaining that the application developers are "weird" because of expecting to be able to write many files to the filesystem and not having them *needlessly* corrupted. IMAGINE THAT!
  As an aside joke, the "next generation" btrfs which was supposed to solve all problems has ordered mode by default, but it's an ordered mode that will erase your data in exactly the same way as ext4 does.
  Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.
5. Re:OK, then... *WHO* is the official ext3 "moron"? by 644bd346996 · 2009-03-25 01:24 · Score: 5, Informative
  
  ext3 was merged to the mainline kernel in 2001. Git was created in 2005. I wouldn't trust any authorship evidence in a git repo for code predating the repo.
  The journalling behavior of ext3 was probably decided by Stephen Tweedie
6. Re:OK, then... *WHO* is the official ext3 "moron"? by red_dragon · 2009-03-25 01:25 · Score: 4, Funny
  
  they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus
  He's following Ext3 writeback semantics. You'll have to wait for a patch to fix his behaviour.
  
  --
  In Soviet Russia, Jesus asks: "What Would You Do?"
7. Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 01:34 · Score: 2, Funny
  
  nonsense. we can always use a good pillorying
  That's why I'm an OpenBSD developer as well... I like the abuse and scorn that Theo throws at me.
  It's good to see that Linus is becoming more like Theo. What's the quote: "that which doesn't kill me only makes me stronger"
8. Re:OK, then... *WHO* is the official ext3 "moron"? by morgan_greywolf · 2009-03-25 01:34 · Score: 2, Informative
  
  Right, but this problem doesn't go back to 2001.
  
  --
  My blog
9. Re:OK, then... *WHO* is the official ext3 "moron"? by houghi · 2009-03-25 01:35 · Score: 5, Insightful
  
  Knowing the humor that Linus has, it could be himself.
  
  --
  Don't fight for your country, if your country does not fight for you.
10. Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 01:46 · Score: 1, Funny
  
  It is that simple, but the GPL explicitly forbids to change the write order of file system operations without the written consent of the author who lives currently in a hut on an isolated Pacific island.
11. Re:OK, then... *WHO* is the official ext3 "moron"? by Ecuador · 2009-03-25 02:03 · Score: 5, Funny
  
  Yep, we urgently need some kind of killer FS for Linux...
  Oh, wait...
  
  --
  Violence is the last refuge of the incompetent. Polar Scope Align for iOS
12. Re:OK, then... *WHO* is the official ext3 "moron"? by BigBuckHunter · 2009-03-25 02:06 · Score: 2, Informative
  
  Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.
  I agree that the who-dun-it part is irrelevant. I disagree on the "SO f***d" part. We have three filesystems that write the journal prior to the data. Basically, we know the issue, and a similar fix can be shared amongst the three affected filesystems. We've had far more "f***d" situations than this (think etherbrick-1000) where hardware was being destroyed without a good understanding of what was happening. Everything will work out as it seems to have everyone's attention.
  
  BBH
13. Re:OK, then... *WHO* is the official ext3 "moron"? by SpinyNorman · 2009-03-25 02:19 · Score: 4, Insightful
  
  fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.
  I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.
  IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
14. Re:OK, then... *WHO* is the official ext3 "moron"? by gbjbaanb · 2009-03-25 02:28 · Score: 2, Interesting
  
  hm. Similar to a parent of two children ranting at them without taking time to think first. Calling them morons is just going to get them growing up to be dysfunctional at best. No wonder the world has a dim view of the "geek" community.
  It seems to me that, as usual, the issue is not as clear cut as it first appears
15. Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · 2009-03-25 02:29 · Score: 2, Insightful
  
  I agree that the who-dun-it part is irrelevant. I disagree on the "SO f***d" part. We have three filesystems that write the journal prior to the data. Basically, we know the issue, and a similar fix can be shared amongst the three affected filesystems.
  I would be very surprised if the fix can be shared between the filesystems. At least the most serious among those involved, XFS, sits on a complete intermediate compatibility layer that makes Linux look like IRIX.
  Linux filesystems are seriously in a bad state. You simply cannot pick a good one. Either you get one that does not actively kill your data (ext3 ordered/journal) or you pick one which actually gives decent performance (anything besides ext3).
  Obviously, we should have both. It's not like that is impossible. But it's surprising how long those problems lasted. It's not like filesystems are a MINOR part of the entire OS.
  Probably part of the reason is that we have JFS, XFS, ext3/4, reiser3/4, tux3, btrfs... Filesystem developers suffer very heavily from NIH syndrome. Instead of one good one we have 8 that "almost" work.
  But almost is not good for something so essential. This is not the kind of choice that is good. It's time one filesystem wins, gets fixed, and the rest is left dead.
16. Re:OK, then... *WHO* is the official ext3 "moron"? by Kjella · 2009-03-25 02:38 · Score: 1
  
  Would you care to make an educated guess on how many run one of said three filesystems - particularly ext3, compared to using an etherbrick-1000? Scale matters, even if it sucks equally much if *your* data was eaten by a one-in-a-billion freak bug or a common one.
  
  --
  Live today, because you never know what tomorrow brings
17. Re:OK, then... *WHO* is the official ext3 "moron"? by Bill,+Shooter+of+Bul · 2009-03-25 02:40 · Score: 1
  
  Ahh... That link explains a lot. However, I have a different parenting strategy. If the kid does something wrong, let him know it. If he does something good, let him know it too. Calling them a moron is ok, as long as it's balanced out with genius every now and then. Of course, don't actually use the word if the kid is a moron. Like Linus, that should only be used to indicate a temporary lapse of judgment in an otherwise intelligent person.
  
  --
  Well.. maybe. Or Maybe not. But Definitely not sort of.
18. Re:OK, then... *WHO* is the official ext3 "moron"? by David+Greene · 2009-03-25 02:43 · Score: 1
  
  How about ASKING them rather than calling them Morons?
  Ah, but that would mean that Linus would have to grow up and actually lead.
  
19. Re:OK, then... *WHO* is the official ext3 "moron"? by Rich0 · 2009-03-25 02:49 · Score: 3, Interesting
  
  I agree. What we need is a mechanism for an application to indicate to the OS what kind of data is being written (in terms of criticality/persistence/etc). If it is the gimp swapfile chances are you can optimize differently for performance than if it is a file containing innodb tables.
  Right now app developers are having to be concerned with low-level assumptions about how data is being written at the cache level, and that is not appropriate.
  I got burned by this when my mythtv backend kept losing chunks of video when the disk was busy. Turns out the app developers had a tiny buffer in ram, which they'd write out to disk, and then do an fsync every few seconds. So, if two videos were being recorded the disk is constantly thrashing between two huge video files while also busy doing whatever else the system is supposed to be doing. When I got rid of the fsyncs and upped the buffer a little all the issues went away. When I record video to disk I don't care if, when the system goes down, in addition to losing the next 5 minutes of the show during the reboot I also lose the last 20 seconds as well. This is just bad app design, but it highlights the problems when applications start messing with low-level details like the cache.
  Linux filesystems just aren't optimal. I think that everybody is more interested in experimenting with new concepts in file storage, and they're not as interested in just getting files reliably stored to disk. Sure, most of this is volunteer-driven, so I can't exactly put a gun to somebody's head to tell them that no, they need to do the boring work before investing in new ideas. However, it would be nice if things "just worked".
  We need a gradual level of tiers ranging from a database that does its own journaling and needs to know that data is fully written to disk to an application swapfile that if it never hits the disk isn't a big deal (granted, such an app should just use kernel swap, but that is another issue). The OS can then decide how to prioritize actual disk IO so that in the event of a crash chances are the highest priority data is saved and nothing is actually corrupted.
  And I agree completely regarding transaction support. That would really help.
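  For what it's worth, here is a minimal sketch of the buffering idea described above, written in Python rather than the C++ that MythTV uses. It collects small chunks in RAM and hands them to the kernel in large batches, with no fsync() per chunk. The class name `BufferedRecorder`, the `record_demo` helper, and the thresholds are illustrative assumptions, not MythTV code.

```python
import os
import tempfile


class BufferedRecorder:
    """Collects small chunks in RAM and writes them out in large batches.

    Hypothetical helper for illustration only. It never calls fsync(),
    so the kernel is free to schedule and reorder the actual disk I/O.
    """

    def __init__(self, path, flush_threshold=32 * 1024 * 1024):
        self._file = open(path, "wb")
        self._chunks = []
        self._pending = 0
        self._threshold = flush_threshold
        self.batches = 0  # number of batches handed to the kernel

    def write(self, chunk):
        self._chunks.append(chunk)
        self._pending += len(chunk)
        if self._pending >= self._threshold:
            self.flush()

    def flush(self):
        if self._chunks:
            self._file.write(b"".join(self._chunks))
            self._file.flush()  # pass the batch to the kernel; no fsync()
            self.batches += 1
            self._chunks = []
            self._pending = 0

    def close(self):
        self.flush()
        self._file.close()


def record_demo(chunk_count=100, chunk_size=1024, threshold=16 * 1024):
    """Writes chunk_count chunks and returns (bytes in file, batches)."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "recording.bin")
        recorder = BufferedRecorder(path, flush_threshold=threshold)
        for _ in range(chunk_count):
            recorder.write(b"x" * chunk_size)
        recorder.close()
        return os.path.getsize(path), recorder.batches
```

  The trade-off is the one stated above: after a crash you lose whatever was still in the buffer or the page cache, which is acceptable for a recording but not for a database.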
20. Re:OK, then... *WHO* is the official ext3 "moron"? by Lord+Ender · 2009-03-25 02:50 · Score: 1
  
  I think it's safe to say that anyone capable of writing a filesystem module at all is far above the "moron" level on the human intelligence scale. Furthermore, anyone willing to volunteer their time by writing such software and donating it to the ungrateful world should be thanked, mistakes or not.
  Linus seems to have the wrong temperament for managing a project of humans.
  
  --
  A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
21. Re:OK, then... *WHO* is the official ext3 "moron"? by AigariusDebian · 2009-03-25 02:50 · Score: 1
  
  The best ways to have a person improve are positive and negative stimulation. Working systems are the positive stimulation; fellow programmers commenting on the dumb points of the design are the negative one.
  And you need both at all times, regardless of whichever politically correct view on education is currently floating around.
22. Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · 2009-03-25 03:08 · Score: 3, Insightful
  
  fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.
  The two issues are very closely related, not "an entirely different issue". What the apps want is not "put this data on the disk, NOW", but "put this data on the disk sometime, but do NOT kill the old data until that is done".
  Applications don't want to be sure that the new version is on disk. They want to be sure that SOME version is on disk after a crash. This is exactly what some people can't seem to understand.
  fsync() ensures the first at a huge performance cost. rename() + ext3 ordered gives you the latter. The problem is that ext4 breaks this BECAUSE of the journal ordering. The "consistent state" is broken for application data.
  
  I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
  Yes. But they are assuming this exists and the API is called rename() :)
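  The write-then-rename pattern being argued about can be sketched roughly as follows, in Python. The function name `atomic_replace` and its `durable` flag are made up for this illustration. Whether the rename alone is crash-safe without the fsync() depends on the filesystem and its mount options, which is exactly the point under dispute in this thread.

```python
import os
import tempfile


def atomic_replace(path, data, durable=False):
    """Replace the contents of path so readers see either old or new data.

    Writes to a temporary file in the same directory, then renames it
    over the target. With durable=True the new data is fsync()ed before
    the rename, which is the portable but slower way to get the ordering.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                os.fsync(tmp.fileno())
        # rename(2) on POSIX: atomic as long as both names are on one filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def demo():
    """Replaces a file twice and returns (final contents, directory listing)."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "settings.conf")
        atomic_replace(target, b"version=1\n")
        atomic_replace(target, b"version=2\n", durable=True)
        with open(target, "rb") as f:
            contents = f.read()
        return contents, sorted(os.listdir(d))
```

  The temporary file lives in the same directory as the target on purpose: a rename across filesystems is not atomic.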
23. Re:OK, then... *WHO* is the official ext3 "moron"? by gbjbaanb · 2009-03-25 03:23 · Score: 1
  
  sure, +ve and -ve stimulation are necessary, but you have to consider the amount of over-stimulation in this case. Several people have commented that the fs writer did nothing wrong, that the non-default option is a fast-but-dangerous option if you like living dangerously. Nothing wrong with that if you know you're making those choices (and it was documented, so good for the FS writer).
  Now, if Linus wanted to comment how he thought the option could be dropped, be replaced with something different, or otherwise improved upon then that's fine. If he thought it was a poor choice and wanted to say he thought it wasn't a good thing, that's fine too (the negative stimulation!), but as he shouted out someone for being a moron (when they obviously aren't) this negative stimulation is just an attack.
  That does nothing to improve a person, that solely makes them withdraw and defend themselves from future attacks. That's not productive or useful to society (you see it in the damaged children who damage themselves and others in turn) or the linux community. Would you contribute if you thought Linus was going to criticise you so vocally and publicly, instead of with a reasoned argument, calmly delivered?
24. Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 03:47 · Score: 2, Insightful
  
  Umm... If this was Microsoft's filesystem, we wouldn't be following a conversation between the filesystem developer and the lead kernel developer. And no matter how curious or knowledgeable we were, no one outside of Redmond would know the details.
  We would be privileged to know about any issue at all, and any knowledge of it would be filtered through Microsoft marketing and thousands of paid and unpaid Microsoft apologists (like yourself); the developers themselves would be gagged by NDAs (I'm not even going to talk about the fact that we are all able to customize the kernel, filesystem, and even the applications causing the problems for our own requirements).
  If it were Microsoft's filesystem, we likely wouldn't be having this discussion at all.
25. Re:OK, then... *WHO* is the official ext3 "moron"? by ultranova · 2009-03-25 03:48 · Score: 1
  
  Similar to a parent of two children ranting at them without taking time to think first. Calling them morons is just going to get them growing up to be dysfunctional at best.
  
  See, they're not really children. They're grown men (or women), and should be able to handle being called idiots when they are. The grandparent merely used an analogy to explain why Torvalds didn't refer to them by name: to let them save face.
  Did this clear it up? Feel free to ask if it's still unclear.
  
  No wonder the world has a dim view of the "geek" community.
  
  It doesn't, actually. Just the idiot geeks who write a filesystem which corrupts their files.
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
26. Re:OK, then... *WHO* is the official ext3 "moron"? by Dog-Cow · 2009-03-25 03:54 · Score: 1
  
  Morons like you say that, but you're so obviously wrong you just come across as having negative intelligence.
  Linus is the de facto program manager for a very large project involving thousands of people in hundreds of countries all over the world.
  You may not like his style, but to say that he has the wrong temperament contradicts all the evidence.
27. Re:OK, then... *WHO* is the official ext3 "moron"? by XanC · 2009-03-25 04:11 · Score: 1
  
  I'm often bitten by this mythtv problem too! Can you submit a patch upstream?
28. Re:OK, then... *WHO* is the official ext3 "moron"? by bradkittenbrink · 2009-03-25 04:12 · Score: 2, Funny
  
  Whoah, the first time I scanned that I read: "ext3 was merged to the mainline kernel in 2001. ext3 was created in 2005"
  my reaction was, "Wow, well that makes sense..."
29. Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · 2009-03-25 04:14 · Score: 2, Interesting
  
  ...btrfs is starting from the ground up rather than try to fight those camped on their domains and won't play ball,.... So why don't you stop talking shit, or come up with specific cases to back up your claims.
  Didn't you just do that for me?
  Things like XFS or JFS are badly maintained and supported because they are too complex and were lumped in from other systems. This is a problem if, for example, XFS is the only serious option for really big volumes.
  Reiser3 receives no more improvements, Reiser4 is dead. That doesn't leave much besides ext3. Funnily, ext3 has been catching up in performance just because the other FS are dead. Ok, maybe funny isn't the right word...
  
  Unlike other OSes, Linux has several filesystems to choose from for whatever the users' needs are, and new ones will appear from other proprietary systems at a later date. You think NTFS or HFS+ is any better?
  Choice is fine when all choices are good. When all choices have serious and different issues, that just means effort has been wasted.
  As for NTFS: At least from the application side you know which problems will hit you and which ones not.
30. Re:OK, then... *WHO* is the official ext3 "moron"? by hawk · 2009-03-25 04:27 · Score: 1
  
  >See, they're not really children.
  Gee, when I've referred to FreeBSD as "like Linux for grownups," I was referring to the more conservative and structured development model, but if the cheap shot fits . . . :)
  hawk
31. Re:OK, then... *WHO* is the official ext3 "moron"? by RubberDuckie · 2009-03-25 04:34 · Score: 1
  
  I completely agree; I can think of very few reasons to call someone a 'moron', especially in a public forum. I'll assume that the problem code passed some kind of peer review? If so, we have plenty of folks to 'blame'.
32. Re:OK, then... *WHO* is the official ext3 "moron"? by SpinyNorman · 2009-03-25 04:39 · Score: 1
  
  What the apps want is not "put this data on the disk, NOW", but "put this data on the disk sometime, but do NOT kill the old data until that is done".
  I kind of agree, but only partly. The latter is really what journalling should be providing - making IO operations atomic so they either succeed or, if interrupted, get undone so that the prior consistent state is restored.
  The reason I only partly agree is because I think what most apps really want is transaction support, not just atomic writes - in the simplest case they're the same, but in general a transaction (an atomic application level operation) may consist of a number of file system operations, not just one.
  The reason people want journalling/atomic writes isn't really because they're useful at the application level (you may still be left with an application-level inconsistent file state - a partially committed transaction) but rather because it ensures the integrity of your file system after a crash.
33. Re:OK, then... *WHO* is the official ext3 "moron"? by Big+Boss · 2009-03-25 04:54 · Score: 1
  
  Has that fsync patch been added to mythtv either in .21-fixes or trunk? I've been having a problem very like that and really need to fix it.
34. Re:OK, then... *WHO* is the official ext3 "moron"? by Tetsujin · 2009-03-25 05:40 · Score: 1
  
  Just in case anybody takes the previous AC seriously: That's funny, but not actually true, and probably not trolling in any case.
  Thank you for telling me what to think. I have trouble with that sometimes.
  
  --
  Bow-ties are cool.
35. Re:OK, then... *WHO* is the official ext3 "moron"? by stevied · 2009-03-25 05:42 · Score: 1
  
  I'm pretty sure FFS did this back in the days before journalled filesystems, so "the people who did this" probably did it decades ago.
  Ext2/3 just inherited the behaviour.
36. Re:OK, then... *WHO* is the official ext3 "moron"? by Tetsujin · 2009-03-25 05:45 · Score: 1
  
  Knowing the humor that Linus has, it could be himself.
  That was my first thought:
  "OK, he just said that whoever came up with that idea is a real idiot... So the punchline must be that it's him."
  But I don't actually know if it was him. It'd be nice to know, so I know whether he's being funny and kind of self-deprecating, or if he's being a bit of a jerk. :)
  
  --
  Bow-ties are cool.
37. Re:OK, then... *WHO* is the official ext3 "moron"? by Rich0 · 2009-03-25 05:56 · Score: 1
  
  Not that I'm aware of. This has been discussed on the lists. I suspect the devs consider it a "feature". The buffer sizes can be adjusted to taste - 32MB seemed like a good compromise between memory use and disk thrashing. I suspect you could get by with much less once you get rid of the sync - you can dump lots of data into the cache and as long as the kernel can reorder writes it should be able to keep up.
  However, here is a patch:
  
  --- ThreadedFileWriter.cpp.orig 2009-03-25 13:53:18.113584590 -0400
  +++ ThreadedFileWriter.cpp 2009-03-25 13:53:21.396300186 -0400
  @@ -26,9 +26,9 @@
   #define LOC QString("TFW: ")
   #define LOC_ERR QString("TFW, Error: ")
  -const uint ThreadedFileWriter::TFW_DEF_BUF_SIZE = 2*1024*1024;
  +const uint ThreadedFileWriter::TFW_DEF_BUF_SIZE = 32*1024*1024;
   const uint ThreadedFileWriter::TFW_MAX_WRITE_SIZE = TFW_DEF_BUF_SIZE / 4;
  -const uint ThreadedFileWriter::TFW_MIN_WRITE_SIZE = TFW_DEF_BUF_SIZE / 32;
  +const uint ThreadedFileWriter::TFW_MIN_WRITE_SIZE = TFW_DEF_BUF_SIZE / 128;
   /** \class ThreadedFileWriter
    * \brief This class supports the writing of recordings to disk.
  @@ -340,7 +340,7 @@
   while (!in_dtor)
   {
   bufferSyncWait.wait(written > tfw_min_write_size ? 1000 : 100);
  - Sync();
  +// Sync();
   }
   }
38. Re:OK, then... *WHO* is the official ext3 "moron"? by TheLink · 2009-03-25 05:56 · Score: 4, Funny
  
  Yeah, the metadata was written first, then only ext3 was actually created.
  
  A filesystem that writes the metadata before the actual data, is a "Duke Nukem Forever" Filesystem.
  --
  
  Too many replies beneath your current threshold
39. Re:OK, then... *WHO* is the official ext3 "moron"? by Rich0 · 2009-03-25 05:59 · Score: 1
  
  See this.
40. Re:OK, then... *WHO* is the official ext3 "moron"? by jbolden · 2009-03-25 06:11 · Score: 1
  
  This seems to me exactly what databases do. Make sure there is a consistent and usable version of the data even if the system fails during transaction. More and more I don't see why we don't move to database file systems for applications like the iSeries uses.
41. Re:OK, then... *WHO* is the official ext3 "moron"? by shutdown+-p+now · 2009-03-25 06:16 · Score: 1
  
  More and more I don't see why we don't move to database file systems for applications like the iSeries uses.
  You don't need to move to databases just to get transactions in filesystems. You just need transactions in filesystems :)
42. Re:OK, then... *WHO* is the official ext3 "moron"? by ultranova · 2009-03-25 06:30 · Score: 1
  
  Application developers hence were indirectly educated to not use fsync(), because apparently a filesystem giving anything other than the ext3 ordered mode guarantees is just unreasonable, and ext3 fsync() performance really sucks. (The reason why you don't actually *want* what fsync implies has been explained in the previous ext4 data-loss posts).
  
  I agree. POSIX needs a filesystem transaction API; failing that, it at the very least needs an fbarrier() call which ensures that all modifications made by the thread calling it before it was called are logged to permanent storage before any of the modifications done after it are, but doesn't guarantee or even suggest that any of the modifications are written when fbarrier() returns.
  This would allow application writers to request explicit ordering when needed without being forced to also request a performance-killing cache flush simultaneously, and allow the filesystem to reorder the writes for performance at other times.
  POSIX API is simply incomplete here.
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
43. Re:OK, then... *WHO* is the official ext3 "moron"? by jbolden · 2009-03-25 06:33 · Score: 1
  
  True. And I think that's a good idea too. But once you have transactions why not go all out and get all the other advantages of a database filesystem: terrific metadata, flexible organization, built in shared databases for all apps, system configuration on a database, a single shared engine handling the disk writes (i.e. low level filesystems are database native)....
44. Re:OK, then... *WHO* is the official ext3 "moron"? by Darinbob · 2009-03-25 07:38 · Score: 1
  
  Part of the problem is that this is the high level view. It's trivial to make data consistent, by just locking out all file system activity during an fsync() call. But this can kill performance for many systems. Instead this fsync() has to run in an environment where reads and writes are occurring while the fsync() is busy trying to get things consistent. Some of the solutions suggested involved priorities, which is another mess.
  In my view, the ext3 mess is because there are competing types of applications that want different behavior from the file system. There are those who want it to be as fast as possible, with tolerance for corruption. There are those who want it as data consistent as possible, with tolerance for lower speed. There are those who want it fast and safe (which ain't gonna happen). The home user is not the same as the desktop developer, who is not the same as the back end server user. And ext3 has options to try and support a variety of users here, which means it gets even uglier inside.
  The Linux distributions should default to options for the most data safety (data=ordered) and leave it to the user to explicitly switch to less safe options if they want more speed.
45. Re:OK, then... *WHO* is the official ext3 "moron"? by PRMan · 2009-03-25 07:45 · Score: 1
  
  Yep, we urgently need some kind of killer FS for Linux...
  Try ReiserFS.
  Oh, I thought you said, "killer's FS".
  
  --
  Peter predicted that you would "deliberately forget" creation 2000 years ago...
46. Re:OK, then... *WHO* is the official ext3 "moron"? by gbjbaanb · 2009-03-25 07:50 · Score: 1
  
  its not them who are the idiots, you mindfucked moron. Its you that's frigging stupid, you idiot.
  (I am assuming you're ok with the insults because you say grown men should be able to handle it. Personally, I'm offended by your patronising comment)
  The geek in question didn't write a filesystem that corrupts files, the ordered option is on by default, everything works fine. As you'd realise when you understand how many ext3 filesystems are being used out there for such a long time, but if you had half a brain cell you'd realise this. He did write a different, faster option for the FS which is dangerous to use, as your limited intellect has grasped, but he did document it fully, including the dangers in its use. check the man page, if your mind can cope with putting the necessary letters in the right order in a shell prompt.
  What next, you're going to criticise the author of rm for writing something so system-destroying as it (when using the right options, of course).
  PS. I'm not offended by your post really, enjoy :)
47. Re:OK, then... *WHO* is the official ext3 "moron"? by Darinbob · 2009-03-25 08:01 · Score: 1
  
  This isn't a matter of kernel people having educated others that fsync() isn't necessary. Many application programmers just don't know about fsync or the equivalent for non-posix systems, or what it really does. Your C instruction manuals don't usually discuss it, class rooms don't discuss it, etc. And these programmers have usually grown up on systems where there's a background task flushing things every few seconds so that they never had to learn about this.
  And once they learn about fsync(), then the next step is to learn where and when to use it. That's very application dependent.
48. Re:OK, then... *WHO* is the official ext3 "moron"? by Darinbob · 2009-03-25 08:05 · Score: 1
  
  Yes. But they are assuming this exists and the API is called rename() :)
  But this is non-portable. If you know for sure that you're always on ext3, and it's always mounted ordered, then go for it. Otherwise you can't rely on rename being an ordered operation without knowing system details.
49. Re:OK, then... *WHO* is the official ext3 "moron"? by lgw · 2009-03-25 08:16 · Score: 1
  
  Databases become corrupt in all sorts of ways that file systems don't (and vice versa, obviously). Databases have a variety of scaling issues that file systems don't (and vice versa, obviously). It's far easier to back up, restore, and replicate a file system, stick a file system on a cloud, across a slow WAN, etc. They are different tools for different jobs, and an optionally-transactional file system is still a good file system, but a database is not.
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
50. Re:OK, then... *WHO* is the official ext3 "moron"? by lgw · 2009-03-25 08:20 · Score: 1
  
  If you write data to a drive and the system crashes before you have run an fsync(), then you have no right to be upset that your data isn't on the platters yet. That's what fsync() is for!
  fsync() is for flushing *all* data to disk. That's often the wrong thing to do! If the application just needs to flush its own writes to disk, or even just one specific write, and not incur the HUGE performance hit of fsync(), it shouldn't need to call fsync().
  It's not 1983 any longer. fsync() is not the answer.
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
51. Re:OK, then... *WHO* is the official ext3 "moron"? by mmontour · 2009-03-25 08:51 · Score: 2, Informative
  
  fsync() is for flushing *all* data to disk. That's often the wrong thing to do! If the application just needs to flush its own writes to disk, or even just one specific write, and not incur the HUGE performance hit of fsync(), it shouldn't need to call fsync().
  sync() is for flushing *all* data to disk.
  fsync() and the related fdatasync() operate on a single file descriptor. There is also a finer-grained, non-portable "sync_file_range()" introduced in kernel 2.6.17 (according to the man page).
  fsync() is the correct function call for an application to use when it wants to flush its writes (for a particular fd) to disk. It is unfortunate if the implementation cannot do so without having to also flush unrelated writes to disk, but that's beyond the control of a userspace application.
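  To make the distinction concrete, here is a small Python sketch using the `os` module's thin wrappers around these calls. The helper names `write_and_flush`, `flush_everything`, and `demo` are invented for the example. It assumes a Unix-like system; `os.fdatasync` is not available everywhere, so the code falls back to `os.fsync` when it is missing.

```python
import os
import tempfile


def write_and_flush(path, payload, data_only=False):
    """Write payload to path and flush just that file descriptor to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        if data_only and hasattr(os, "fdatasync"):
            # File data, plus only the metadata needed to read it back.
            os.fdatasync(fd)
        else:
            # File data and metadata for this one descriptor.
            os.fsync(fd)
    finally:
        os.close(fd)
    return os.path.getsize(path)


def flush_everything():
    """Ask the kernel to write out all dirty buffers, system-wide (sync)."""
    os.sync()


def demo():
    """Flushes two files individually and returns their sizes."""
    with tempfile.TemporaryDirectory() as d:
        a = write_and_flush(os.path.join(d, "a.log"), b"hello\n")
        b = write_and_flush(os.path.join(d, "b.log"), b"world!\n", data_only=True)
        return a, b
```

  The API is per descriptor either way. How much unrelated data the filesystem ends up flushing to honour it is an implementation detail the application cannot see from here.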
52. Re:OK, then... *WHO* is the official ext3 "moron"? by ThePhilips · 2009-03-25 09:46 · Score: 1
  
  As for NTFS: At least from the application side you know which problems will hit you and which ones not.
  
  ... yeah and Linux sadly doesn't work this way.
  Because in Linux, problems - given enough community exposure - are getting eventually fixed.
  Hard-coding broken behavior and workarounds into applications on Linux - unlike on Windows and commercial *NIXs - doesn't work.
  
  --
  All hope abandon ye who enter here.
53. Re:OK, then... *WHO* is the official ext3 "moron"? by mvdwege · 2009-03-25 10:15 · Score: 1
  
  If you had paid any attention, you would have noticed that Theodore Ts'o provided patches to make the open/write/rename path safe (i.e. behave the same as in ext3) before he pointed out that relying on ext3 ordered mode is not safe.
  Think before you call others names. Otherwise you only show yourself up as a moron.
  Mart
  
  --
  "I know I will be modded down for this": where's the option '-1, Asking for it'?
54. Re:OK, then... *WHO* is the official ext3 "moron"? by CyberKrb · 2009-03-25 10:19 · Score: 1
  
  And I agree completely regarding transaction support. That would really help.
  After all you guys have got tired of joking about Hans Reiser, there is at least one thing he did foresee: the need for transactional interfaces (semantics in interaction) for Filesystems. It was in Reiser4/5/6's design document, "future vision" at least five years ago.
55. Re:OK, then... *WHO* is the official ext3 "moron"? by loxosceles · 2009-03-25 10:42 · Score: 1
  Isn't it only copy-on-write filesystems like zfs and btrfs that provide the kind of pseudo-transaction behavior you (and most people) are looking for?
  Case study:
  
  1. app repeatedly rewriting .whatever config files by an open() call that truncates and writes, or by explicitly using ftruncate().
  2. app modifying a file's data without changing filesize.
  In 1, it makes perfect sense that there's going to be some point in time during which a crash will result in a 0-byte file. That's what truncate means. The complaints (and I know how frustrating it is to see old config files replaced with useless 0-byte files) seem to want to redo truncate semantics. i.e. keep track of the NEW last-byte and truncate on file close. That can STILL result in corrupt files.
  In 2, a crash/hwfailure during write will result in some new data blocks, and some old.
  Copy on write with sane journaling solves both issues, doesn't it? That's why (AFAIK) everyone seems to recognize that btrfs or something like it is the ultimate goal. Ext4 is just a stopgap.
  I encourage everyone with some spare disk space to get 2.6.29, make a test btrfs partition, and test it with non-critical data. The more people provide feedback to the devs, the faster btrfs will progress.
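  Case 1 is easy to see from userspace. The Python sketch below (the function name `truncate_window_demo` is just for illustration) checks the size the filesystem reports before, during, and after a truncating rewrite. Note that it shows the state visible to other processes, not what has reached the disk; what survives a crash depends on which of these updates had been committed at that moment.

```python
import os
import tempfile


def truncate_window_demo():
    """Returns the file size before, during, and after a truncating rewrite."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.conf")
        with open(path, "wb") as f:
            f.write(b"old settings\n")
        before = os.path.getsize(path)

        # Opening with O_TRUNC discards the old contents immediately.
        fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
        try:
            # This is the window in which the file is 0 bytes long.
            during = os.path.getsize(path)
            os.write(fd, b"new settings, longer\n")
        finally:
            os.close(fd)
        after = os.path.getsize(path)
        return before, during, after
```

  The old contents are gone as soon as the truncating open() returns, so no ordering of the later writes can bring them back. That is why the write-to-a-new-file-then-rename approach is the usual alternative.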
56. Re:OK, then... *WHO* is the official ext3 "moron"? by mortonda · 2009-03-25 13:30 · Score: 2, Interesting
  
  Torvalds exactly knows who it is and most people following the discussion will probably know it, too....
  Yes, Mr. Torvalds is fairly outspoken.
  Yes, and the folks in that conversation are very thick skinned and are used to such statements, it's just the way they communicate. Having Linus call you a moron is nothing. (and he's probably right) ;)
  How many times have I looked at my own code and asked, "What MORON came up with this junk?"
57. Re:OK, then... *WHO* is the official ext3 "moron"? by Hal_Porter · 2009-03-25 18:27 · Score: 1
  
  I'd use Linux if Linus force choked Ted Ts'o to death and then appointed a replacement at random, like Darth Vader did in Star Wars.
  In fact from what I've heard that's how David Cutler would have handled the situation.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
58. Re:OK, then... *WHO* is the official ext3 "moron"? by Hal_Porter · 2009-03-25 18:51 · Score: 1
  
  Come and work with me. The pay is terrible, but the abuse is top notch.
  Maggot.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
59. Re:OK, then... *WHO* is the official ext3 "moron"? by sjames · 2009-03-26 10:15 · Score: 1
  
  For the most part, applications expect their own filesystem writes to happen in the same sequence they are issued (not an unreasonable request at all). The app writes a new file and then renames it over the old one, and that's what it expects to happen. It absolutely does NOT expect the file to actually be renamed over the old one and THEN (not) written. The latter may not technically violate POSIX, but it DOES violate the principle of least astonishment.
I would go further than Linus on this one... by pla · 2009-03-25 00:47 · Score: 4, Insightful

FTA: "if you write your data _first_, you're never going to see corruption at all"

Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.

Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!

Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.
1. Re:I would go further than Linus on this one... by Skuto · 2009-03-25 00:59 · Score: 5, Informative
  
  You are confusing writeback caching with ext3/4's writeback option, which is simply something different.
  The problem with all the ext3/ext4 discussions has been the ORDER in which things get written, not whether they are cached or not. (Hence the existence of an "ordered" mode)
  You want new data written first, and the references to that new data updated later, and most definitely NOT the other way around.
  Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
2. Re:I would go further than Linus on this one... by AlterRNow · 2009-03-25 01:08 · Score: 3, Interesting
  
  Am I right believing that the new data is written elsewhere and then the metadata is updated in place to point to the new data? I don't know much about filesystems..
  
  --
  The disappearing pencil trick. Let me show you it.
3. Re:I would go further than Linus on this one... by Anonymous Coward · 2009-03-25 01:11 · Score: 4, Insightful
  
  Yes! This is the whole point. I am not a filesystem guy either. I don't even know that much about filesystems. But imagine you write a program with some common data storage. Imagine part of that common data is a pointer to some kind of matrix or whatever. Does anybody think it is a good idea to set that pointer first, and then initialize the data later?
  Sure, a really robust program should be able to somehow recover from corrupt data. But that doesn't mean you can just switch your brain off when writing the data.
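  The pointer analogy can be turned into a toy simulation. This is a deliberately simplified Python model with invented names (`run_with_crash`, `outcomes`), and it is not how ext3 or ext4 actually lay out data. It only shows why the order of the two steps matters if a crash can land between them.

```python
def run_with_crash(order, crash_after):
    """Applies the first crash_after steps of an update, then 'reboots'.

    order is "data_first" or "metadata_first". Returns what a reader
    finds by following the pointer after the simulated crash.
    """
    # Toy disk: block 0 holds the old contents, block 1 is unwritten garbage.
    blocks = {0: "old", 1: "garbage"}
    state = {"pointer": 0}

    def write_data():
        blocks[1] = "new"

    def update_pointer():
        state["pointer"] = 1

    steps = {
        "data_first": [write_data, update_pointer],
        "metadata_first": [update_pointer, write_data],
    }[order]

    for step in steps[:crash_after]:
        step()
    return blocks[state["pointer"]]


def outcomes(order):
    """What a reader sees for a crash after 0, 1, or 2 completed steps."""
    return [run_with_crash(order, n) for n in range(3)]
```

  In this model, `outcomes("data_first")` only ever yields the old or the new contents, while `outcomes("metadata_first")` has one crash point where the pointer leads to garbage.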
4. Re:I would go further than Linus on this one... by mysidia · 2009-03-25 01:12 · Score: 4, Interesting
  
  This is a potential problem when you are overwriting existing bytes or removing data.
  In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.
  i.e. You truncated a file to 0 bytes, and wrote the data.
  You started re-using those bytes for a new file that another process is creating.
  Suddenly you are in a state where your metadata on disk is inconsistent, and you crash before that write completes.
  Now you boot back up.. you're ext3, so you only journal metadata, so that's the only thing you can revert, unfortunately, there's really nothing to rollback, since you haven't written any metadata yet.
  Instead of having a 0 byte file, you have a file that appears to be the size it was before you truncated it, but the contents are silently corrupt, and contain other-program-B's data
5. Re:I would go further than Linus on this one... by morgan_greywolf · 2009-03-25 01:29 · Score: 3, Insightful
  
  Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
  It's common sense! Duh. Write data first, pointers to data second. If the system goes down, you're far less likely to lose anything. That's obvious. Those who think this is somehow not obvious don't have the right mentality to be writing kernel code.
  I think the problem is Ted Ts'o has had a slight 'works for me' attitude about it:
  
  All I can tell you is that *I* don't run into them, even when I was
  using ext3 and before I got an SSD in my laptop. I don't understand
  why; maybe because I don't get really nice toys like systems with
  32G's of memory. Or maybe it's because I don't use icecream (whatever
  that is).
  
  --
  My blog
6. Re:I would go further than Linus on this one... by AvitarX · 2009-03-25 01:37 · Score: 2, Informative
  
  It is by default, using the ordered journal type in Ext3.
  It is not an option yet in Ext4, and for now may not be the default, but an option to be set at mount time.
  Currently in Ext4, the metadata in the journal is updated first, then the data is written.
  When software assumes that it can send commands and have them take place in the order sent, this becomes problematic: without costly immediate writes there is a risk of losing very, very old data, because the file's metadata gets updated before the data has been written to the new place.
  
  --
  Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
7. Re:I would go further than Linus on this one... by Chrisq · 2009-03-25 01:41 · Score: 1
  
  Is that you Linus?
8. Re:I would go further than Linus on this one... by hey · 2009-03-25 01:42 · Score: 2, Funny
  
  Well, it's not ironic. It would be ironic if the ext3/4 authors lost their code in a crash because of the order that the data was written.
9. Re:I would go further than Linus on this one... by Hatta · 2009-03-25 01:55 · Score: 3, Insightful
  
  In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.
  i.e. You truncated a file to 0 bytes, and wrote the data.
  Why on earth would you do that? Write the new data, update the metadata, THEN remove the old file.
  
  --
  Give me Classic Slashdot or give me death!
10. Re:I would go further than Linus on this one... by Spazmania · 2009-03-25 02:15 · Score: 4, Informative
  
  Here's what Linus had to say, and I think he hit the nail on the head:
  The point is, if you write your metadata earlier (say, every 5 sec) and
  the real data later (say, every 30 sec), you're actually MORE LIKELY to
  see corrupt files than if you try to write them together.
  And if you write your data _first_, you're never going to see corruption
  at all.
  This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
  literally does everything the wrong way around - writing data later than
  the metadata that points to it. Whoever came up with that solution was a
  moron. No ifs, buts, or maybes about it.
  
  --
  Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
11. Re:I would go further than Linus on this one... by Logic+and+Reason · 2009-03-25 02:26 · Score: 1
  Why can't the filesystem just update data and metadata in given order for a particular file? For example, if you truncate a file and then write to it, the following should happen:
  
  Metadata for `foo' is updated (length=0)
  New data for `foo' is written elsewhere
  Metadata for `foo' is updated (contents=new_data)
  If, on the other hand, you're doing the create-write-close-rename trick to get an "atomic file replace", then the following should happen:
  
  Metadata for `foo.new' is created (length=0)
  New data for `foo.new' is written elsewhere
  Metadata for `foo.new' is updated (contents=new_data)
  Metadata for `foo.new' is updated (filename=foo), replacing old `foo'
  It seems like in both cases, ensuring that data and metadata are written in given order for a particular file would solve the problem, without imposing any performance penalties on I/O operations going on for other files. I assume I'm missing something-- does all metadata need to be written in order with respect to all other metadata or something?
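  For reference, the create-write-close-rename trick described above looks roughly like this from user space. This is only a sketch, assuming a POSIX system; the helper name atomic_replace is made up for illustration, and the fsync calls are what an application adds when it cannot rely on the filesystem to order data before the rename.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace `path` with `data` so a reader sees the old or the new contents, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push the data to the disk
        os.replace(tmp_path, path)   # rename over the old name
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory as well so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

  Note that mkstemp creates the file with restrictive permissions, so real code would probably copy the mode of the file being replaced.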
12. Re:I would go further than Linus on this one... by Spazmania · 2009-03-25 02:30 · Score: 1
  
  It's also an easily solved problem:
  After a truncate(), you lock the deleted blocks against a write until after you've written the updated metadata for the file. Until then, anything you write to the file will have to be allocated elsewhere on the disk. But then that's part of what the reserve slack is for: to increase the probability that there is somewhere else on the disk that you can write it.
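  A toy illustration of that idea, not how any real filesystem allocator is written: blocks released by a truncate go into a pending set and only become allocatable once the metadata commit has happened. The class and method names are invented for this sketch.

```python
class DeferredFreeAllocator:
    """Toy block allocator that will not reuse freed blocks until metadata is committed."""

    def __init__(self, n_blocks):
        self.free = set(range(n_blocks))  # blocks that may be handed out right now
        self.pending = set()              # freed, but the metadata saying so is not on disk yet

    def allocate(self):
        if not self.free:
            raise OSError("no free blocks outside the pending set")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block):
        # A truncate lands here: the block is locked against reuse for now.
        self.pending.add(block)

    def commit_metadata(self):
        # Once the updated metadata is durable, the old blocks are fair game.
        self.free |= self.pending
        self.pending.clear()


alloc = DeferredFreeAllocator(2)
first = alloc.allocate()      # block 0
alloc.release(first)          # truncate: block 0 is pending, not free
second = alloc.allocate()     # block 1, never block 0
alloc.commit_metadata()
third = alloc.allocate()      # block 0 is reusable only now
print(first, second, third)   # 0 1 0
```

  The cost is visible in the example: with only two blocks, a third allocation before the commit would fail, which is where the reserve slack comes in.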
  
  --
  Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
13. Re:I would go further than Linus on this one... by marcosdumay · 2009-03-25 02:50 · Score: 1
  
  To be fair, in volatile memory, unless you have multi-threading and want non-blocking semantics (as if anybody actually did that), it makes no difference.
  
  --
  Rethinking email
14. Re:I would go further than Linus on this one... by AlterRNow · 2009-03-25 02:53 · Score: 1
  
  Yes, my initial comment was to ask whether the writing was to different blocks ( free? ) and not over-writing the old blocks ( which to me sounds very, very bad ).
  Is this what happens?
  1) Write new data to free blocks
  2) Update metadata to point to newly written blocks
  3) Mark old blocks as free
  And I guess with ext4 it is 2, 1, 3?
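  The difference between the two orderings can be checked with a toy crash model. This is a sketch under heavy assumptions (one file, one pointer, a crash between any two steps); the step names are invented here and it says nothing about what ext3 or ext4 actually do internally.

```python
def run_until_crash(steps, crash_after):
    """Apply the first `crash_after` steps to a toy disk and return what is left on it."""
    disk = {
        "blocks": {"old": "old data"},  # block name -> contents actually on disk
        "pointer": "old",               # metadata: which block the file points at
    }
    for step in steps[:crash_after]:
        if step == "write_data":
            disk["blocks"]["new"] = "new data"
        elif step == "update_metadata":
            disk["pointer"] = "new"
        elif step == "free_old":
            disk["blocks"].pop("old", None)
    return disk


def file_is_readable(disk):
    """The file is intact if its metadata points at a block that really holds data."""
    return disk["pointer"] in disk["blocks"]


def crash_points_with_damage(steps):
    """List every crash point (number of completed steps) that leaves a broken file."""
    return [n for n in range(len(steps) + 1)
            if not file_is_readable(run_until_crash(steps, n))]


data_first = ["write_data", "update_metadata", "free_old"]      # 1, 2, 3
metadata_first = ["update_metadata", "write_data", "free_old"]  # 2, 1, 3

print(crash_points_with_damage(data_first))      # []
print(crash_points_with_damage(metadata_first))  # [1]
```

  With data first there is no crash point at which the pointer refers to unwritten blocks; with metadata first there is a window right after step 2.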
  
  --
  The disappearing pencil trick. Let me show you it.
15. Re:I would go further than Linus on this one... by Spazmania · 2009-03-25 02:57 · Score: 1
  
  Ext4 is more like {2,3},1
  
  --
  Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
16. Re:I would go further than Linus on this one... by Rich0 · 2009-03-25 03:06 · Score: 2, Informative
  
  This is more of a response to the 5 other replies to this comment - but rather than post it 5 times I'll just stick it here...
  What everybody else has proposed is the obvious solution, which is essentially copy-on-write. When you modify a block, you write a new block and then deallocate the old block. This is the way ZFS works, and it will also be used in btrfs. Aside from the obvious reliability improvement, it also can allow better optimization in RAID-5 configurations, as if you always flush an entire stripe you don't need to do a read-before-write to update the checksum data. The algorithm is also very amenable to snapshotting - you just hold off on deallocating the old blocks. In fact, snapshots perform better than normal writes since there are fewer steps (of course you do waste disk space - but you usually don't keep snapshots around forever).
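  A toy version of the write-new-then-deallocate scheme, with the snapshot trick of simply holding off on the deallocation. This is a sketch for illustration only; the CowStore class is invented here and is not how ZFS or btrfs lay things out.

```python
class CowStore:
    """Toy copy-on-write store: a file is a list of block ids, and blocks are never modified in place."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.next_id = 0
        self.live = []        # block ids that make up the current file
        self.snapshots = {}   # snapshot name -> list of block ids

    def _allocate(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def _release(self, block_id):
        # A block still referenced by a snapshot is kept; otherwise it is freed.
        if not any(block_id in ids for ids in self.snapshots.values()):
            del self.blocks[block_id]

    def append(self, data):
        self.live = self.live + [self._allocate(data)]

    def write(self, index, data):
        new_id = self._allocate(data)   # 1. write the new block somewhere else
        old_id = self.live[index]
        new_map = list(self.live)
        new_map[index] = new_id
        self.live = new_map             # 2. switch the pointer
        self._release(old_id)           # 3. deallocate the old block (unless snapshotted)

    def snapshot(self, name):
        self.snapshots[name] = list(self.live)

    def read(self, ids=None):
        ids = self.live if ids is None else ids
        return b"".join(self.blocks[i] for i in ids)


store = CowStore()
store.append(b"hello ")
store.append(b"world")
store.snapshot("before")
store.write(1, b"there")
print(store.read())                            # b'hello there'
print(store.read(store.snapshots["before"]))   # b'hello world'
```

  Taking the snapshot costs one list copy, and a later write does strictly less work because step 3 is skipped for the snapshotted block.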
17. Re:I would go further than Linus on this one... by zaaj · 2009-03-25 03:23 · Score: 1
  
  Isn't the metadata where the filesystem would keep track of what blocks are available to be re-used? How could another process start using those data blocks if you haven't updated the meta-data to record them as available for re-use? For truncation, doesn't the truncation happen by writing the metadata?
18. Re:I would go further than Linus on this one... by Hatta · 2009-03-25 03:53 · Score: 1
  
  What everybody else has proposed is the obvious solution, which is essentially copy-on-write.
  It IS obvious isn't it. I'm shocked it hasn't worked this way all along. Is there a reason this hasn't been standard practice since the 70s?
  
  --
  Give me Classic Slashdot or give me death!
19. Re:I would go further than Linus on this one... by euxneks · 2009-03-25 04:27 · Score: 1
  
  Why even remove the old data? Why not mark it as old and maintain snapshots like ZFS? Reallocate the "old" data when you need it, but maintain old data for backups...?
  
  --
  in girum imus nocte et consumimur igni
20. Re:I would go further than Linus on this one... by Cassini2 · 2009-03-25 04:32 · Score: 4, Insightful
  
  When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.
  Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
  - Every UNIX machine has a UPS.
  - Production servers run UNIX. What's this Linux you are talking about?
  - Disk space is expensive. No one will pay for unused disk space.
  - RAM is expensive. As such, it can be quickly flushed to disk.
  - No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
  Times have changed. Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the marketplace. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.
  Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers: applications that don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.
  Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.
21. Re:I would go further than Linus on this one... by raddan · 2009-03-25 04:42 · Score: 1
  
  It makes a filesystem more complicated, because now you need additional space to do the copy-on-write. In the past, there was a reasonable expectation that a user might keep a disk near full, but now that storage is cheap, that expectation is no longer valid. Also, making this fast is non-trivial. You probably want your write routine to write to the first available (and acceptable) blocks it finds, but now you need to go back to your original inode and tell it where all of the chunks of the file are. Reading that file back now could be very slow, because your file is no longer/less contiguous. So you want to be very careful about where you put things.
  
  Most filesystems are a balance of things done for data integrity and things done for speed. Copy-on-write was probably not considered to be worth the effort until people started to realize how important their data was...
22. Re:I would go further than Linus on this one... by raddan · 2009-03-25 04:46 · Score: 1
  
  "metadata" is a bit of a catch-all, but generally filesystem designers consider "metadata" to be the on-disk structures that tell the OS where to find files. These structures sometimes contain information about the file, but what's more important is that it tells the OS where it is in relation to a directory. The free list (what you're talking about) is also metadata. The interesting thing is that the free list itself is sometimes actually stored in the free blocks, which means that when you use a block, you take it out of the free list, so your storage cost for the list is essentially nothing.
23. Re:I would go further than Linus on this one... by Anonymous Coward · 2009-03-25 05:19 · Score: 1, Insightful
  
  The situation you describe doesn't occur with a journaled filesystem. The journal does not rollback, it is a to-do list. The metadata update is added to the journal first, so even if the data is written before the actual metadata update, the metadata update is not lost. After the crash, the journal ensures that the new metadata becomes the current state of the filesystem.
  The interesting case (the one which triggered this whole discussion) is when the metadata update is performed without the corresponding data update. This happens when data is not journaled and the filesystem doesn't ensure that metadata updates related to unwritten data are discarded. The described behavior is more likely in Ext4 because of the longer data write delay, but it exists just the same in Ext3.
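  The to-do-list behavior can be sketched as a redo log. This is a toy model with an invented record format, not the actual ext3 journal layout: committed metadata updates are reapplied after a crash, uncommitted ones are dropped, and nothing in the journal says whether the file data ever reached the disk.

```python
def replay(journal, metadata):
    """Redo every committed journal entry; drop entries that never got a commit record."""
    recovered = dict(metadata)
    pending = {}  # transaction id -> list of (key, value) updates seen so far
    for record in journal:
        kind = record[0]
        if kind == "update":
            _, txn, key, value = record
            pending.setdefault(txn, []).append((key, value))
        elif kind == "commit":
            _, txn = record
            for key, value in pending.pop(txn, []):
                recovered[key] = value
    return recovered


on_disk_metadata = {"foo.size": 10}
journal = [
    ("update", 1, "foo.size", 0),     # truncate foo
    ("commit", 1),
    ("update", 2, "foo.size", 4096),  # new length, but the crash hit before the commit
]
print(replay(journal, on_disk_metadata))  # {'foo.size': 0}
```

  After replay the truncate is in effect and the new length is not, which is the zero-length-file outcome when data is not journaled.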
24. Re:I would go further than Linus on this one... by stevied · 2009-03-25 05:49 · Score: 1
  
  I think the stock answers here are still appropriate:
  If you want a database, you know where to get one.
  If you want orthogonal persistence, you know where to get it.
  There might be an argument for new e.g. Ubuntu installs to give the user the option of separate partitions, using e.g. ext3 and data=journal for /home, and ext4 for /media/media, which could be used for storing large audio / video files (which will benefit nicely from the delayed allocation, and are less critical if they get lost on an unclean shutdown.) LVM might be an idea to allow easy resizing.
25. Re:I would go further than Linus on this one... by Rich0 · 2009-03-25 06:08 · Score: 1
  
  Wouldn't that only be an issue if the data is written to space owned by a file? If you write to unallocated space then no user should be able to access it unless they have direct access to the underlying device (which of course bypasses all security of any kind anyway).
  If you create a new file that happens to use the same allocated space the OS will wipe the block before allowing it to be read. Actually, I'm guessing that the blocks won't even be allocated until they are actually written to, and if you order the data write first then the sensitive data will be overwritten before anybody could read the file.
26. Re:I would go further than Linus on this one... by MikeBabcock · 2009-03-25 07:07 · Score: 1
  
  Lots of people do it and it's stupid. Write, rename, flush should be required learning in grade-school.
  
  --
  - Michael T. Babcock (Yes, I blog)
27. Re:I would go further than Linus on this one... by Spazmania · 2009-03-25 07:43 · Score: 1
  
  That's only if you re-use deleted blocks before the meta-data deleting them has been committed. That's also an error, but it isn't circular: Writing the data before writing the metadata is not mutually exclusive with writing the metadata before reusing deleted blocks.
  
  --
  Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
28. Re:I would go further than Linus on this one... by kylemonger · 2009-03-25 08:01 · Score: 1
  
  Now you boot back up.. you're ext3, so you only journal metadata, so that's the only thing you can revert, unfortunately, there's really nothing to rollback, since you haven't written any metadata yet. Instead of having a 0 byte file, you have a file that appears to be the size it was before you truncated it, but the contents are silently corrupt, and contain other-program-B's data
  
  If this can happen, then a big data security hole (privacy) has been introduced into the system.
29. Re:I would go further than Linus on this one... by Darinbob · 2009-03-25 08:23 · Score: 1
  
  And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!
  This is also silently done in hardware... Storage devices usually have caching, and they don't all have ways to explicitly flush the cache, and they write from the cache in a different order than you put data into it. I've worked with USB hard drive enclosures that did not support the SCSI synchronize command (ugh).
30. Re:I would go further than Linus on this one... by ultranova · 2009-03-25 08:37 · Score: 1
  
  When you modify a block, you write a new block and then deallocate the old block. This is the way ZFS works, and it will also be used in btrfs. Aside from the obvious reliability improvement, it also can allow better optimization in RAID-5 configurations, as if you always flush an entire stripe you don't need to do a read-before-write to update the checksum data.
  
  What happens if I don't have a RAID-5 configuration? It seems to me that this is a sure way to get the file fragmented beyond belief.
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
31. Re:I would go further than Linus on this one... by mysidia · 2009-03-25 11:31 · Score: 1
  
  This is an unacceptable practice for large files. For example, a 600MB file on a 2GB volume, and maybe there is at most 100MB of spare space available.
  You accidentally copied the wrong ISO to the folder, so you truncate the bad one to zero bytes, and begin overwriting with new data. The copy finishes (i.e. as far as you can tell, the data is copied).
  But suppose your filesystem updates metadata last... *bam* your system crashes after the data got written, but 10 seconds [or so] haven't passed yet, so your metadata isn't written.
  Another process was also writing a lot of data, and after you truncated your file to 0 bytes, some of the blocks that used to belong to your file got allocated for their writes.
  The system boots up with metadata in the state it was sometime in the middle of your copy operation.
  The other file being written now has some metadata that is totally meaningless.
  Both files have some unrecoverable corruption.
32. Re:I would go further than Linus on this one... by marcosdumay · 2009-03-26 02:29 · Score: 1
  
  I don't code that way. Well, no programming language that I'm aware of will let me code that way unless I start making system calls by hand, even then, I would need to outsmart the OS :)
  But it makes no difference assuming it is bug-free.
  
  --
  Rethinking email
33. Re:I would go further than Linus on this one... by ConceptJunkie · 2009-03-26 15:53 · Score: 1
  
  It doesn't take a filesystem expert or kernel hacker to realize that will never work without the crucial "???" step leading to "Profit!".
  
  --
  You are in a maze of twisty little passages, all alike.
Safest mkfs/mount options? by Per+Wigren · 2009-03-25 00:58 · Score: 3, Interesting

If I were to set up a new home spare-part-server using software RAID-5 and LVM today, using kernel 2.6.28 or 2.6.29 and I really care about not losing important data in case of a power outage or system crash but still want reasonable performance (not run with -o sync), what would be my best choice of filesystem (EXT4 or XFS), mkfs and mount options?

--
My other account has a 3-digit UID.
1. Re:Safest mkfs/mount options? by mysidia · 2009-03-25 01:15 · Score: 1, Offtopic
  
  ISO9660 for most filesystems. (i.e. read-only)
  EXT3.
  EXT4 is bleeding edge.
  I wouldn't recommend XFS unless you have NVRAM-backed storage.
2. Re:Safest mkfs/mount options? by AvitarX · 2009-03-25 01:29 · Score: 2, Interesting
  
  Ext3 with an ordered (default) style journal.
  I believe XFS has a similar option, and Ext4 will with the next kernel, but for a home type system Ext3 should meet all of your needs, and Linux utilities still know it best.
  Of course you should probably use RAID-10 too, with data disk space so cheap it is well worth it. Using the "far" disk layout, you get very fast reads, and though it penalizes writes (vs RAID 0) in theory, the benchmarks I have seen show that penalty to be smaller than the theory.
  As for mkfs, large inodes probably, and when mounting use noatime.
  For some anti-RAID-5 propaganda:
  http://www.baarf.com/
  
  --
  Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
3. Re:Safest mkfs/mount options? by remmelt · 2009-03-25 01:36 · Score: 4, Informative
  
  You could also look into Sun's RAID-z:
  http://en.wikipedia.org/wiki/Non-standard_RAID_levels#RAID-Z
4. Re:Safest mkfs/mount options? by Blackknight · 2009-03-25 01:37 · Score: 3, Insightful
  
  Solaris 10 with ZFS, if you actually care about your data.
5. Re:Safest mkfs/mount options? by larry+bagina · 2009-03-25 01:41 · Score: 3, Informative
  
  with lvm, you can easily try out the various file systems (don't forget jfs!). Personally, I've found linux XFS to corrupt itself beyond repair, so I use ext3.
  
  --
  Do you even lift?
  These aren't the 'roids you're looking for.
6. Re:Safest mkfs/mount options? by Anonymous Coward · 2009-03-25 01:44 · Score: 1, Insightful
  
  JFS
7. Re:Safest mkfs/mount options? by JayAEU · 2009-03-25 01:50 · Score: 1
  
  If I recall correctly, BtrFS also does checksumming of individual files and has become available in the latest kernel as well, so it's easier to use with Linux.
  I wouldn't use it on a server just yet, since there might still be some changes to the on-disk format.
8. Re:Safest mkfs/mount options? by mmontour · 2009-03-25 01:53 · Score: 4, Informative
  
  My advice:
  - Make regular backups; you'll need them eventually. Keep some off-site.
  - ext3 filesystem, default "data=ordered" journal
  - Disable the on-drive write-cache with 'hdparm'
  - "dirsync" mount option
  - Consider a "relatime" or "noatime" mount option to increase performance (depending on whether or not you use applications that care about atime)
  - If you don't want the performance hit from disabling the on-drive write-cache, add a UPS and set up software to shut down your system cleanly when the power fails. You are still vulnerable to power-supply failures etc. even if you have a UPS.
  - Schedule regular "smartctl" scans to detect low-level drive failures
  - Schedule regular RAID parity checks (triggered through a "/sys/.../sync_action" node) to look for inconsistencies. I have a software-RAID1 mirror and I've found problems here a few times (one of which was that 'grub' had written to only one of the disks of the md device for my /boot partition).
  - Periodically compare the current filesystem contents against one of your old backups. Make sure that the only files that are different are ones that you expected to be different.
  If you decide to use ext4 or XFS most of the above points will still apply. I don't have any experience with ext4 yet so I can't say how well it compares to ext3 in terms of data-preservation.
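  The last check in the list (comparing the live filesystem against an old backup) is easy to script. A minimal sketch, assuming both trees are mounted and readable; the function names are made up here, and symlinks and special files are simply skipped.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(live_root, backup_root):
    """Return (changed, missing_from_live, added_since_backup) as sorted path lists."""
    live = tree_digests(live_root)
    backup = tree_digests(backup_root)
    changed = sorted(p for p in live.keys() & backup.keys() if live[p] != backup[p])
    missing = sorted(backup.keys() - live.keys())
    added = sorted(live.keys() - backup.keys())
    return changed, missing, added
```

  Anything in the changed list that you did not knowingly edit since the backup is worth a closer look.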
9. Re:Safest mkfs/mount options? by mikeee · 2009-03-25 02:51 · Score: 1
  
  Why not try "-o sync"?
  Honestly, if it's a spare-part-server running on a typical home LAN, and is read-mostly, odds are reasonable you won't notice the difference.
  If it is too slow, then you can always go back and screw around with this other nonsense.
10. Re:Safest mkfs/mount options? by swilver · 2009-03-25 03:31 · Score: 1
  
  Ext3 in (the default) ordered mode will do fine. You will only lose data that was written at the moment of the crash, but you will never run into odd inconsistencies where meta-data says one thing, and the actual data says something else.
  You may also want to turn off write caching on all hard disks.
11. Re:Safest mkfs/mount options? by Skuto · 2009-03-25 04:17 · Score: 1
  
  - Disable the on-drive write-cache with 'hdparm'
  Even better: use barriers (not enabled by default on ext3, not sure about ext4).
12. Re:Safest mkfs/mount options? by mmontour · 2009-03-25 05:03 · Score: 1
  
  Even better: use barriers (not enabled by default on ext3, not sure about ext4).
  Won't work through LVM or the software-RAID layer, unless this has changed recently.
13. Re:Safest mkfs/mount options? by Per+Wigren · 2009-03-25 05:47 · Score: 1
  
  Nope, it hasn't changed yet, but they are working on it.
  
  --
  My other account has a 3-digit UID.
14. Re:Safest mkfs/mount options? by stevied · 2009-03-25 05:52 · Score: 1
  
  Honestly, a cheap UPS might not be a bad idea. Then all this goes away (hopefully.)
  For really, really critical data, -o sync, data=journal or an actual grown up database is probably the best option. Can you segregate different types of data across different partitions? Important but relatively small files on one, large audio and video on another (where you win by using extents and delayed allocation) ..
15. Re:Safest mkfs/mount options? by drinkypoo · 2009-03-25 05:54 · Score: 1
  
  Personally, ext3 is one of the few filesystems with which I've lost data (never trusted Reiser due to all the other people losing data) so I use XFS for my long-term data storage. Zero problems so far, and I've used it for root filesystems (you need a /boot with some other fs for speed, but that's it - you can do root on XFS but grub will take FOREVER to boot) and for high throughputs. Your mileage may vary! Make backups on a different filesystem, just in case.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
16. Re:Safest mkfs/mount options? by Blackknight · 2009-03-25 08:33 · Score: 1
  
  I have 2 GB in my server and ZFS works fine. Besides, RAM is cheap now, I just put 24 GB in a server yesterday.
17. Re:Safest mkfs/mount options? by lewiscr · 2009-03-25 09:15 · Score: 1
  
  Veritas Storage Foundation Basic
  Free (beer) closed source volume manager and Filesystem. High performance and High reliability. The Enterprise version saved my ass more times than I can count, and I can count some extremely unlikely scenarios.
  The only drawback (the reason I don't run it at home) is the low limits:
  
  This free version is limited to 4 user-data volumes, and/or 4 user-data file systems, and/or 2 processor sockets in a single physical system.
18. Re:Safest mkfs/mount options? by Blackknight · 2009-03-25 12:53 · Score: 1
  
  If you're trying to say Linux never kernel panics you're dead wrong, we reboot servers all day long that have stopped responding or panicked for some reason. At least when Solaris panics it generates a crash dump so you can see WHY it crashed, on Linux servers I'm lucky if the console gives a clue.
19. Re:Safest mkfs/mount options? by evilviper · 2009-03-25 17:10 · Score: 1
  
  Solaris 10 with ZFS, if you actually care about your data.
  ...and don't care that you will still NEVER be able to avoid ZFS PANICS, even with endless tuning and unlimited RAM.
  ZFS is a great idea... It's a shame it isn't stable (on OpenSolaris or FreeBSD).
  
  --
  Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
20. Re:Safest mkfs/mount options? by Blackknight · 2009-03-26 00:11 · Score: 1
  
  I have yet to see a Solaris box crash, and it's not like Linux never crashes either. Just ask our monitoring department about that...
21. Re:Safest mkfs/mount options? by greg1104 · 2009-03-26 19:27 · Score: 1
  
  You are still vulnerable to power-supply failures etc. even if you have a UPS.
  And to UPS failures. The batteries in those typically die after some number of years. I guess you can then add "monitor the UPS battery quality" to the list of stuff to do, that might get you another fraction of a percent improvement in reliability here.
  I just buy a good disk controller with a battery-backed write cache when I care this much about getting reliable write caching.
Geez... by hesaigo999ca · 2009-03-25 01:06 · Score: 2, Funny

Tell us what you really think there Linus.
~I went home today knowing I made someone cry!~
misspelling by destiney · 2009-03-25 01:18 · Score: 1

Andi Kleen, the l is missing.
Re:Linus by Anonymous Coward · 2009-03-25 01:28 · Score: 2, Funny

I think he's sad because he never got that job at Microsoft he always wanted.
Maybe only a hug from Bill Gates would solve his problem.
Um. This doesn't make sense. by Colin+Smith · 2009-03-25 01:41 · Score: 4, Insightful

Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.
from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files will
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.
So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
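To see which mode a mounted filesystem was given, one option is to look for a data= entry in the mount options. A small sketch that parses text in the /proc/mounts format (device, mount point, type, options, two numbers); the function name is invented here, and the assumption that an absent data= option means the kernel default is mine, since not every kernel lists the default explicitly.

```python
def journal_modes(mounts_text):
    """Map mount point -> data= mode for ext3/ext4 lines in /proc/mounts-style text."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("ext3", "ext4"):
            continue
        mount_point = fields[1]
        mode = "not listed (kernel default)"
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
        modes[mount_point] = mode
    return modes


# Made-up sample; on a real system pass in the contents of /proc/mounts instead.
sample = (
    "/dev/sda1 / ext3 rw,noatime,data=writeback 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime 0 0\n"
    "proc /proc proc rw 0 0\n"
)
print(journal_modes(sample))
```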

--
Deleted
1. Re:Um. This doesn't make sense. by Skuto · 2009-03-25 02:33 · Score: 2, Insightful
  
  So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
  The thread starts with someone having serious performance problems exactly because ext3 ordered mode is so slow in some circumstances...
  Like when you fsync().
2. Re:Um. This doesn't make sense. by MikeBabcock · 2009-03-25 07:05 · Score: 1
  
  Maybe because the hard drive can't write 12GB/s like RAM can?
  There's a reason we buffer writes to disk and commit them in chunks when necessary -- disks are SLOW. Linux does a lot of caching, and it creates a huge performance benefit, but it requires not having power faults or kernel panics causing a reboot before that data is flushed.
  Flush your buffers if you care about them.
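  In Python, flushing your buffers takes two steps, because there are two layers of caching between the program and the platter. A minimal sketch; the function name durable_append is made up, and it does not cover the drive's own write cache.

```python
import os


def durable_append(path, line):
    """Append one line and do not return until the kernel has been asked to write it out."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
        handle.flush()               # user-space buffer -> kernel page cache
        os.fsync(handle.fileno())    # kernel page cache -> storage device
```

  Calling this for every log line is exactly the pattern that gets slow when each fsync has to wait for other data to be written first, so batching several lines per fsync is the usual compromise.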
  
  --
  - Michael T. Babcock (Yes, I blog)
3. Re:Um. This doesn't make sense. by trentfoley · 2009-03-25 07:59 · Score: 1
  
  After a crash, files will may contain stale data blocks from old files
  Apparently, the author had a crash while composing that sentence
Re:Linus by Andr+T. · 2009-03-25 01:49 · Score: 1

Sometimes I get the impression that Linus says things the way he does because the other 'powerful' guys who are really important and active in the Linux community don't say anything, or even agree with him, when he talks like that. I remember a similar episode some time ago when a guy wanted to port GIT to C++ or something like that. I think he cried.
I can't imagine a reason to be this rude.

--
Any life is made up of a single moment, the moment in which a man finds out, once and for all, who he is.
Except ordered data mode is the (slower) default by Colin+Smith · 2009-03-25 01:57 · Score: 1

Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.
I think Ted Ts'o etc are probably perfectly aware of how it works.
Frankly I think Linus is trolling.

--
Deleted
ZFS by chudnall · 2009-03-25 01:57 · Score: 4, Informative

Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of Linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots and a fast in-kernel CIFS server fully integrated with ZFS ACLs (and with support for NTFS-style mixed case naming) is just icing on the cake. Now if only I could have an OpenSolaris desktop with all the nice Linux userland apps available. Oh wait, I can!
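The point about checksumming can be shown with a toy mirror. This is only an illustration of the idea, with an invented class and CRC32 standing in for whatever a real filesystem uses: a plain mirror cannot tell which copy is the bad one, while a mirror with a separately stored checksum can pick the good copy and repair the other.

```python
import zlib


class ChecksummedMirror:
    """Toy two-way mirror that keeps a checksum per block, separate from the data copies."""

    def __init__(self):
        self.copies = [{}, {}]   # two "disks": key -> bytes
        self.checksums = {}      # key -> CRC32 recorded at write time

    def write(self, key, data):
        self.checksums[key] = zlib.crc32(data)
        for copy in self.copies:
            copy[key] = data

    def read(self, key):
        for copy in self.copies:
            data = copy[key]
            if zlib.crc32(data) == self.checksums[key]:
                # Found a copy that matches its checksum: repair the others from it.
                for other in self.copies:
                    other[key] = data
                return data
        raise OSError("every copy of %r fails its checksum" % (key,))


mirror = ChecksummedMirror()
mirror.write("block-7", b"payroll data")
mirror.copies[0]["block-7"] = b"payrol1 data"   # silent corruption on one disk
print(mirror.read("block-7"))                   # b'payroll data'
```

Without the checksum, a read that happened to hit the first disk would return the corrupted bytes with no error at all.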

--
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.
1. Re:ZFS by Anonymous Coward · 2009-03-25 02:48 · Score: 2, Funny
  
  Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel.
  
  You must have missed Linus's memo:
  To: Samuel J. Palmisano, IBM Business Guy
  From: Linus Torvalds, Super Genius
  Dear Sam:
  As you know, I've been trying to get a decent file system into Linux for a while. Let's face it, none of these johnny-come-lately open-source arseholes can write a file system to save their life; the last one to have a chance was Reiser, and I really don't want him hanging around here even if we can spring him; he creeps me out. And your guys are no better. JFS? It is to laugh. Sun has one called ZFS, but they are being utter dicks about licensing it. Since licensing seems to be more in your purview than mine, I thought you might be able to help me out. I can't help but notice that in this recession, IBM is doing relatively good and Sun's stock is in the crapper. Perhaps the easiest thing to do would be to just pick up the whole company. Just a thought
  -- Linus
2. Re:ZFS by mikeee · 2009-03-25 02:58 · Score: 3, Interesting
  
  It's similar (at least, a lot more similar than any other Linux filesystem), but less mature.
  In defense of the LK team on the whole ZFS issue, I understand that part of the reason they didn't pursue some ZFS-like features years ago was because of patents. Now that SUN has open-sourced (though not in a GPL-compatible way) ZFS and is defending that against Network Appliance in a lawsuit, the way looks a lot clearer for Btrfs and company to proceed.
  Actually, on that thought, the IBM acquisition of SUN should get NetApp to drop that lawsuit. Going up against SUN in a MAD patent dispute is a bit risky, but (as SCO discovered) aggressive IP lawsuits against IBM come in right behind "land war in Asia".
3. Re:ZFS by Ash-Fox · 2009-03-25 03:03 · Score: 1
  
  Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel.
  I imagine you find FUSE is too slow then? Or is it the fact that Linux can't boot from ZFS?
  
  Right now, ZFS on OpenSolaris is simply wonderful, and this is what I am deploying for file service at all my customer sites now.
  I have also seen people deploying Windows Vista workstations as web servers for online stores; deployment at client sites doesn't mean much, honestly.
  
  The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap.
  I like LVM with snapshots for this reason.
  
  I've replaced a couple of linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier.
  Customers just want it to work, if it breaks or is very slow... Eventually something will always break, they will become unhappy when that happens.
  
  Having hourly snapshots and a fast in-kernel CIFS server fully integrated with ZFS ACLs (and with support for NTFS-style mixed case naming) is just icing on the cake.
  
  Yes, I can do that too. Just permit all access for all users to filesystem in Samba's configuration (won't override filesystem permissions obviously), then use Linux ACLs on the filesystem to permit/deny people at will. It's not rocket science.
  
  Now if only I could have an Opensolaris desktop with all the nice linux userland apps available. Oh wait, I can!
  I liked playing Tetris in the installer of the early versions of Nexenta.
  
  --
  Change is certain; progress is not obligatory.
4. Re:ZFS by Ash-Fox · 2009-03-25 03:14 · Score: 1
  
  Does Btrfs not seem like an adequate ZFS replacement for Linux?
  It's still under heavy development and the developers do not recommend it for production usage yet.
  
  --
  Change is certain; progress is not obligatory.
5. Re:ZFS by spitzak · 2009-03-25 05:40 · Score: 1
  
  The "patch" method to evade the GPL restrictions on redistribution seems possible, but nobody really seems to be trying it for anything. I can think of two problems:
  1. The source code is still readable, so if the reason is to keep your implementation secret, the patch is useless for that. This probably eliminates the majority of people who want to redistribute something without obeying the exceptions to copyright the GPL allows.
  2. The "-" lines in the patch (or whatever the equivalent are in any scheme you come up with) can be considered derived works of the GPL code.
6. Re:ZFS by jabuzz · 2009-03-25 07:44 · Score: 1
  
  Just explain how you remove a disk from a ZFS filesystem?
  Or perhaps how you set a quota?
  Or perhaps how you do HSM?
  Or perhaps how you run it in a cluster environment?
  You see, the thing is it does none of the above. ZFS seems to have a lot of fan boys who just don't do real-world high availability file serving, if you ask me. ZFS currently lacks a lot of required real-world features that I professionally use on a near daily basis.
7. Re:ZFS by Froggie · 2009-03-25 08:40 · Score: 1
  
  The reference for that patent issue: http://lkml.indiana.edu/hypermail/linux/kernel/0010.0/0343.html - the filesystem was Daniel Phillips' Tux2, and it was making some progress before development was halted due to the risk of treading on Network Appliance's toes.
8. Re:ZFS by Bill,+Shooter+of+Bul · 2009-03-25 08:45 · Score: 1
  
  Set a quota
  
  but the others seem to be planned for the future, not available right now.
  
  --
  Well.. maybe. Or Maybe not. But Definitely not sort of.
9. Re:ZFS by Froggie · 2009-03-25 08:48 · Score: 1
  
  Actually, reading up on this, I wonder about my conclusion: looks more like development dried up without an initial release, since mailing list posts go on for another year after that patent post...
10. Re:ZFS by Mr.Ned · 2009-03-25 08:55 · Score: 3, Insightful
  
  FreeBSD has ZFS. My understanding is while ZFS is a good filesystem, it isn't without issues. It doesn't work well on 32-bit architectures because of the memory requirements, isn't reliable enough to host a swap partition, and can't be used as a boot partition when part of a pool. Here's FreeBSD's rundown of known problems: http://wiki.freebsd.org/ZFSKnownProblems.
  On the other hand, the new filesystems in the Linux kernel - ext4 and btrfs - are taking the lessons learned from ZFS. I'm excited about next-generation filesystems, and I don't think ZFS is the only way to go.
11. Re:ZFS by rackserverdeals · 2009-03-27 07:25 · Score: 1
  
  Or perhaps how you set a quota?
  You can set a quota on a per filesystem basis. If you mean how to set a per user quota, you can't really do that yet but it's coming. There's nothing stopping you from creating a filesystem for each user and then assigning a quota to that filesystem.
  
  Or perhaps how you do HSM?
  How's this on ZFS and HSM?
  
  Or perhaps how you run it in a cluster environment?
  If you're interested in high availability there are options with Sun Cluster (which is free) and ZFS. If you need a cluster file system that's a whole different beast. Might want to read this ZFS for Lustre information.
  It looks like you're in the UK. Did they start censoring websites such as Google so you couldn't answer your own questions?
  
  --
  Dual Opteron < $600
12. Re:ZFS by bill_mcgonigle · 2009-04-06 13:19 · Score: 1
  
  Now that SUN has open-sourced (though not in a GPL-compatible way) ZFS and is defending that against Network Appliance in a lawsuit, the way looks a lot clearer for Btrfs and company to proceed.
  
  Sun's patent grants are only conferred if you're deriving your code from their code. So, a re-implementation doesn't get the patent grants. And you can't incorporate the CDDL code into GPL'ed code. Bummer.
  
  --
  My God, it's Full of Source!
  OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Data - metadata ordering: softupdates by ivoras · 2009-03-25 02:14 · Score: 5, Informative

Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).
Here's an excerpt:
We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.
There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?

--
-- Sig down
1. Re:Data - metadata ordering: softupdates by Anonymous Coward · 2009-03-25 02:44 · Score: 1, Interesting
  
  Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata.
  Maybe I misinterpret something here but doesn't that sound like the exact opposite of what you claim:
  
  Block Allocation. When a new block or fragment is allocated for a file,
  the new block pointer (whether in the inode or an indirect block) should not
  be written to stable storage until after the block has been initialized.
  So first initialize the data, then update the pointer in the metadata. If I am not totally mistaken that is exactly what Linus argues for.
2. Re:Data - metadata ordering: softupdates by LizardKing · 2009-03-25 03:14 · Score: 4, Informative
  
  It has proven to be very resilient (up to hardware problems).
  No it hasn't, which is why it has been removed from NetBSD and replaced by a journaled filesystem. I've also heard grumblings from OpenBSD people about corrupted filesystems with softdep enabled.
3. Re:Data - metadata ordering: softupdates by SpinyNorman · 2009-03-25 04:05 · Score: 2, Insightful
  
  (1) Never point to a structure before it has been initialized
  Which surely includes writing data before meta-data (and write the data someplace other than where the old meta-data is pointing), which is what Linus was saying.
4. Re:Data - metadata ordering: softupdates by ifrag · 2009-03-25 04:18 · Score: 1
  
  I've also heard grumblings from OpenBSD people about corrupted filesystems with softdep enabled.
  Interesting... I've never experienced corruption on any OBSD box I've setup, and that includes a fair bit of random power cycling. I figured it was stable enough that if I didn't feel like ssh'ing in for power operations I'd actually just use the switch. I suppose it's possible I just got lucky and that might have been a bad idea. I never really researched the whole softdep thing since everything seemed robust enough in my use.
  
  --
  Fear is the mind killer.
5. Re:Data - metadata ordering: softupdates by LizardKing · 2009-03-25 04:25 · Score: 1
  
  Your experience mirrors mine - I enabled softdep on most of my machines and never had an issue with it. However, there are plenty of postings on the NetBSD mailing lists discussing actual breakages tracked down to softdeps, the known technical shortcomings and how the complexity of the code made them hard to fix with any confidence that something else wouldn't break in equally subtle ways.
6. Re:Data - metadata ordering: softupdates by Anonymous Coward · 2009-03-25 04:55 · Score: 1, Informative
  
  It is quite stable in FreeBSD; might have been an error in the port to NetBSD and OpenBSD?
  I know Kirk (McKusick) had to work really hard to get it properly stable on FreeBSD.
7. Re:Data - metadata ordering: softupdates by Anonymous Coward · 2009-03-25 06:42 · Score: 1, Informative
  
  It's still present in 4.0.1 which is the latest release and, as usual, I have not heard *any* OS related grumblings from OpenBSD people, ever.
8. Re:Data - metadata ordering: softupdates by GreyFish · 2009-03-25 09:13 · Score: 1
  
  Same here, i've been using softdep on netbsd for as long as it's existed and i've had one filesystem corruption problem (which i'm not sure is softdep related).
  There are people on the NetBSD lists who are keen to get rid of it tho, and WAPBL (the new logging stuff) is pretty good.
9. Re:Data - metadata ordering: softupdates by cstdenis · 2009-03-25 10:02 · Score: 1
  
  FreeBSD has also added a journaling filesystem layer in 7.x, gjournal, though it's not used by default.
  The biggest problem I've had with softupdates is on a multi-hundred gig drive it still takes hours to fsck. Sure it's done in the background, but hours of 100% disk io usage doesn't lead to a very usable server.
  
  --
  1984 was not supposed to be an instruction manual.
10. Re:Data - metadata ordering: softupdates by uid8472 · 2009-03-25 10:37 · Score: 1
  
  It'll be deprecated in NetBSD 5.0 (which will be out Real Soon Now; it's well into the release candidate stage), and removed in 6.0; it's already been removed from -current (the CVS HEAD). 5.0 is also the first release to have WAPBL, the new journaling scheme.
Re:A UPS by ledow · 2009-03-25 02:15 · Score: 2, Insightful

Yeah, I have to second this... all the journalling filesystems in the world can't compete with a bog-standard, home-based UPS. You just need to make ABSOLUTELY sure that the system shuts down when the battery STARTS going (don't try and be fancy about getting it to run until the battery lifetime) and that the system WILL shut down, no questions asked.
A UPS costs, what, £50 for a cheap, home-based one? Batteries might cost you £20 a year or so on average (and probably a lot less if you just need "shutdown safely" rather than "carry on running"). You don't need it to give a lot of power (run ONLY the base unit off it... anything else and you could hit overloads, etc... you *won't* be operating the PC when it's on battery, you just want it to shut down and, optionally, give you a beep or two when it has shut down successfully), or for very long at all. You just need a fail-safe way of detecting when the power is out so that you can safely shutdown. You also want to check that your cabling is good (nothing more embarrassing than having a UPS and then pulling the wrong cable out).
Above and beyond that, filesystem and/or data corruption is one of those things that are almost guaranteed to happen unless you put a lot of effort into it (battery-backed RAID controllers, filesystems with slow-but-sure settings, integrity checking etc.). Make it easy on yourself - use a UPS to stop the problem happening ever, rather than try to have something that *might* clean up nicely if it does happen. Even Google don't bother with journalling - if a PC loses power, it's rebuilt from an image. It's not worth faffing about to see if/when/how a filesystem can be repaired, just ensure you have adequate backups and try to stop it happening in the first place.
Saving grace by coryking · 2009-03-25 02:38 · Score: 4, Funny

Not saying the name is Torvalds attempt at saving grace
Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?
Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.
1. Re:Saving grace by dumb_jedi · 2009-03-25 04:39 · Score: 1
  
  Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.
  Thinking about it, the FOSS community could make a petition so Hans Reiser could continue collaborating with reiserfs4. It's not like he doesn't have the time to do it.
2. Re:Saving grace by MikeBabcock · 2009-03-25 07:01 · Score: 1
  
  If you go back and read the LKML, you'll see that Reiser made a lot of enemies by wanting to make drastic changes for the sake of FS consistency. He wanted a way to atomically create data on the disk and to log data and metadata in the proper order in a guaranteed way.
  He got in trouble as I recall from the other devs because these things "shouldn't be in the filesystem" but in a higher layer and for being a prick in general (common issue among programmers it seems).
  At any rate, take a look at the goals of Reiser4 and tell me they're not what you want.
  
  --
  - Michael T. Babcock (Yes, I blog)
3. Re:Saving grace by Hal_Porter · 2009-03-25 18:30 · Score: 1
  
  Maybe Hans Reiser only ended up in San Quentin because he argued with Linus.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Re:Except ordered data mode is the (slower) defaul by Skuto · 2009-03-25 02:41 · Score: 2, Funny

You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.
I think Ted Tso etc are probably perfectly aware of how it works.
Except that ext4 loses data in ordered mode for exactly the same reason, and we had a big fuss about that the last few weeks, because *someone* (cough) said that it's the application developers' fault for not fsync()-ing.
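For anyone who hasn't followed that fuss: the application-side sequence being demanded is write to a temp file, fsync() it, then rename() it over the old one. A minimal Python sketch, assuming a POSIX system; `atomic_write` is a made-up helper name for illustration, not something from the thread or from any library:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) via temp file + fsync + rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem),
    # otherwise the final rename would not be a single atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to put the data on disk
        os.replace(tmp_path, path)   # rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also flush the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether every application should have to do this for each small config file is exactly what the argument is about; the sketch only shows what "fsync()-ing properly" looks like.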
Re:Linus by Anonymous Coward · 2009-03-25 02:48 · Score: 1, Insightful

Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm.
Linus, perhaps, is a taskmaster and perfectionist. The Linux OS is his baby and any major difficulties will ultimately be a bad reflection on him alone.
It is not inappropriate to sometimes rudely castigate one's associates. It is a kind of shaming game that is intended to inspire better performance. I recall that during the Intel ethernet fiasco involving the e1000e driver, Torvalds was equally brusque toward the Intel developers for their "stupid" oversights.
What we need is more, and not less, of such an aggressive attitude. A real man can take it. Indeed, real men will welcome it, because the end result, in spite of any hurt feelings, is an overall higher quality of craftsmanship.
Re:the new linux destroys your data by marcosdumay · 2009-03-25 02:58 · Score: 1

Of course, the truth must be somewhere in the middle :)

--
Rethinking email
Re:Except ordered data mode is the (slower) defaul by AigariusDebian · 2009-03-25 03:03 · Score: 1

ext4 by default had the equivalent of ext3 writeback mode on.
Use fadvise by Chemisor · 2009-03-25 03:14 · Score: 2, Informative

> We need a gradual level of tiers ranging from a database that does its own journaling
> and needs to know that data is fully written to disk to an application swapfile that if
> it never hits the disk isn't a big deal (granted, such an app should just use kernel swap,
> but that is another issue).
Actually there already is a syscall for telling the kernel how the file will be used.
posix_fadvise (int fd, off_t offset, off_t len, int advice)
POSIX_FADV_DONTNEED sounds like what you would use for your swapfile case.
I don't know if the kernel actually does anything with this information, but it looks like
this would be a good place to implement any new interfaces for what you are suggesting.
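A rough sketch of what that call looks like from Python, assuming Linux and Python 3.3+ where `os.posix_fadvise` is available; `write_scratch` is a hypothetical helper name. The advice is only a hint, and pages that haven't been written out yet are not affected by it, hence the fsync() first:

```python
import os


def write_scratch(path, data):
    """Write `data` (bytes), flush it, then hint that its cache is not needed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # os.write may write only part
            view = view[written:]
        os.fsync(fd)                        # dirty pages are left alone by DONTNEED
        # offset=0, len=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return len(data)
```

Note this only tells the kernel about caching; it says nothing about ordering or durability, so it doesn't give you the tiers asked for above.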
1. Re:Use fadvise by greg1104 · 2009-03-26 19:13 · Score: 1
  
  A look at the man page suggests that POSIX_FADV_DONTNEED mainly impacts the caching behavior associated with the file only after it's been written. The suggestion in order to get the behavior it helps with as early as possible? Use fsync...
Ah, I see... So what you're saying is by Colin+Smith · 2009-03-25 03:27 · Score: 1

That writing to a hard disk is slower than writing to RAM?

--
Deleted
1. Re:Ah, I see... So what you're saying is by Skuto · 2009-03-25 05:04 · Score: 1
  
  I'm saying that when you fsync() a 1k file on ext3 ordered, you can have 10G of data written out before the call returns.
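If you want to see this on your own box, here's a crude way to time it. A sketch only: `timed_fsync` is a hypothetical helper, and the number you get depends on the filesystem, the mount options and how much dirty data is queued at that moment, so try it while something else is writing heavily to the same filesystem:

```python
import os
import time


def timed_fsync(path, payload=b"x" * 1024):
    """Write a small file and return how many seconds fsync() took."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        start = time.monotonic()
        os.fsync(fd)                 # blocks until the kernel reports it flushed
        return time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("fsync took %.3f s" % timed_fsync("fsync_probe.tmp"))
```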
Re:A UPS by swilver · 2009-03-25 03:33 · Score: 2, Informative

UPSes are nice, and I use one too. But a UPS won't protect you from kernel crashes or direct hardware failures. Those would still result in corrupted discs if some filesystem decided it did not yet have to write that 2 GB of cached data. Ext3 in ordered mode is still much preferred.
What about... by tchuladdiass · 2009-03-25 03:47 · Score: 1

Instead of giving apps the ability to tag "critical" data, give them the ability to inspect the write status of data. This can be done by adding another fd_set to select() (which currently has readfds, writefds, and exceptfds). Add one called "flushedfds" that will return when all data for that file descriptor has been flushed to disk. The kernel can prioritize flushes for all files that have an active select(...flushedfds...) call pending, but otherwise it can still do writes in the optimal order. And the app can have its guarantee that critical data has been written.
1. Re:What about... by ThePhilips · 2009-03-25 09:50 · Score: 1
  
  Do not litter into deprecated interfaces like select().
  There is a whole Async I/O API especially for that purpose... apropos for aio.
  
  --
  All hope abandon ye who enter here.
Linux is GPL, period ... by Pinky's+Brain · 2009-03-25 04:01 · Score: 1

As long as ZFS licensing is incompatible with the GPL it's never going in. The person from that blog you linked understood something you clearly did not.
"The only way I'm seeing ZFS on the Linux kernel is to convince Sun to dual-license ZFS under the GPL and the CDDL."
You might not like the GPL but suggesting Linux developers should ignore it is not informative, it's completely retarded.
Re:Linus by clarkn0va · 2009-03-25 04:14 · Score: 2, Insightful

What we need is more, and not less, of such an aggressive attitude. A real man can take it.
That depends if you're trying to construct a team of "real men" or a team of skilled developers.
People sometimes confuse the idea or the act with the person it is associated with. If I propose a stupid idea or commit a stupid act, then by all means call me out and tell me that it's stupid and why. But save the ad hominem attacks. Calling somebody a moron accomplishes no good thing, and doing it in public is an extremely quick and effective way of destroying team morale.

--
I am literally 3000 tokens away from the chaotic crossbow --Stephen
Skip Linux, use [Open]Solaris and ZFS. by toby · 2009-03-25 04:16 · Score: 1

...if you want the state of the art in data integrity. (Checksumming, transactional copy on write, self healing, simple pool management, snapshots, filesystems, etc.) Read more: Solaris 10, OpenSolaris.

--
you had me at #!
1. Re:Skip Linux, use [Open]Solaris and ZFS. by jabuzz · 2009-03-25 08:01 · Score: 1
  
  But you can't remove a disk from a file system.
  So please explain to me what you do when you want to migrate that 50TB file system to new disks (because the old ones are out of support say) with no or minimal downtime?
2. Re:Skip Linux, use [Open]Solaris and ZFS. by thanasakis · 2009-03-25 10:05 · Score: 1
  
  I've been waiting for this for ages. When OpenSolaris gets it, it will blow away many supposedly "Enterprise" storage systems.
  zpool remove will eventually support the removal of any vdev, not only hotspares like it does now.
  It has been filed as a bug by the OpenSolaris developers.
  Admittedly it looks rather old, but they say work is under way to do it.
3. Re:Skip Linux, use [Open]Solaris and ZFS. by Methlin · 2009-03-25 14:48 · Score: 1
  
  But you can't remove a disk from a file system.
  So please explain to me what you do when you want to migrate that 50TB file system to new disks (because the old ones are out of support say) with no or minimal downtime?
  ??? You do it just like you would any other raid system;
  1. swap one drive (or more depending on your pool layout). Which you can do hot.
  2. let it sync (resilver)
  3. repeat until disks replaced.
  That's assuming you have data redundancy on that 50TB file system.
  
  If what you're after is expanding capacity, you have two options, add another raidz set to the pool, or replace all drives in a raidz set with larger capacity drives. That's assuming you followed best practices and didn't make your pool out of one raidz2 of 52 1TB drives.
+1 by toby · 2009-03-25 04:17 · Score: 1

So far Linux has nothing even close.

--
you had me at #!
1. Re:+1 by rackserverdeals · 2009-03-25 08:21 · Score: 1
  
  I remember, what was it, 3 years ago or so, when ZFS was first introduced, people kept saying it's no big deal, and just a matter of time before Linux comes up with something equivalent or better.
  That kind of attitude bugs me. You read about early pc history and while there's some rivalry, there's also quite a bit of collaboration and mutual respect that seems to be lacking today.
  Maybe it was a myth, but it would still be nice to see more of it.
  
  --
  Dual Opteron < $600
Re:A UPS by Wdomburg · 2009-03-25 04:19 · Score: 1

Must be a nice world where the only cause for an unclean shutdown is power interruption. And where the power supply itself never goes tits up.
Re:Linus by moderatorrater · 2009-03-25 04:19 · Score: 2, Insightful

I think it's more a matter of dealing with divas all day. It's pretty clear that the two sides of this issue are the side with technical people convinced that the correctness of the journaling system overcomes any difficulties with integrity, and people who think that integrity should be paramount. For most users, disk integrity IS the number one priority. It seems to me that this is a case of some people not being able to see that they're wrong.

In a corporation, it's as simple as saying, "do it our way or hit the street." With Linux development the leaders don't have that power, so they may replace it with forcefulness. Besides, the honesty is kind of refreshing. Linus lays out a clear argument and only then starts insulting the other person. He's being brutal, but he's giving them more information than a more polite person might.
Maybe when it matures. by toby · 2009-03-25 04:21 · Score: 1

ZFS, on the other hand, is production ready today.

--
you had me at #!
1. Re:Maybe when it matures. by jabuzz · 2009-03-25 07:40 · Score: 2, Insightful
  
  ZFS is production ready my ass. ZFS will be production ready when I can take a disk out of the filesystem, when I can set quotas, when it supports HSM and when it supports clustering.
  Finally it will be production ready when it has a decade of hardening in the real world.
  In the meantime both JFS and XFS offer better alternatives, and for me only GPFS (which admittedly is closed source but does run under Linux) ticks all the boxes.
  The crazy thing is that ext4 offers nothing that we don't get with XFS or JFS, and if RedHat would stop pussyfooting about, and support either one (and I don't care which) the whole ext? could die.
  The ext2/3 line had a place and a time, and that place and time has long gone. It needs to die...
2. Re:Maybe when it matures. by Methlin · 2009-03-25 15:08 · Score: 1
  
  ZFS will be production ready when I can take a disk out of the filesystem
  ??? What do you mean, shrinking a filesystem? Yeah, not supported yet.
  
  when I can set quotas
  Ragu (it's in there)
  
  when it supports HSM
  Also Ragu
  
  and when it supports clustering.
  Why does it need to support clustering? That's what the clustering filesystems are for.
  
  In the meantime both JFS and XFS offer better alternatives, and for me only GPFS (which admittedly is closed source but does run under Linux) ticks all the boxes.
  Guess those are all right out as well and not ready for production with your other requirement of...
  
  Finally it will be production ready when it has a decade of hardening in the real world.
  So basically it's not production ready, in your opinion, because you can't easily shrink a storage pool without migrating the data off the vdev(s) first. Gasp! Just like other storage systems. Stop the presses! No storage solutions are production ready because you can't just rip a vdev out without moving the data off first!
Re:Linus by BadLittleGuy · 2009-03-25 04:26 · Score: 1

For most users, disk integrity IS the number one priority
Sorry, but no, it isn't. You will hear them screaming utter murder when their OS needs half an hour to boot, and a file copy only goes at a few kB/s.
Users want integrity AND speed. Most won't even know there's a difference. So it's always a trade off between safety and speed. At least til we get copy-on-write filesystems and fast, big SSDs on a large scale.
Fix it by Frankie70 · 2009-03-25 04:27 · Score: 4, Funny

Maybe Linus should just fix it instead of whining about it. It's open source, dammit.
Integrity vs. consistency. by WebCowboy · 2009-03-25 05:22 · Score: 4, Informative

Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle
Linus is not clueless in this case. I think it is a case of you misinterpreting the issue he was discussing.
Journaling is, as you say, NOT about data integrity/prevention of data loss. That is what RAID and UPSes are for. However, it IS about data CONSISTENCY. Even if a file is overwritten, truncated or otherwise corrupted in a system failure (i.e. loss of data integrity) the journal is supposed to accurately describe things like "file X is Y bytes in length and resides in blocks 1,2,3...." (data/metadata consistency). Why would you update that information before you are sure the data was actually changed? A consistent journal is the WHOLE REASON why you can "alleviate the delay caused by fscking".
Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.
1. Re:Integrity vs. consistency. by drinkypoo · 2009-03-25 05:46 · Score: 1
  
  This is one of the great things that using SSDs is going to buy us in the long run when more of the goblins have been worked out. There is no longer any need to rewrite over a given piece of a file to try to avoid fragmentation; in fact, you explicitly don't do that, because you have to erase flash memories in chunks and that goes for rewriting them, too. So it makes perfect sense to actually have a filesystem which writes the data to a new section of flash, and which then updates the journal. Updating the journal and then writing data makes sense if you're on a disk, and you can update the journal again later to confirm success. But again, rewriting the same sectors on a SSD doesn't work well...
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
2. Re:Integrity vs. consistency. by DamnStupidElf · 2009-03-25 06:28 · Score: 1
  
  Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.
  A definite advantage of writing metadata to the journal first is that you could theoretically retain metadata for both the old and new versions of a file. If the order is metadata->journal, file data -> disk, journal-> disk, then once the metadata is in the journal all the new writes to the file system will have valid metadata pointers to them. Once the file data is written, the disk is updated with the journal.
  Not that ext3 necessarily does things this way; I think it applies the journal before doing anything else, which would remove the old metadata pointing to the old version of the file. The journal, data, metadata order works best for things like log based file systems or any file system that doesn't overwrite existing data.
How dumb can you get. by thethibs · 2009-03-25 06:19 · Score: 1

I don't know much about linux file systems, but now I know more than I want to. What idiot writes pointers to data that's not there yet?
The last non-trivial file system I worked on was on the Sigma 7, circa 1969, and its update sequence carefully avoided doing that; it's not like this is a new discovery. It's a basic engineering principle: "Make before Break."
And these guys have the effrontery to call themselves "software engineers."
On the other hand, they're working for free, so gift-horse and all that.

--
I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
1. Re:How dumb can you get. by thethibs · 2009-03-25 09:14 · Score: 1
  
  Chuckle
  
  --
  I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
Atomic Posts? by bobbuck · 2009-03-25 07:11 · Score: 1

Does Slashdot have ato
Banging the old drum by AliasMarlowe · 2009-03-25 07:20 · Score: 1

If you don't like the way disks work in a power outage, just switch to drum storage. Its angular momentum means that it would keep turning long enough to dump the entire core (OK, this is a bit ancient) to the drum. Sometimes, the "UPS" was a generator attached to the drum, so it powered the cpu. The drum was spun by separate motors, and had a read/write head on each track: no seek time, read & write in parallel to all tracks, great for virtual memory. They were noisy power-hogs, however.
http://en.wikipedia.org/wiki/Drum_memory

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
1. Re:Banging the old drum by dotancohen · 2009-03-26 03:42 · Score: 1
  
  Thanks, I'll try that when the time comes to upgrade the hard drive in my netbook.
  
  --
  It is dangerous to be right when the government is wrong.
*Finally* by warrax_666 · 2009-03-25 07:25 · Score: 1

someone speaks some sense. POSIX simply currently lacks fbarrier(...).

--
HAND.
1. Re:*Finally* by Hal_Porter · 2009-03-25 18:49 · Score: 1
  
  That's not transactions though, in the ACID sense. A transaction is atomic: either it completes or it is rolled back. fbarrier() doesn't quite cut it because some file operations may complete.
  http://en.wikipedia.org/wiki/ACID
  Atomicity - a bunch of file operations ending with an fbarrier is not atomic because some operations will complete before the fbarrier call.
  Consistency - it is consistent.
  Isolation - it's not isolated either, for the same reason it is not atomic.
  Durability - ok, it is durable
  NTFS has real, ACID compliant, transactions spanning several files.
  http://msdn.microsoft.com/en-us/library/aa365738(VS.85).aspx
  It's not surprising really; NTFS is built like a database and has always had per-file transactions. VMS was traditionally a transaction processing environment - I think that's where the culture comes from.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
2. Re:*Finally* by warrax_666 · 2009-03-25 23:52 · Score: 1
  
  Sure, it's not transactions per se, but fbarrier() + atomic rename() is enough for what almost all programs want to do. Programs that need actual transactions can use fsync() as they always have -- though there may certainly still be a case for providing an app-specified grouping of I/O operations (i.e. lightweight "transactions") to avoid sync'ing unrelated blocks during fsync().
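  For reference, a rough sketch of the usual "write a temp file, then rename over the original" pattern. Since fbarrier() does not exist in POSIX, this uses fsync() in its place, which is the heavier option. `atomic_write` is a made-up helper name, and the directory fsync at the end assumes a POSIX system.

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Replace `path` so a reader sees either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem),
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # force the data out before the rename
        os.replace(tmp_path, path)  # rename() over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: POSIX. Sync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

  A hypothetical fbarrier() would replace the first fsync(): it would only promise that the data reaches the disk before the rename does, without forcing either out immediately.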
  
  --
  HAND.
Just give them the Standard Slashdot Response by ClosedSource · 2009-03-25 08:43 · Score: 1

If Linus et al don't like the way ext3 works, they shouldn't complain about the developer, they should change it. After all, they have the source code.
Ah, that felt good!
Re:Because it's a terrible idea? by jbolden · 2009-03-25 08:44 · Score: 1

It doesn't go into the kernel. The database sits as part of the app layer. The kernel itself just contains very basic file systems which boot up the database + other stuff.
Re:Linus by ClosedSource · 2009-03-25 08:54 · Score: 1

"Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm."
Where is this gentle programming territory in the US? Remember, the Daily WTF was started in the USA - not exactly a font of tenderness.
NOT to improve integrity? News to me. by Ungrounded+Lightning · 2009-03-25 09:18 · Score: 1

He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking.
Well I was unaware of it, too.
And when I did a journaling system back in the mid '80s the whole POINT of it was to maintain a consistent ("though not necessarily current") filesystem on the disk at all times. ("Not necessarily current" means transactions that haven't yet hit the disk get lost in a crash. So if you want to build a reliable transaction processor on top of it you have a bit more to do.)
The idea behind it: Servers are intended to run continuously. So the commonest mode of shutdown will be system crash. Thus the server needs to:
1) Always be able to recover from a crash.
2) Do it very quickly.
(Once you have that you don't even need a shutdown mechanism. Just kill it. Kick off the clients first if you're really concerned about not reversing transactions.)
I had THOUGHT that the journaling file systems we've come to know and deploy were also based on this set of ideas. If they AREN'T, it's time to build one that IS.
(And if I'd known earlier that they weren't I might have gone and done it. B-( )
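A toy illustration of the "consistent though not necessarily current" idea, under heavy assumptions: the journal is a plain list, the record format and the `recover` function are invented for this sketch, and no real filesystem is being described. Recovery replays only transactions whose commit record made it into the journal.

```python
# Toy write-ahead journal. The record layout is an assumption for this sketch:
#   ("write", txid, key, value)  or  ("commit", txid)

def recover(journal):
    """Rebuild state after a crash, applying only committed transactions."""
    state = {}
    pending = {}  # txid -> list of (key, value) not yet committed
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "write":
            _, _, key, value = record
            pending.setdefault(txid, []).append((key, value))
        elif kind == "commit":
            for key, value in pending.pop(txid, []):
                state[key] = value
    # Whatever is still pending never committed, so it is dropped.
    return state

journal = [
    ("write", 1, "balance_a", 50),
    ("write", 1, "balance_b", 150),
    ("commit", 1),
    ("write", 2, "balance_a", 0),  # crash happened before tx 2 committed
]

print(recover(journal))  # {'balance_a': 50, 'balance_b': 150}
```

Transaction 2 is lost, so the result is not current, but it is never half-applied, so the result is consistent. Recovery time depends on the length of the journal rather than the size of the disk.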

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Re:A UPS by Baki · 2009-03-25 10:02 · Score: 1

I have had 1 power outage in 8 years (in Switzerland) but various kernel oopses over the years. So at least in my case, the chance of a sudden lockup due to a kernel bug is much, much higher than that of a power problem. Therefore a UPS won't really help.
Re:A UPS by ledow · 2009-03-25 11:32 · Score: 1

I have a wonderful data point here - by sheer coincidence, I have a computer that's been running for 8 years, with plenty of power outages and not a single kernel oops ever (it's on its fifth or sixth kernel upgrade, at least) - in fact, it would worry me if I saw a kernel oops on a machine I was relying on to store my data, as suggested by the OP, and I would probably want to integrity-check the whole damn computer. Similarly for power-supply failures, or anything else. Once you get those sorts of problems, you have bigger problems than "was the fs journalled?". I *have* seen a journalled fs that quite happily passed fsync after a power failure and had lost data - it's much easier than you think, and you can't trust it.
And what makes you think that the journalling in the case of kernel oops would help you escape a corrupt filesystem? Almost by definition, if the kernel oopses, it has messed up and done something it should NEVER have done (like trawled data across memory etc.), and that might well be in the filesystem code. Maybe I could expand my suggestion and say "UPS + a backup", but that much is obvious if you care about your data. And I do use journalling FS, but I don't *rely* on them, precisely because of things like the recent fsync() discussion... even if you THINK it's working, it doesn't mean it is. A UPS is worth MORE than a journalling fs, because it negates the need for one to a certain extent in a much simpler fashion. However, if you care about your data, the only way to be sure is to have UPS + journalling + backups + integrity check.
if I really cared about the data by toby · 2009-04-07 05:52 · Score: 1

I certainly wouldn't use LVM, RAID-5, ext4, XFS, or Linux. I'd use Solaris 10 or OpenSolaris and ZFS.

--
you had me at #!