Kernel Hackers On Ext3/4 After 2.6.29 Release

Slow performance by rootnl · 2009-03-25 00:23 · Score: 4, Funny

The server is taking too long to respond; please wait a minute or 2 and try again.

Mmmh, must be a big problem

--

We are the people our parents warned us about.

Re:Slow performance by morgan_greywolf · 2009-03-25 00:51 · Score: 5, Funny

Well, they had to switch the lkml server to ext3 because posts kept getting killed and cut into pieces with their old filesystem and the admins just kept saying "Well, they must've gone to Russia."

--
My blog
Re:Slow performance by mrsteveman1 · 2009-03-25 06:07 · Score: 4, Funny

You sure that wasn't an ad for Viagra targeted specifically to the over-the-hill nerd community?
What would a nerd need Viagra for ?^)
Longer uptime of course

Idiotic by baadger · 2009-03-25 00:27 · Score: 5, Informative

Mirror for the thread:

http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699

lkml.org server is slashdotted. by javilon · 2009-03-25 00:28 · Score: 4, Funny

this is what I get from http://lkml.org/lkml/2009/3/24/460:

"The server is taking too long to respond; please wait a minute or 2 and try again."

Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.

Like me... :-)

--

When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."

Re:lkml.org server is slashdotted. by FernandoTorres · 2009-03-25 00:44 · Score: 5, Funny

Well this is just my meta comment. I'll be writing my real comment later...
Re:lkml.org server is slashdotted. by Anonymous Coward · 2009-03-25 01:34 · Score: 5, Insightful

Well this is just my meta comment. I'll be writing my real comment later...
You forgot to include a link to the comment you'll be writing later.
Re:lkml.org server is slashdotted. by linuxrocks123 · 2009-03-25 02:30 · Score: 5, Insightful

Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.

--
vi ~/.emacs # I'm probably going to Hell for this.
Re:lkml.org server is slashdotted. by AigariusDebian · 2009-03-25 02:45 · Score: 5, Informative

On-disk state must always be consistent. That was the point of journaling, so that you do not have to do a fsck to get to a consistent state. You write to a journal what you are planning to do, then you do it, then you activate it and mark it done in the journal. At any point in time, if power is lost, the filesystem is in a consistent state - either the state before the operation or the state after the operation. You might get some half-written blocks, but that is perfectly fine, because they are not referenced in the directory structure until the final activation step is written to disk, and those half-written blocks are still considered empty by the filesystem.
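To make the sequence above concrete, here is a toy in-memory sketch of "journal the intent, write the data, commit, then activate". It is only an illustration under simplifying assumptions, not how ext3's journal actually works; `ToyJournalFS`, `write_file`, `crash_after` and `replay` are made-up names.

```python
class ToyJournalFS:
    """In-memory toy: a block store, a directory, and a metadata journal."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes (may hold unreferenced junk)
        self.directory = {}  # file name -> block number (the activated state)
        self.journal = []    # pending journal records (hypothetical format)

    def write_file(self, name, block_no, data, crash_after=None):
        # Step 1: record the intent in the journal.
        record = {"name": name, "block": block_no, "committed": False}
        self.journal.append(record)
        if crash_after == "intent":
            return
        # Step 2: write the data block; nothing references it yet.
        self.blocks[block_no] = data
        if crash_after == "data":
            return
        # Step 3: mark the record committed; this is the point of no return.
        record["committed"] = True
        if crash_after == "commit":
            return
        # Step 4: activate the new block in the directory and retire the record.
        self.directory[name] = block_no
        self.journal.remove(record)

    def replay(self):
        """Recovery after a simulated crash: redo committed records, drop the rest."""
        for record in self.journal:
            if record["committed"]:
                self.directory[record["name"]] = record["block"]
        self.journal.clear()


fs = ToyJournalFS()
fs.write_file("a.txt", 1, b"old")
fs.write_file("a.txt", 2, b"new", crash_after="data")  # crash before commit
fs.replay()
print(fs.directory["a.txt"])  # 1: the file still points at the old block
```

Whichever step the simulated crash hits, the directory ends up pointing at either the old block or the fully written new one, never at a half-finished write.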
Re:lkml.org server is slashdotted. by thomasdz · 2009-03-25 02:48 · Score: 4, Interesting

You forgot to include a link to the comment you'll be writing later.
Maybe the power failed in the middle of him writing his comment?
Don't worry...it'll appear in some other Slashdot thread until CmdrTaco does a fsck.

--
Karma: Excellent. 15 moderator points expire sometime.
Re:lkml.org server is slashdotted. by Anonymous Coward · 2009-03-25 03:39 · Score: 4, Informative

No, you're the one who's clueless.
The issue (as Linus said) isn't that the journalling is providing data integrity, it's that doing the journalling the wrong way causes *MORE* data loss.
Basically, you're sacrificing data integrity for speed, when you don't need to.
Perhaps you should work on your reading comprehension.
Re:lkml.org server is slashdotted. by mmontour · 2009-03-25 06:56 · Score: 3, Insightful

Some of us have discovered the 'shutdown' command. [...]Anyhow, I suggest you use it occasionally. Then perhaps you can only fsck when something bad has happened.
Don't be too smug - a "shutdown" doesn't always guarantee a clean startup. I remember a bug (hopefully fixed now) where "shutdown" was completing so quickly that it powered off the computer while data was still sitting in the hard drive's volatile write cache. Even though the OS had unmounted the filesystem, the on-disk blocks were still dirty.
p.s. If any OS/kernel developers are listening - how about implementing a standard API through which drive write-caches can be flushed+disabled whenever a system starts a shutdown procedure, gets a signal that the UPS is running on battery power, or otherwise concludes that it is in a state where a temporarily-increased risk of data loss justifies slowing down I/O?

Let me guess... by Puls4r · 2009-03-25 00:43 · Score: 5, Funny

The server is running linux.

Re:Let me guess... by UnRDJ · 2009-03-25 00:46 · Score: 5, Funny

too much karma for your tastes?
Re:Let me guess... by Anonymous Coward · 2009-03-25 00:53 · Score: 4, Informative

According to Netcraft, yes. Ubuntu.

Wait, this is Slashdot... I need a cliche... uh...

Netcraft confirms it, that server is dying?

OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 00:47 · Score: 5, Insightful

Quote from Linus:

"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."

In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.

How about ASKING them rather than calling them morons?

(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)

TDz.

Re:OK, then... *WHO* is the official ext3 "moron"? by morgan_greywolf · 2009-03-25 01:03 · Score: 3, Informative

Most likely Ted Ts'o, based on the git commit logs. I say most likely because someone more familiar with the kernel git repo than myself should probably confirm or deny this statement.

--
My blog
Re:OK, then... *WHO* is the official ext3 "moron"? by Anonymous Coward · 2009-03-25 01:09 · Score: 5, Insightful

Torvalds knows exactly who it is and most people following the discussion will probably know it, too.
Also, there has been a fairly public discussion including a statement by the responsible person in question.
Not saying the name is Torvalds' attempt at saving grace. Similar to a parent of two children saying, I don't know who made the mess, but when I come back, it better be cleaned up.
Yes, Mr. Torvalds is fairly outspoken.
Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · 2009-03-25 01:17 · Score: 4, Interesting

Well, some Linux filesystem developers (and some fanboys) have been chastising other (higher-performance) filesystems for not providing the guarantees that ext3 ordered mode provides.
Application developers hence were indirectly educated to not use fsync(), because apparently a filesystem giving anything other than the ext3 ordered mode guarantees is just unreasonable, and ext3 fsync() performance really sucks. (The reason why you don't actually *want* what fsync implies has been explained in the previous ext4 data-loss posts).
Some of those developers are now complaining that their "new" filesystem (designed to do away with the bad performance of the old one) is disliked by users who are losing data due to applications being encouraged to be written in a bad way, and telling the developers that they now should add fsync() anyway (instead of fixing the actual problem with the filesystem).
Moreover, they are complaining that the application developers are "weird" because of expecting to be able to write many files to the filesystem and not having them *needlessly* corrupted. IMAGINE THAT!
As an aside joke, the "next generation" btrfs which was supposed to solve all problems has ordered mode by default, but it's an ordered mode that will erase your data in exactly the same way as ext4 does.
Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.
Re:OK, then... *WHO* is the official ext3 "moron"? by 644bd346996 · 2009-03-25 01:24 · Score: 5, Informative

ext3 was merged to the mainline kernel in 2001. Git was created in 2005. I wouldn't trust any authorship evidence in a git repo for code predating the repo.
The journalling behavior of ext3 was probably decided by Stephen Tweedie.
Re:OK, then... *WHO* is the official ext3 "moron"? by red_dragon · 2009-03-25 01:25 · Score: 4, Funny

they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus
He's following Ext3 writeback semantics. You'll have to wait for a patch to fix his behaviour.

--
In Soviet Russia, Jesus asks: "What Would You Do?"
Re:OK, then... *WHO* is the official ext3 "moron"? by houghi · 2009-03-25 01:35 · Score: 5, Insightful

Knowing the humor that Linus has, it could be himself.

--
Don't fight for your country, if your country does not fight for you.
Re:OK, then... *WHO* is the official ext3 "moron"? by Ecuador · 2009-03-25 02:03 · Score: 5, Funny

Yep, we urgently need some kind of killer FS for Linux...
Oh, wait...

--
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
Re:OK, then... *WHO* is the official ext3 "moron"? by SpinyNorman · 2009-03-25 02:19 · Score: 4, Insightful

fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.
I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.
IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
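For anyone unsure about the flush-versus-fsync distinction drawn above, here is a minimal Python sketch. `file.flush()` plays the role of C's fflush() and `os.fsync()` wraps fsync(2); the helper names `write_with_flush` and `write_with_fsync` are made up for this example.

```python
import os
import tempfile


def write_with_flush(path, data):
    """Move data from the library buffer to the kernel; no on-disk guarantee."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # like fflush(): user-space buffer -> kernel page cache


def write_with_fsync(path, data):
    """Additionally ask the kernel to push the file's data to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # the library buffer must be flushed first
        os.fsync(f.fileno())  # like fsync(2): blocks until the kernel reports the write done


with tempfile.TemporaryDirectory() as d:
    write_with_flush(os.path.join(d, "log.txt"), b"visible to other processes\n")
    write_with_fsync(os.path.join(d, "db.dat"), b"should survive a power cut\n")
```

The first helper is enough when you only want other processes to see the data soon; the second is the expensive one, and it is only worth paying for when losing the data in a crash actually matters.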
Re:OK, then... *WHO* is the official ext3 "moron"? by Rich0 · 2009-03-25 02:49 · Score: 3, Interesting

I agree. What we need is a mechanism for an application to indicate to the OS what kind of data is being written (in terms of criticality/persistence/etc). If it is the gimp swapfile chances are you can optimize differently for performance than if it is a file containing innodb tables.
Right now app developers are having to be concerned with low-level assumptions about how data is being written at the cache level, and that is not appropriate.
I got burned by this when my mythtv backend kept losing chunks of video when the disk was busy. Turns out the app developers had a tiny buffer in ram, which they'd write out to disk, and then do an fsync every few seconds. So, if two videos were being recorded the disk is constantly thrashing between two huge video files while also busy doing whatever else the system is supposed to be doing. When I got rid of the fsyncs and upped the buffer a little all the issues went away. When I record video to disk I don't care if when the system goes down that in addition to losing the next 5 minutes of the show during the reboot I also lose the last 20 seconds as well. This is just bad app design, but it highlights the problems when applications start messing with low-level details like the cache.
Linux filesystems just aren't optimal. I think that everybody is more interested in experimenting with new concepts in file storage, and they're not as interested in just getting files reliably stored to disk. Sure, most of this is volunteer-driven, so I can't exactly put a gun to somebody's head to tell them that no, they need to do the boring work before investing in new ideas. However, it would be nice if things "just worked".
We need a gradual level of tiers ranging from a database that does its own journaling and needs to know that data is fully written to disk to an application swapfile that if it never hits the disk isn't a big deal (granted, such an app should just use kernel swap, but that is another issue). The OS can then decide how to prioritize actual disk IO so that in the event of a crash chances are the highest priority data is saved and nothing is actually corrupted.
And I agree completely regarding transaction support. That would really help.
Re:OK, then... *WHO* is the official ext3 "moron"? by Skuto · 2009-03-25 03:08 · Score: 3, Insightful

fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.
The two issues are very closely related, not "an entirely different issue". What the apps want is not "put this data on the disk, NOW", but "put this data on the disk sometime, but do NOT kill the old data until that is done".
Applications don't want to be sure that the new version is on disk. They want to be sure that SOME version is on disk after a crash. This is exactly what some people can't seem to understand.
fsync() ensures the first at a huge performance cost. rename() + ext3 ordered gives you the latter. The problem is that ext4 breaks this BECAUSE of the journal ordering. The "consistent state" is broken for application data.

I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
Yes. But they are assuming this exists and the API is called rename() :)
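The "old version or new version, never garbage" pattern being argued about can be sketched like this: write a temporary file, then rename it over the target. This is a minimal sketch; `atomic_replace` and its `durable` flag are hypothetical names. Whether the fsync before the rename should be required is exactly the point under dispute in this thread, since ext3 in ordered mode gave the same ordering without it.

```python
import os
import tempfile


def atomic_replace(path, data, durable=True):
    """Write data to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                # Force the new contents out before the name is switched.
                os.fsync(tmp.fileno())
        # rename(2) on POSIX: readers see either the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "settings.conf")
    atomic_replace(target, b"version 1\n")
    atomic_replace(target, b"version 2\n")
    with open(target, "rb") as f:
        print(f.read())  # b'version 2\n'
```

The temp file lives in the same directory as the target because a rename is only atomic within a single filesystem.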
Re:OK, then... *WHO* is the official ext3 "moron"? by TheLink · 2009-03-25 05:56 · Score: 4, Funny

Yeah, the metadata was written first, then only ext3 was actually created.

A filesystem that writes the metadata before the actual data, is a "Duke Nukem Forever" Filesystem.
--
- Too many replies beneath your current threshold

I would go further than Linus on this one... by pla · 2009-03-25 00:47 · Score: 4, Insightful

FTA: "if you write your data _first_, you're never going to see corruption at all"

Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.

Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!

Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.

Re:I would go further than Linus on this one... by Skuto · 2009-03-25 00:59 · Score: 5, Informative

You are confusing writeback caching with ext3/4's writeback option, which is simply something different.
The problem with all the ext3/ext4 discussions has been the ORDER in which things get written, not whether they are cached or not. (Hence the existence of an "ordered" mode)
You want new data written first, and the references to that new data updated later, and most definitely NOT the other way around.
Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
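A toy model shows why the order matters. Assume a "disk" with one pointer and two blocks, and assume a crash can happen between any two writes; everything here (`crash_states`, `file_contents`, the block names) is made up for illustration and has nothing to do with real ext3/ext4 structures.

```python
def crash_states(steps):
    """Yield the on-disk state after each prefix of steps (each prefix is a crash point)."""
    for n in range(len(steps) + 1):
        disk = {
            "pointer": "old_block",   # metadata: which block the file refers to
            "old_block": "old data",
            "new_block": "garbage",   # freshly allocated, not yet written
        }
        for key, value in steps[:n]:
            disk[key] = value
        yield disk


def file_contents(disk):
    """What a reader sees after reboot: whatever the pointer references."""
    return disk[disk["pointer"]]


data_first = [("new_block", "new data"), ("pointer", "new_block")]
metadata_first = [("pointer", "new_block"), ("new_block", "new data")]

print([file_contents(d) for d in crash_states(data_first)])
# ['old data', 'old data', 'new data']
print([file_contents(d) for d in crash_states(metadata_first)])
# ['old data', 'garbage', 'new data']
```

With data written first, every crash point leaves either the old or the new contents. With metadata first, there is a window where the file points at a block that was never written.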
Re:I would go further than Linus on this one... by AlterRNow · 2009-03-25 01:08 · Score: 3, Interesting

Am I right believing that the new data is written elsewhere and then the metadata is updated in place to point to the new data? I don't know much about filesystems..

--
The disappearing pencil trick. Let me show you it.
Re:I would go further than Linus on this one... by Anonymous Coward · 2009-03-25 01:11 · Score: 4, Insightful

Yes! This is the whole point. I am not a filesystem guy either. I don't even know that much about filesystems. But imagine you write a program with some common data storage. Imagine part of that common data is a pointer to some kind of matrix or whatever. Does anybody think it is a good idea to set that pointer first, and then initialize the data later?
Sure, a really robust program should be able to somehow recover from corrupt data. But that doesn't mean you can just switch your brain off when writing the data.
Re:I would go further than Linus on this one... by mysidia · 2009-03-25 01:12 · Score: 4, Interesting

This is a potential problem when you are overwriting existing bytes or removing data.
In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.
i.e. You truncated a file to 0 bytes, and wrote the data.
You started re-using those bytes for a new file that another process is creating.
Suddenly you are in a state where your metadata on disk is inconsistent, and you crash before that write completes.
Now you boot back up.. you're ext3, so you only journal metadata, so that's the only thing you can revert, unfortunately, there's really nothing to rollback, since you haven't written any metadata yet.
Instead of having a 0 byte file, you have a file that appears to be the size it was before you truncated it, but the contents are silently corrupt, and contain other-program-B's data
Re:I would go further than Linus on this one... by morgan_greywolf · 2009-03-25 01:29 · Score: 3, Insightful

Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
It's common sense! Duh. Write data first, pointers to data second. If the system goes down, you're far less likely to lose anything. That's obvious. Those who think this is somehow not obvious don't have the right mentality to be writing kernel code.
I think the problem is Ted Ts'o has had a slight 'works for me' attitude about it:

All I can tell you is that *I* don't run into them, even when I was
using ext3 and before I got an SSD in my laptop. I don't understand
why; maybe because I don't get really nice toys like systems with
32G's of memory. Or maybe it's because I don't use icecream (whatever
that is).

--
My blog
Re:I would go further than Linus on this one... by Hatta · 2009-03-25 01:55 · Score: 3, Insightful

In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.
i.e. You truncated a file to 0 bytes, and wrote the data.
Why on earth would you do that? Write the new data, update the metadata, THEN remove the old file.

--
Give me Classic Slashdot or give me death!
Re:I would go further than Linus on this one... by Spazmania · 2009-03-25 02:15 · Score: 4, Informative

Here's what Linus had to say, and I think he hit the nail on the head:
The point is, if you write your metadata earlier (say, every 5 sec) and
the real data later (say, every 30 sec), you're actually MORE LIKELY to
see corrupt files than if you try to write them together.
And if you write your data _first_, you're never going to see corruption
at all.
This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
literally does everything the wrong way around - writing data later than
the metadata that points to it. Whoever came up with that solution was a
moron. No ifs, buts, or maybes about it.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:I would go further than Linus on this one... by Cassini2 · 2009-03-25 04:32 · Score: 4, Insightful

When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.
Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
- Every UNIX machine has a UPS.
- Production servers run UNIX. What's this Linux you are talking about?
- Disk space is expensive. No one will pay for unused disk space.
- RAM is expensive. As such, it can be quickly flushed to disk.
- No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
Times have changed, Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the market place. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.
Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers. Applications that don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.
Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.

Safest mkfs/mount options? by Per+Wigren · 2009-03-25 00:58 · Score: 3, Interesting

If I were to set up a new home spare-part-server using software RAID-5 and LVM today, using kernel 2.6.28 or 2.6.29, and I really care about not losing important data in case of a power outage or system crash but still want reasonable performance (not run with -o sync), what would be my best choice of filesystem (EXT4 or XFS), mkfs and mount options?

--
My other account has a 3-digit UID.

Re:Safest mkfs/mount options? by remmelt · 2009-03-25 01:36 · Score: 4, Informative

You could also look into Sun's RAID-z:
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#RAID-Z
Re:Safest mkfs/mount options? by Blackknight · 2009-03-25 01:37 · Score: 3, Insightful

Solaris 10 with ZFS, if you actually care about your data.
Re:Safest mkfs/mount options? by larry+bagina · 2009-03-25 01:41 · Score: 3, Informative

with lvm, you can easily try out the various file systems (don't forget jfs!). Personally, I've found linux XFS to corrupt itself beyond repair, so I use ext3.

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:Safest mkfs/mount options? by mmontour · 2009-03-25 01:53 · Score: 4, Informative

My advice:
- Make regular backups; you'll need them eventually. Keep some off-site.
- ext3 filesystem, default "data=ordered" journal
- Disable the on-drive write-cache with 'hdparm'
- "dirsync" mount option
- Consider a "relatime" or "noatime" mount option to increase performance (depending on whether or not you use applications that care about atime)
- If you don't want the performance hit from disabling the on-drive write-cache, add a UPS and set up software to shut down your system cleanly when the power fails. You are still vulnerable to power-supply failures etc. even if you have a UPS.
- Schedule regular "smartctl" scans to detect low-level drive failures
- Schedule regular RAID parity checks (triggered through a "/sys/.../sync_action" node) to look for inconsistencies. I have a software-RAID1 mirror and I've found problems here a few times (one of which was that 'grub' had written to only one of the disks of the md device for my /boot partition).
- Periodically compare the current filesystem contents against one of your old backups. Make sure that the only files that are different are ones that you expected to be different.
If you decide to use ext4 or XFS most of the above points will still apply. I don't have any experience with ext4 yet so I can't say how well it compares to ext3 in terms of data-preservation.
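The last item in the list (comparing the live filesystem against an old backup) is easy to script. A rough sketch, assuming both trees are mounted and readable; `tree_digests` and `compare_trees` are made-up names, and symlinks and special files are not handled.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path relative to root to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files do not need to fit in RAM.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def compare_trees(live_root, backup_root):
    """Return (changed, only_in_live, only_in_backup) as sorted lists of paths."""
    live = tree_digests(live_root)
    backup = tree_digests(backup_root)
    changed = sorted(p for p in live.keys() & backup.keys() if live[p] != backup[p])
    only_in_live = sorted(live.keys() - backup.keys())
    only_in_backup = sorted(backup.keys() - live.keys())
    return changed, only_in_live, only_in_backup
```

Calling `compare_trees("/home", "/mnt/backup/home")` (both paths are just examples) returns three lists; anything in `changed` that you did not expect to change deserves a closer look.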

Um. This doesn't make sense. by Colin+Smith · 2009-03-25 01:41 · Score: 4, Insightful

Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.

from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html

"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.

"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.

So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...

--
Deleted

ZFS by chudnall · 2009-03-25 01:57 · Score: 4, Informative

Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots and a fast in-kernel CIFS server fully integrated with ZFS ACLs (and with support for NTFS-style mixed case naming) is just icing on the cake. Now if only I could have an OpenSolaris desktop with all the nice linux userland apps available. Oh wait, I can!
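The end-to-end checksumming idea can be shown with a toy block store that keeps a checksum apart from the data and verifies it on every read. This is only a sketch of the general idea, not ZFS's on-disk format; `ChecksummedStore` is a made-up name.

```python
import hashlib


class ChecksummedStore:
    """Toy block store that detects silent corruption on read."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes, stands in for the disk
        self.checksums = {}  # block id -> SHA-256 hex digest, stored separately

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()

    def read(self, block_id):
        data = self.blocks[block_id]
        if hashlib.sha256(data).hexdigest() != self.checksums[block_id]:
            # Without the checksum this bad data would be returned silently.
            raise IOError(f"checksum mismatch on block {block_id}")
        return data


store = ChecksummedStore()
store.write(7, b"payload")
store.blocks[7] = b"pay1oad"  # simulate a silent bit flip on the "disk"
try:
    store.read(7)
except IOError as err:
    print(err)  # checksum mismatch on block 7
```

A plain mirror without such a check has no way to tell which copy is the good one, which is how bad data can end up propagating into backups.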

--
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.

Re:ZFS by mikeee · 2009-03-25 02:58 · Score: 3, Interesting

It's similar (at least, a lot more similar than any other Linux filesystem), but less mature.
In defense of the LK team on the whole ZFS issue, I understand that part of the reason they didn't pursue some ZFS-like features years ago was because of patents. Now that SUN has open-sourced (though not in a GPL-compatible way) ZFS and is defending that against Network Appliance in a lawsuit, the way looks a lot clearer for Btrfs and company to proceed.
Actually, on that thought, the IBM acquisition of SUN should get NetApp to drop that lawsuit. Going up against SUN in a MAD patent dispute is a bit risky, but (as SCO discovered) aggressive IP lawsuits against IBM come in right behind "land war in Asia".
Re:ZFS by Mr.Ned · 2009-03-25 08:55 · Score: 3, Insightful

FreeBSD has ZFS. My understanding is while ZFS is a good filesystem, it isn't without issues. It doesn't work well on 32-bit architectures because of the memory requirements, isn't reliable enough to host a swap partition, and can't be used as a boot partition when part of a pool. Here's FreeBSD's rundown of known problems: http://wiki.freebsd.org/ZFSKnownProblems.
On the other hand, the new filesystems in the Linux kernel - ext4 and btrfs - are taking the lessons learned from ZFS. I'm excited about next-generation filesystems, and I don't think ZFS is the only way to go.

Data - metadata ordering: softupdates by ivoras · 2009-03-25 02:14 · Score: 5, Informative

Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).

Here's an excerpt:

We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.

There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
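The three rules in the excerpt are ordering constraints, so they can be written down as a small dependency graph and sorted. This only illustrates the constraints themselves; real soft updates track dependencies at a much finer grain, and the node names here are made up. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each key may be written only after everything in its set has been written.
depends_on = {
    # Rule 1: never point to a structure before it has been initialized.
    "write directory entry": {"initialize inode"},
    # Rule 2: never reuse a resource before nullifying old pointers to it.
    "reallocate data block": {"nullify old inode pointer"},
    # Rule 3: never reset the last pointer before the new pointer is set.
    "remove old name": {"write new name"},
}

# static_order() yields every node after all of its prerequisites.
write_order = list(TopologicalSorter(depends_on).static_order())
print(write_order)
```

Any order the sorter produces satisfies all three rules; a write-back cache that respects such a graph can delay writes as long as it likes without leaving dangling pointers on disk.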

--
-- Sig down

Re:Data - metadata ordering: softupdates by LizardKing · 2009-03-25 03:14 · Score: 4, Informative

It has proven to be very resilient (up to hardware problems).
No it hasn't, which is why it has been removed from NetBSD and replaced by a journaled filesystem. I've also heard grumblings from OpenBSD people about corrupted filesystems with softdep enabled.

Saving grace by coryking · 2009-03-25 02:38 · Score: 4, Funny

Not saying the name is Torvalds' attempt at saving grace

Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?

Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.

Fix it by Frankie70 · 2009-03-25 04:27 · Score: 4, Funny

Maybe Linus should just fix it instead of whining about it. It's open source, dammit.

Integrity vs. consistency. by WebCowboy · 2009-03-25 05:22 · Score: 4, Informative

Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle

Linus is not clueless in this case. I think it is a case of you misinterpreting the issue he was discussing.

Journaling is, as you say, NOT about data integrity/prevention of data loss. That is what RAID and UPSes are for. However, it IS about data CONSISTENCY. Even if a file is overwritten, truncated or otherwise corrupted in a system failure (i.e. loss of data integrity) the journal is supposed to accurately describe things like "file X is Y bytes in length and resides in blocks 1,2,3...." (data/metadata consistency). Why would you update that information before you are sure the data was actually changed? A consistent journal is the WHOLE REASON why you can "alleviate the delay caused by fscking".

Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.
