What's the Damage? Measuring fsck Under XFS and Ext4 On Big Storage

Posted by timothy on Friday February 3, 2012 @05:40AM from the disks-groaning-with-shame dept.

An anonymous reader writes "Enterprise Storage Forum's long-awaited Linux file system Fsck testing is finally complete. Find out just how bad the Linux file system scaling problem really is."

fsck speed, want safety by Anonymous Coward · 2012-02-03 05:47 · Score: 3, Insightful

How fast a full fsck scan runs is my last concern. What about how successful they are at recovering the filesystem?
1. Re:fsck speed, want safety by h4rr4r · 2012-02-03 05:52 · Score: 5, Insightful
  
  If you need to fsck you should already be restoring from backups onto another machine.
2. Re:fsck speed, want safety by rickb928 · 2012-02-03 05:59 · Score: 4, Insightful
  
  More helpful advice from the Linux community. Thank you ever so much, once again right on point, timely, and effective.
  
  --
  deleting the extra space after periods so i can stay relevant, yeah.
3. Re:fsck speed, want safety by pankkake · 2012-02-03 06:07 · Score: 2
  
  Most popular Linux filesystems are atomic and should not need fsck, unless something really bad happens.
  
  --
  Kill all hipsters.
4. Re:fsck speed, want safety by h4rr4r · 2012-02-03 06:15 · Score: 3, Interesting
  
  Because sometimes it does work. Relying on any such software is stupid.
  While the FSCK/CHKDSK runs you restore onto another machine. This way if the check finishes first, you can use it until you can switch over to the restored machine. It also can save your ass if you are not smart enough/fortunate enough to have good backups.
5. Re:fsck speed, want safety by pankkake · 2012-02-03 06:22 · Score: 2
  
  Now that's just plain dishonesty.
  Just because it isn't useful most of the time doesn't mean it is useless or shouldn't be used.
  Some filesystems are not atomic or can be mounted with non-atomic options.
  Data corruption occurs.
  It's simply useful to test if the filesystem is all right. At least for developers.
  Doesn't change the fact that you can't rely on fsck to *recover* data.
  
  --
  Kill all hipsters.
6. Re:fsck speed, want safety by grumbel · 2012-02-03 06:24 · Score: 2
  
  Yep, my last experience with fsck was after an HDD had developed a few bad sectors. fsck on the ext3 file system let me recover the data all right, except of course for the filenames, so I ended up with a whole lot of unsorted and unnamed stuff in /lost+found, which wasn't very helpful. I'd really like to see more focus on how secure the filesystems are and less on how fast they are.
7. Re:fsck speed, want safety by kangsterizer · 2012-02-03 06:25 · Score: 2
  
  But then again, you'll want to fsck from time to time to know if you have an issue.
  If you wait for the issue to surface ("hey boss, we apparently lost half the db"), you'll lose more data while the corruption goes unnoticed than if you had detected it earlier.
  Thus being able to fsck in a decent amount of time matters.
  That's not the only thing, of course. Sometimes you don't have a backup. Sometimes things are fucked up. Sometimes you're just required to get the thing running before the backup restoration is complete. Etc.
  Otherwise, you know, we could just delete fsck, since as you pointed out, it's *never* needed!
  Yeah, right. :)
8. Re:fsck speed, want safety by h4rr4r · 2012-02-03 06:30 · Score: 2
  
  No, its primary job is to tell you about integrity of the filesystem. Any attempt at fixing it is secondary.
9. Re:fsck speed, want safety by darkpixel2k · 2012-02-03 06:36 · Score: 5, Funny
  
  when I need to fsck, I just call my girlfriend
  Why? Do you not know how to use the command line?
  
  --
  There's no place like ::1 (I've completed my transition to IPv6)
10. Re:fsck speed, want safety by hackstraw · 2012-02-03 06:50 · Score: 5, Interesting
  
  The largest filesystem I admin is just shy of 1/2 petabyte. And it's a single filesystem. Backing up everything on that filesystem is simply not feasible. To put it in perspective, 1 stream @ 200 MiB/s would take almost 28 days to back up the whole thing. I would imagine a restore would take about the same order. Telling hundreds of users their files are unavailable for reading or writing for 30 days is not really an option, so I run fsck.
  Backups simply are not really an option past 20+ terabytes of storage, and simply not feasible if the storage is volatile in nature. AFAIK everyone has gone to redundancy over backups at scale.
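The "almost 28 days" figure above can be checked with a quick calculation. This is only a sketch: it assumes a decimal half-petabyte (0.5 x 10^15 bytes) and a sustained 200 MiB/s per stream, and the `streams` parameter is a hypothetical knob added here to show how parallel streams would scale the estimate.

```python
# Back-of-the-envelope time for one full backup pass.
PB = 10**15                # decimal petabyte in bytes (assumption)
MIB = 2**20                # mebibyte in bytes
SECONDS_PER_DAY = 86_400


def backup_days(total_bytes: float, stream_bytes_per_s: float, streams: int = 1) -> float:
    """Days needed to move total_bytes over `streams` equal, fully parallel streams."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    return total_bytes / (stream_bytes_per_s * streams) / SECONDS_PER_DAY


# One 200 MiB/s stream over 0.5 PB comes out at roughly 27.6 days.
for n in (1, 4, 8):
    print(f"{n} stream(s): {backup_days(0.5 * PB, 200 * MIB, n):.1f} days")
```

The model ignores tape changes, verification passes and contention with production I/O, so real runs would likely take longer.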
11. Re:fsck speed, want safety by h4rr4r · 2012-02-03 06:56 · Score: 2
  
  You need to be writing that data to two or more, more really, filesystems at the same time. Streaming replication.
  Redundancy can be backups, if they are in different locations and proper versioning is used.
12. Re:fsck speed, want safety by chuckymonkey · 2012-02-03 07:05 · Score: 5, Insightful
  
  You're fairly wrong there, you can actually back that much data up. You just have to be willing to pay for some seriously large tape libraries and they're not cheap. We're in the process of installing a 700TB array with a 1.5PB tape library backup. You just have to do the backups using filesystem snapshots and run them pretty much constantly.
  
  --
  "Some books contain the machinery required to create and sustain universes."-Tycho
13. Re:fsck speed, want safety by HiThere · 2012-02-03 07:14 · Score: 2
  
  The last time I checked, the system required that fsck be run after a power loss. Also after the first reboot after n days had passed. (I think n is somewhere around 200, but I haven't been interested enough to pin it down precisely.) And occasionally a system upgrade will require a reboot.
  OTOH, recovery is definitely a lot faster than it used to be, thanks to journaling.
  OTTH, all of my partitions together are barely over 1TB, so this is only significant (to me) for future systems, when this will have changed anyway.
  
  --
  
  I think we've pushed this "anyone can grow up to be president" thing too far.
14. Re:fsck speed, want safety by phorm · 2012-02-03 07:14 · Score: 3, Insightful
  
  If you're in a scenario where "Backups are not really an option", somebody is doing something wrong...
  How long did it take you to get to 0.5PB? If you use a differential backup/sync, then you should generally only need to copy *NEW* data, and the old stuff will already be there.
15. Re:fsck speed, want safety by Nutria · 2012-02-03 07:15 · Score: 2
  
  you restore onto another machine
  ROTFLMAO,
  We struggle to even get test machines; there's no way that "they" would pay for all that kit to just sit there gathering dust waiting for a disaster. If anything, it would be our DR machine and we'd instantly flip production over to it.
  
  --
  "I don't know, therefore Aliens" Wafflebox1
16. Re:fsck speed, want safety by Anonymous Coward · 2012-02-03 07:38 · Score: 2, Interesting
  
  So you have 1/2 petabyte storage but 200 MiB/s speed -- are you kidding me? Is your storage controller broken or really cheap or both?
  Also, xfsdump (which is used to backup xfs) can do multi-threaded backups.
  Now to comment on the test -- it is completely insane. As mentioned by you and others, if you are running fsck while your whole application is down, the broken thing is not the system but the thing inside the skull. You will obviously need a very fast backup/restore and/or an HA solution; the two are not (and need not be) mutually exclusive.
17. Re:fsck speed, want safety by tlhIngan · 2012-02-03 08:23 · Score: 4, Interesting
  
  The largest filesystem I admin is just shy of 1/2 petabyte. And it's a single filesystem. Backing up everything on that filesystem is simply not feasible. To put it in perspective, 1 stream @ 200 MiB/s would take almost 28 days to back up the whole thing. I would imagine a restore would take about the same order. Telling hundreds of users their files are unavailable for reading or writing for 30 days is not really an option, so I run fsck.
  Which means You're Doing It Wrong(tm).
  Two words: volume snapshot.
  What it does is give you a view of the filesystem as it exists at the time the snapshot is taken. The frozen image is mounted at another mountpoint (read-only), while the snapshotted volume is still accessible (read-write). Changes to the volume since the snapshot was taken won't be in the snapshot (obviously).
  Your backup points to that snapshot, which won't change, and that's copied to tape. Once you're done backing up 30 days later, you delete the snapshot.
  Since your backup takes so long, you'd immediately then make another snapshot and begin the backup again.
  If it's a database, the database backup tools work on a database snapshot - it will be correct and consistent as of when the snapshot was taken while the database remains available for reading and writing outside of the snapshot.
  Having to take a system down to back it up is a dead concept on modern OSes as they all tend to have snapshot capability.
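To make the snapshot idea above concrete, here is a toy copy-on-write model. It is a sketch only: the `Volume` and `Snapshot` classes are invented for illustration and do not correspond to any real volume manager's API. It shows why the snapshot view stays fixed while the live volume keeps accepting writes.

```python
class Snapshot:
    """Read-only, point-in-time view of a Volume (toy model)."""

    def __init__(self, volume):
        self._volume = volume
        self._saved = {}  # block index -> contents at snapshot time

    def _preserve(self, index, old_data):
        # Keep only the first pre-write copy; later overwrites must not replace it.
        self._saved.setdefault(index, old_data)

    def read(self, index):
        # Blocks changed since the snapshot come from the saved copies,
        # untouched blocks are read straight from the live volume.
        if index in self._saved:
            return self._saved[index]
        return self._volume.read(index)


class Volume:
    """Toy block device that supports copy-on-write snapshots."""

    def __init__(self, blocks):
        self._blocks = list(blocks)
        self._snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self._snapshots.append(snap)
        return snap

    def release(self, snap):
        # Deleting the snapshot once the backup has finished.
        self._snapshots.remove(snap)

    def read(self, index):
        return self._blocks[index]

    def write(self, index, data):
        # Copy the old block aside for every live snapshot before overwriting.
        for snap in self._snapshots:
            snap._preserve(index, self._blocks[index])
        self._blocks[index] = data


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")                    # users keep writing during the backup
print(snap.read(1), vol.read(1))     # snapshot still sees the old block
```

A backup job would read every block through `snap.read` and call `vol.release(snap)` when it finishes, which matches the take, copy, delete, repeat cycle described in the comment.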
18. Re:fsck speed, want safety by _LORAX_ · 2012-02-03 08:25 · Score: 4, Informative
  
  Backups simply are not really an option past 20+ terabytes of storage, and simply not feasible if the storage is volatile in nature. AFAIK everyone has gone to redundancy over backups at scale.
  200TB/130TB usable clustered/distributed system with 4x LTO5 drives and we do a full snapshot to tape every week. With data that size you either pay up-front for proper engineering or you pay for the life of the system for poor performance and eventual cleanup of the mess.
19. Re:fsck speed, want safety by lvxferre · 2012-02-03 08:42 · Score: 5, Funny
  
  Protip: if 'make love' returns no target, you need to do the job by hand.
  
  --
  Nerdy news for your nerdy needs? http://www.soylentnews.org Soylent News is people!
20. Re:fsck speed, want safety by Aighearach · 2012-02-03 08:49 · Score: 2
  
  Databases.
21. Re:fsck speed, want safety by h4rr4r · 2012-02-03 09:12 · Score: 3, Insightful
  
  Most people are worried more about cost than reliability.
  Most people is often a category that does not do things the best way or the right way.
22. Re:fsck speed, want safety by chuckymonkey · 2012-02-03 09:23 · Score: 3, Insightful
  
  I know I'm posting to an AC here, but I want to point something out. "Backups simply are not really an option past 20+ terabytes of storage, and simply not feasible if the storage is volatile in nature." He was claiming that it's not feasible to back up more than 20+ TB of storage when in fact it is. I was pointing out that yes you can, but it's pretty expensive.
  
  --
  "Some books contain the machinery required to create and sustain universes."-Tycho
23. Re:fsck speed, want safety by Daniel+Phillips · 2012-02-03 09:28 · Score: 2
  
  You are not helpful. In the real world fsck is an important determinant of filesystem robustness. In your career, your proverbial butt will be saved at least once by a good fsck, and you will be left twisting in the breeze at least twice because of a bad or absent fsck. Why twice? Because that is how many times it takes to send the message to someone unwilling to receive it.
  
  --
  Have you got your LWN subscription yet?
24. Re:fsck speed, want safety by Dishevel · 2012-02-03 09:36 · Score: 2
  
  Yup.
  Every week I switch over my systems (master/slave arrangement) and take the old master down and fsck.
  Making sure all is well. Sometimes there is a small issue. It fixes it. All is well.
  So far I have never had a catastrophe where I lose all data on the master while my slave is down hard.
  Going to a tape backup even a day old is going to be bad news.
  
  --
  Why is it so hard to only have politicians for a few years, then have them go away?
25. Re:fsck speed, want safety by ion++ · 2012-02-03 09:53 · Score: 2
  
  We're in the process of installing a 700TB array with a 1.5PB tape library backup. You just have to do the backups using filesystem snapshots and run them pretty much constantly.
  And XFS is pretty brilliant for taking filesystem snapshots. Using the command xfs_freeze you can make good snapshots of XFS with what appears to be no downtime at all; see the xfs_freeze man page at http://linux.die.net/man/8/xfs_freeze
  And then run these commands:
  
  xfs_freeze -f /mount/point && block_level_snapshot && xfs_freeze -u /mount/point
  
  Last time I checked that did not work with EXT4.
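One caveat with the `&&` chain above: if the snapshot step fails, the unfreeze never runs and the filesystem stays frozen. A minimal sketch of the same sequence with a guaranteed unfreeze follows. The names `snapshot_frozen`, `run_checked` and the `run` parameter are hypothetical helpers introduced here, and `block_level_snapshot` remains the commenter's placeholder, not a real command.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def snapshot_frozen(mount_point: str,
                    snapshot_cmd: Sequence[str],
                    run: Runner = run_checked) -> None:
    """Freeze the filesystem, take the snapshot, and always unfreeze afterwards."""
    run(["xfs_freeze", "-f", mount_point])
    try:
        # snapshot_cmd is whatever your storage layer provides (placeholder here).
        run(snapshot_cmd)
    finally:
        # Runs even when the snapshot command fails, so writers are not left blocked.
        run(["xfs_freeze", "-u", mount_point])
```

It would be called as `snapshot_frozen("/mount/point", ["block_level_snapshot"])`, with the placeholder swapped for the real snapshot command. Passing a different `run` callable makes it possible to dry-run the sequence without touching a real filesystem.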
fsck xfs does something? by drewstah · 2012-02-03 05:54 · Score: 3, Interesting

When I had some EBS problems a couple years ago, I figured I would run xfs_check. It seemed to do absolutely nothing, even if there were disks known to be bad in the md array. xfs is nice and fast, but I haven't seen xfs_check or xfs_repair do either of the things I'd assume they'd do -- check and repair. I found it easier to delete the volumes and start from scratch, because any compromised xfs filesystem seems to be totally unfixable. Is fsck for xfs new?

--
I do stuff Zhrodague
1. Re:fsck xfs does something? by larry+bagina · 2012-02-03 06:00 · Score: 2
  
  I set up an xfs volume a couple years back. After copying a few files over nfs, it became corrupted. The xfs fsck did something -- it told me that it was so corrupted, it couldn't be fixed.
  
  --
  Do you even lift?
  These aren't the 'roids you're looking for.
2. Re:fsck xfs does something? by Sipper · 2012-02-03 07:37 · Score: 3, Informative
  
  I set up an xfs volume a couple years back. After copying a few files over nfs, it became corrupted. The xfs fsck did something -- it told me that it was so corrupted, it couldn't be fixed.
  I think you mean xfs_repair. On XFS, fsck is a no-op.
  I've never yet seen xfs_repair tell me there was an issue it couldn't fix -- that sounds unusual. However, there have been lots of changes to XFS in the Linux kernel in recent years, and occasionally there have been a few nasty bugs, some of which I ran into. Linux-2.6.19 in particular had some nasty XFS filesystem corruption bugs.
3. Re:fsck xfs does something? by Sipper · 2012-02-03 11:51 · Score: 2
  
  You have your root filesystem mounted read only and then run xfs_repair on it. Sometimes getting your root filesystem remounted read-only can be tricky, however. Sometimes this requires passing init=/bin/sh to the kernel, so you start with no other processes running. However you go about getting your root filesystem mounted read only, after you run xfs_repair (or e2fsck for that matter really) you reboot immediately.
  Just tested it [on the box in which I'm using XFS on top of LUKS encryption], and I didn't like the results.
  grub2 by default on Debian makes a "recovery" boot option to boot into single user mode, but even with this, as you mention, it's required to modify the boot option and add init=/bin/sh in order to actually be able to mount the root filesystem read-only. However, after finally succeeding in doing this, xfs_check reports about a full screen of errors concerning file and directory link counts, which all appear simply to be due to the filesystem being mounted and in use. When using a Knoppix CD (v6.4.4) and after using 'cryptsetup luksOpen ' to decrypt the root partition, xfs_check reports no errors at all. [And I did run xfs_repair anyway just to double-check in the latter case, and no errors were found.]
  Furthermore, upon trying to reboot from or exit the single-user mode, I got an error related to "trying to kill init" immediately followed by a kernel panic.
  So I'll admit that I was wrong and that it is possible to run xfs_repair on an XFS filesystem mounted read-only, but I really don't like the results and I strongly recommend against it.
  
  Stop trying to oversimplify things you don't understand.
  Perhaps you don't understand things as well as you think you do. See the section below regarding the -d option to xfs_repair and the context in which you'd use it.
  -d Repair dangerously. Allow xfs_repair to repair an XFS filesystem mounted read only. This is typically done on a root filesystem from single user mode, immediately followed by a reboot.
  
  I had tried it before and IIRC I had lots of trouble getting the filesystem mounted read-only, and had confusing and poor results when I finally did get it mounted read-only. All I remembered clearly in my mind was "it really didn't work", and having gone through it again I still think it doesn't. You can judge for yourself what you think I know or not. ;-)
Breaking News! by Anonymous Coward · 2012-02-03 05:54 · Score: 2, Funny

This just in:
Full filesystem scans take longer as the size of the filesystem increases.
News at 11.
Damage? by eggstasy · 2012-02-03 06:02 · Score: 3, Funny

Honey badger don't give a fsck.
Who would engineer a storage system like that? by Anonymous Coward · 2012-02-03 06:11 · Score: 2, Insightful

A single file system that big without checking features that file systems like ZFS or clustering file stores provide seems insane to me.
Re:linux is fail by hobarrera · 2012-02-03 06:11 · Score: 2

I'll go tell _average joe/jane_ to go and get AIX, and dump ubuntu+unity which they like so much because it's shiny and pretty.
Re:Why bother? by _LORAX_ · 2012-02-03 06:16 · Score: 2

After evaluating our options in the 50-200TB range with room for further growth we ended up moving away from linux and to an object based storage platform with a pooled, snapshotted, and checksummed design. One of the major reasons for this was the URE problem, we would virtually be guaranteeing silent data corruption at that size with a filesystem that did not have internal checksums. The closest thing in the OS world would be ZFS whose openness is in serious doubt. It is scary how much trust the community places on spinning rust.
The tests are also useless since the "speed" will be linearly controlled by the IOPS of the array. Sure, it would be nice to be able to throw 10x15k spindles at 3.5TB (230 disks for the 72TB test); that's one way to improve random IO performance. But how many can afford such luxury on a big data store that could reach into the 100's of TB?
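The URE (unrecoverable read error) concern in the comment above can be put into rough numbers. This is a sketch under stated assumptions: the 1e-14 and 1e-15 per-bit error rates are example inputs of the kind quoted on drive datasheets, not figures from this thread, and the model treats errors as independent.

```python
import math


def expected_ures(bytes_read: float, errors_per_bit: float) -> float:
    """Expected number of unrecoverable read errors when reading bytes_read once."""
    return bytes_read * 8 * errors_per_bit


def p_at_least_one_ure(bytes_read: float, errors_per_bit: float) -> float:
    """Chance of hitting one or more UREs (Poisson approximation)."""
    return -math.expm1(-expected_ures(bytes_read, errors_per_bit))


TB = 10**12  # decimal terabyte in bytes

for rate in (1e-14, 1e-15):  # assumed spec-sheet rates, for illustration only
    print(f"rate {rate:g}: expect {expected_ures(72 * TB, rate):.2f} UREs, "
          f"P(at least one) = {p_at_least_one_ure(72 * TB, rate):.3f}")
```

Under these assumptions a single full read of 72TB is expected to hit several errors at the higher rate, which is the sense in which corruption becomes "virtually guaranteed" without checksums at that scale.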
Re:linux is fail by gweihir · 2012-02-03 06:36 · Score: 4, Insightful

A cranky coward from the shadows is not a reliable source of information.
I have used AIX and Solaris, and I can say that a lot of stuff is easier on Linux.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:linux is fail by Anonymous Coward · 2012-02-03 06:38 · Score: 5, Funny

sudo kill yourself
;-)
Re:linux is fail by hawguy · 2012-02-03 06:46 · Score: 2

I'll go tell _average joe/jane_ to go and get AIX, and dump ubuntu+unity which they like so much because it's shiny and pretty.
Few average Joes have 72TB of disk space, and even for those that do, they're probably ok with 30 - 60 minutes of FSCK time. And more likely, instead of 100's of millions of files, they probably have a few million, so their fsck time will be in the 3 - 15 minute range.
I've seen servers that take over 3 minutes for their POST check.
Re:Why bother? by _LORAX_ · 2012-02-03 06:47 · Score: 3, Interesting

Our BTRFS evaluation resulted in rejecting it for some very serious problems ( what they claim are snapshots are actually clones, panic in low memory situations, no fsck, horrible support tools, developers who are hostile to criticism, pre-release software, ... ). ZFS was nice, but limited to non-distributed systems and still had a non-trivial amount of volume and backend management headaches. Personally I use ZFS for my personal servers at home ( incremental snapshots are the bomb ) but our production systems needed more.
Re:Fsck times by Gazzonyx · 2012-02-03 07:12 · Score: 2

They were using 15K RPM SAS drives. Your 7200 RPM drives aren't going to touch the speed of 15K RPM drives on a SAS backplane. Not by a long shot.

--
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Re:linux is fail by ifrag · 2012-02-03 07:15 · Score: 3, Funny

I like how you completely ignored Solaris yet still presented the comment as if it was a valid counterargument.
I also like how GP completely ignored Solaris. I just like the fact it is being ignored.

--
Fear is the mind killer.
Re:Why bother? by Guspaz · 2012-02-03 07:41 · Score: 3, Interesting

ZFS now runs pretty well on Linux too, as a kernel module, thanks to zfsonlinux. If you're running a Debian-based distro, installing it is trivial (one command to add the PPA, one command to install the package).
Re:linux is fail by fnj · 2012-02-03 07:50 · Score: 2

killall Anonymous\ Coward
Re:linux is fail by cryptographrix · 2012-02-03 08:08 · Score: 2

...until you have a drive die during a scrub, destroy a zfs filesystem in a deduplicating zpool, or any other number of things that make ZFS **ANGRY**, that is. And despite all that, I still trust it more than most Linux filesystems.
Re:linux is fail by aix+tom · 2012-02-03 08:16 · Score: 4, Informative

You see my nick?
AIX sucks more than Linux.
Usual process for "weird"* AIX Problems:
1) Weird problem occurs after install. You report problem to IBM.
2) IBM asks for your software versions, sees they are the newest ones available, and says they will look into it.
3) You ask several months later if they did find anything. They ask for your software versions, then ask you to upgrade and see if the problem goes away.
4) You upgrade to newest version.
5) go to 2)
*There are of course non-weird problems where you get the answer from IBM support in 2-3 days, and from Linux forums in 2-3 minutes.
Re:Why bother? by ratsg · 2012-02-03 08:38 · Score: 2

And ZFS is available for Mac OS X systems as an add-on, both as open source and, as of this week, in a commercial version.
There is very little reason to be running a system without ZFS, unless you are running AIX, HP-UX or IRIX.
Re:linux is fail by lvxferre · 2012-02-03 08:40 · Score: 3, Funny

Why would you replace a zero-ed string with another? At least use /dev/random, bro.

--
Nerdy news for your nerdy needs? http://www.soylentnews.org Soylent News is people!
Damage? by erice · 2012-02-03 08:55 · Score: 2

When an article about fsck has a tag line of "What's the damage", I expect to see some discussion of how fsck deals with a damaged file system.
The time required to fsck a file system that doesn't need checking is less interesting and inconsistent with the title. Although, if fsck had complained about the known clean file system, that would be interesting.
Re:linux is fail by Saxophonist · 2012-02-03 09:09 · Score: 3, Funny

No, you're thinking of ReiserFS.
Re:linux is fail by jd · 2012-02-03 09:32 · Score: 3, Interesting

Works best if you use the "Doom as Sys Admin" hack.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:linux is fail by jd · 2012-02-03 09:58 · Score: 5, Interesting

A lot of stuff is also faster on Linux, particularly on the x86. Solaris x86 is dog slow. AIX ("aches") is an appropriate name for a mainframe OS that never really got the hang of this new-fangled "interactive user" stuff. It's a good mainframe OS, that is what it is designed for, tuned for and intended for, but traditional mainframe batch transactional work isn't the sort of payload that is typically run these days. The high-end users want hard real-time (i.e.: they know to the microsecond - or nanosecond, in some cases - exactly when each process will start and stop) for data collection, data analysis and simulation. The data centers want massive multithreading for gigantic servers with minimal overhead and service guarantees per thread. The typical user wants extremely low latency interactive. None of these are pre-scripted batch jobs.
Now, if you wanted to develop a data warehouse for, say, technical writings, journalism, etc, where you're compiling a collection of things that can be typeset overnight, that may be doable as a batch job. However, anyone planning on publishing a journal that needs 72 terabytes of storage had best consider the marketplace a little more closely first. A publishing company, say Nature, might conceivably have use for AIX for batch work. I could see the number of submissions, referee responses and article selections per journal being such that a mainframe would be a perfectly valid way to do things. Even then, it might still be sufficiently small that a live transactional database would be more cost-effective.
Traditionally, batch processing has been a niche market for electrical and gas companies, etc, where the number of customers is staggering. Even then, it has largely been replaced with live transactional systems because customers want things adjusted NOW and not overnight or at the end of the week.
Mass mailers still use batch processing, but printing is the bottleneck and there is no point in having an expensive OS process everything in a fraction of a second on an expensive mainframe when it takes N actual real-world seconds before a printer becomes available to take the next block of data. You need run no faster than the slowest component because the end product won't be delivered any faster. You would have to have a gigantic number of printers before the OS became a significant factor and most shops just don't have that kind of printing power.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:linux is fail by jimicus · 2012-02-05 02:19 · Score: 2

There are of course non-weird problems where you get the answer from IBM support in 2-3 days, and from Linux forums in 2-3 minutes.
I really wouldn't paint Linux support in such rosy terms. Many forums are heading in the direction of the blind leading the blind; application-specific mailing lists and IRC channels, while improving, still have a slight tendency to say "RTFM n00b!". (Or, as happened to me, "Can't be done. It's a stupid demand anyway. Fuck off" - twenty minutes later I figured out how to do it on my own, so it evidently could be done...)