What's the Damage? Measuring fsck Under XFS and Ext4 On Big Storage
An anonymous reader writes "Enterprise Storage Forum's long-awaited Linux file system Fsck testing is finally complete. Find out just how bad the Linux file system scaling problem really is."
How fast a full fsck scan runs is the least of my concerns. What about how successful each one is at recovering the filesystem?
When I had some EBS problems a couple of years ago, I figured I would run xfs_check. It seemed to do absolutely nothing, even when disks in the md array were known to be bad. xfs is nice and fast, but I haven't seen xfs_check or xfs_repair do either of the things I'd assume they'd do -- check and repair. I found it easier to delete the volumes and start from scratch, because any compromised xfs filesystem seems to be totally unfixable. Is fsck for xfs new?
I do stuff Zhrodague
They're testing 70 TB of storage, so with current hard drive quality, the odds of an unrecoverable read error are probably close to 100%. It would be simpler to write a two-line fsck utility to report it:
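For what it's worth, the "close to 100%" figure can be sanity-checked with a quick calculation. This is only a rough sketch: it assumes one full read of 70 TB (decimal) and independent bit errors at a quoted URE rate. The rates below are typical datasheet values, not numbers from the article, and the function name is made up for illustration.

```python
import math


def p_at_least_one_ure(capacity_bytes: float, ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error in one full read."""
    bits = capacity_bytes * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny p does not round to zero
    return -math.expm1(bits * math.log1p(-ure_per_bit))


capacity = 70e12  # 70 TB, decimal (assumption)
for rate in (1e-14, 1e-15, 1e-16):  # assumed datasheet URE rates, errors per bit read
    print(f"{rate:.0e} errors/bit -> {p_at_least_one_ure(capacity, rate):.1%}")
```

Under those assumptions the odds come out around 99.6% at one error per 10^14 bits, but only about 43% at 10^15 and about 5% at 10^16, so the claim depends heavily on which class of drive you assume.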
This just in:
Full filesystem scans take longer as the size of the filesystem increases.
News at 11.
Honey badger don't give a fsck.
A single file system that big, without the integrity-checking features that file systems like ZFS or clustered file stores provide, seems insane to me.
I'll go tell _average joe/jane_ to go and get AIX, and dump ubuntu+unity which they like so much because it's shiny and pretty.
Not to mention the everyday low price
For justice, we must go to Don Corleone
A much better test of Linux "big data":
1) write garbage to X blocks
2) run fsck; if no errors are found, repeat step 1
How long would it take before either of these filesystems noticed a problem, and how many corrupt files would you have by then? With a real filesystem you should be able to identify and/or correct the damage before it takes out any real data.
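The two steps above can be sketched roughly as follows. This is a hypothetical harness, not anything the article ran: the function names, the 4 KiB block size, and the "nonzero exit status means the checker noticed" convention are all assumptions, and it should only ever be pointed at a scratch image file, never a live device.

```python
import os
import random
import subprocess

BLOCK_SIZE = 4096  # assumed filesystem block size


def corrupt_blocks(image_path, count, rng=random):
    """Overwrite `count` randomly chosen blocks of a scratch image with random bytes."""
    total_blocks = os.path.getsize(image_path) // BLOCK_SIZE
    victims = rng.sample(range(total_blocks), count)
    with open(image_path, "r+b") as img:
        for block in victims:
            img.seek(block * BLOCK_SIZE)
            img.write(os.urandom(BLOCK_SIZE))
    return sorted(victims)


def rounds_until_noticed(image_path, check_cmd, blocks_per_round, max_rounds=1000):
    """Corrupt, then check, until the checker exits nonzero; return the round number."""
    for round_no in range(1, max_rounds + 1):
        corrupt_blocks(image_path, blocks_per_round)
        # The image path is appended as the last argument of the checker command.
        result = subprocess.run(check_cmd + [image_path], capture_output=True)
        if result.returncode != 0:
            return round_no
    return None  # the checker never complained within max_rounds
```

Something like `rounds_until_noticed("scratch.img", ["e2fsck", "-f", "-n"], blocks_per_round=8)` would be one way to drive it against an ext4 image. Since most random blocks land in file data rather than metadata, I'd expect a metadata-only checker to take many rounds to notice anything, which is the point of the test.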
OK, so I have a large x86/64 server and want to follow your advice. Can you please tell me where you can get AIX, or HP-UX, to run on X86?
I like how you completely ignored Solaris yet still presented the comment as if it was a valid counterargument.
"The more corrupt a society, the more numerous are its laws." -Tacitus
Seconds?
When you have that much data and you need high reliability you are doing streaming replication to multiple devices and layering other backup methods as well.
Any idea what the cost of just trusting that the FSCK fixed the problems on 72TB of data your business needs could be?
A cranky coward from the shadows is not a reliable source of information.
I have used AIX and Solaris, and I can say that a lot of stuff is easier on Linux.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
What system did you end up going with?
How do you back it up?
I'll go tell _average joe/jane_ to go and get AIX, and dump ubuntu+unity which they like so much because it's shiny and pretty.
Few average Joes have 72TB of disk space, and even those who do are probably OK with 30 - 60 minutes of FSCK time. More likely, instead of hundreds of millions of files, they have a few million, so their fsck time will be in the 3 - 15 minute range.
I've seen servers that take over 3 minutes for their POST check.
They were using 15K RPM SAS drives. Your 7200 RPM drives aren't going to touch the speed of 15K RPM drives on a SAS backplane. Not by a long shot.
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
What "stuff"?
Give actual, useful comparisons.
Otherwise, your comment can be reduced to,
"I am most familiar with linux. Therefore, using linux is easier for me"
I like how you completely ignored Solaris yet still presented the comment as if it was a valid counterargument.
I also like how GP completely ignored Solaris. I just like the fact it is being ignored.
Fear is the mind killer.
http://www.enterprisestorageforum.com/print/storage-hardware/linux-file-system-fsck-testing----the-results-are-in.html
going through 3 pages is so annoying...
Not sure how AIX will help here, since it uses a similar filesystem. Also, you are comparing apples and radishes -- how does AIX compare to Ubuntu+Unity, one being a server OS and the other a desktop? In other words, are you insane?
Isn't that the point of using a filesystem that can do online scrubs, like ZFS? As far as I know, ZFS also checks metadata when scrubbing.
kill -9 $$ # does the job pretty well
ZFS has 0 FSCK time as it does not need it. If you never leave your FS in an unstable state, you won't need to worry about fixing it.
killall Anonymous\ Coward
and FSCK has 0 jail time, unlike ZFS
...until you have a drive die during a scrub, destroy a zfs filesystem in a deduplicating zpool, or do any number of other things that make ZFS **ANGRY**, that is. And despite all that, I still trust it more than most Linux filesystems.
Each pool is a LUN that is 3.6TB in size before formatting or actually 3,347,054,592 bytes as reported by "cat /proc/partitions".
a file system with about 72TB using "df -h" or 76,982,232,064 bytes from "cat /proc/partitions"
Yeah, I think there's definitely a scaling problem there.
Or perhaps a reading comprehension problem, since /proc/partitions reports sizes in 1 KiB blocks, not bytes. Either way, it doesn't inspire any kind of confidence in the rest of their testing methodology.
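To put numbers on that, here is a quick conversion of the two figures quoted from the article, treating them as counts of 1 KiB blocks. The helper name is made up for illustration.

```python
KIB = 1024  # /proc/partitions lists sizes in 1 KiB blocks


def blocks_to_sizes(blocks_1k):
    """Return (bytes, decimal TB, binary TiB) for a /proc/partitions block count."""
    size_bytes = blocks_1k * KIB
    return size_bytes, size_bytes / 10**12, size_bytes / 2**40


for label, blocks in (("single LUN", 3_347_054_592), ("md array", 76_982_232_064)):
    size_bytes, tb, tib = blocks_to_sizes(blocks)
    print(f"{label}: {size_bytes:,} bytes = {tb:.2f} TB = {tib:.2f} TiB")
```

Read that way, the array is about 78.8 TB, or about 71.7 TiB, which lines up with the "about 72TB" that "df -h" shows, since df -h uses powers of 1024. Read as bytes, the same figure would be under 80 GB.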
You see my nick?
AIX sucks more than Linux.
Usual process for "weird"* AIX Problems:
1) weird problem occurs after install. You report problem to IBM.
2) IBM asks for your software version, sees it is the newest one available, and says they will look into it.
3) You ask several months later whether they found anything. They ask for your software version, then ask you to upgrade and see if the problem goes away.
4) You upgrade to newest version.
5) go to 2)
*There are of course non-weird problems where you get the answer from IBM support in 2-3 days, and from Linux forums in 2-3 minutes.
And XFS worked great with IRIX. WTF happened to it with Linux???
Regardless of the time, it beats losing all of your data.
Why would you replace a zero-ed string with another? At least use /dev/random, bro.
Nerdy news for your nerdy needs? http://www.soylentnews.org Soylent News is people!
Whatever zLinux. Also, there is a point to tightly coupling the OS to the Hardware. Not every workload needs to be on x86 toys.
IBM said please don't use AIX, use Linux instead. That was like... 10+ years ago.
When an article about fsck has a tag line of "What's the damage", I expect to see some discussion of how fsck deals with a damaged file system.
The time required to fsck a file system that doesn't need checking is less interesting and inconsistent with the title. Although, if fsck had complained about the known clean file system, that would be interesting.
No, you're thinking of ReiserFS.
Works best if you use the "Doom as Sys Admin" hack.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I expect a /. article like this to include a summary. Like, a word about what the results actually were, without having to click through twice to get to them.
A lot of stuff is also faster on Linux, particularly on the x86. Solaris x86 is dog slow. AIX ("aches") is an appropriate name for a mainframe OS that never really got the hang of this new-fangled "interactive user" stuff. It's a good mainframe OS, that is what it is designed for, tuned for and intended for, but traditional mainframe batch transactional work isn't the sort of payload that is typically run these days. The high-end users want hard real-time (i.e.: they know to the microsecond - or nanosecond, in some cases - exactly when each process will start and stop) for data collection, data analysis and simulation. The data centers want massive multithreading for gigantic servers with minimal overhead and service guarantees per thread. The typical user wants extremely low-latency interactivity. None of these are pre-scripted batch jobs.
Now, if you wanted to develop a data warehouse for, say, technical writings, journalism, etc, where you're compiling a collection of things that can be typeset overnight, that may be doable as a batch job. However, anyone planning on publishing a journal that needs 72 terabytes of storage had best consider the marketplace a little more closely first. A publishing company, say Nature, might conceivably have use for AIX for batch work. I could see the number of submissions, referee responses and article selections per journal being such that a mainframe would be a perfectly valid way to do things. Even then, it might still be sufficiently small that a live transactional database would be more cost-effective.
Traditionally, batch processing has been a niche market for electrical and gas companies, etc, where the number of customers is staggering. Even then, it has largely been replaced with live transactional systems because customers want things adjusted NOW and not overnight or at the end of the week.
Mass mailers still use batch processing, but printing is the bottleneck and there is no point in having an expensive OS process everything in a fraction of a second on an expensive mainframe when it takes N actual real-world seconds before a printer becomes available to take the next block of data. You need run no faster than the slowest component because the end product won't be delivered any faster. You would have to have a gigantic number of printers before the OS became a significant factor and most shops just don't have that kind of printing power.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
So is this about big filesystems or lots of tiny files?
'cause they are not the same thing.
How many files is a lot? 300K? 10M? 100M?
A Pirate and a Puritan look the same on a balance sheet.
I think you're confusing AIX with S/390. AIX is IBM's Unix system, not mainframe. It handles interactive workloads just fine. Hell, S/390 does, too. Your batch processing concepts are a few decades out of date. Just sayin'.
1. Why did they put a label on the RAID devices? They should have just used /dev/sd[b-x] directly, and not confused the situation with a partition table.
2. Did they align the partitions they used to the RAID block size? They don't indicate this. If they used the default DOS disk label strategy of starting /dev/sdb1 at block 63, then their filesystem blocks were misaligned with their 128 kiB RAID block size, and one in every 32 filesystem blocks will span two disks (assuming 4 kiB filesystem blocks).
3. Why did they use md and not LVM? md can sometimes introduce bandwidth limits, and LVM lets you alternate between striped and linear volumes for your testing.
4. Why don't they report the raw bandwidth of the disk, and maybe some IOPS numbers?
5. Why don't they report total operations and bandwidth consumed as measured by iostat or sar?
6. Why didn't they give geometry hints to mkfs? The ext4 mkfs invocation, for example, should have included "-E stride=$((128 / 4)),stripe-width=$(( (10 - 2) * (128 / 4) ))".
7. What about using an external journal?
8. They report that "during the file system check the server did not swap, and no additional use of virtual memory was observed." Wouldn't it have been better to just do "swapoff -a" and report that no swap was available?
9. Why didn't they (as someone else also suggested above) test an actually damaged filesystem?
10. Is there any indication other than their credentials that these people know what they're doing?
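The arithmetic behind points 2 and 6 can be checked with a short sketch. It assumes 512-byte sectors, the 128 KiB RAID chunk and 4 KiB filesystem blocks mentioned above, and reads "(10 - 2)" as ten disks per array with two holding parity. The function names are made up for illustration.

```python
SECTOR = 512         # bytes per sector (assumption)
CHUNK = 128 * 1024   # RAID chunk size in bytes
FS_BLOCK = 4096      # filesystem block size in bytes


def straddling_fraction(start_sector, n_blocks=32_000):
    """Fraction of filesystem blocks that cross a RAID chunk boundary."""
    start = start_sector * SECTOR
    straddling = 0
    for i in range(n_blocks):
        first = start + i * FS_BLOCK
        last = first + FS_BLOCK - 1
        if first // CHUNK != last // CHUNK:
            straddling += 1
    return straddling / n_blocks


def ext4_raid_hints(chunk_kib, block_kib, total_disks, parity_disks):
    """Return (stride, stripe_width) in filesystem blocks for mkfs.ext4 -E."""
    stride = chunk_kib // block_kib
    stripe_width = (total_disks - parity_disks) * stride
    return stride, stripe_width


print(straddling_fraction(63))         # classic DOS-label start sector
print(straddling_fraction(2048))       # 1 MiB-aligned start sector
print(ext4_raid_hints(128, 4, 10, 2))  # values for -E stride=...,stripe-width=...
```

With a start at sector 63, exactly one block in 32 straddles a chunk boundary, as claimed in point 2; with a 1 MiB-aligned start none do. The hints work out to stride=32 and stripe-width=256.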
I am not sure it has much impact, but why would you use a 5-year-old Linux kernel to perform the test? Maturity is all very nice, but if you are pushing technology, it is not always the best approach.
...other file systems, such as ZFS (doesn't it work w/ Linux?), Veritas, UFS and so on?
There are of course non-weird problems where you get the answer from IBM support in 2-3 days, and from Linux forums in 2-3 minutes.
I really wouldn't paint Linux support in such rosy terms. Many forums are heading in the direction of the blind leading the blind; application-specific mailing lists and IRC channels, while improving, still have a slight tendency to say "RTFM n00b!". (Or, as happened to me, "Can't be done. It's a stupid demand anyway. Fuck off" - twenty minutes later I figured out how to do it on my own, so it evidently could be done...)
Thank goodness someone has actually posted something relatively negative about ZFS. The way many people rave about it, you'd think it was God's gift to filesystems.
Ironically, that has made me more interested in using it. My general instinct is to distrust anything that is painted as all good.
OK, so I have a large x86/64 server and want to follow your advice. Can you please tell me where you can get AIX, or HP-UX, to run on X86?
Right. Very funny how you managed to pick out the two systems that don't run on x86 out of the three. If your question was even remotely serious, there are two options for you: Solaris and FreeBSD.
"This much data"? Hello? Are you a time traveler from the 1990s who has missed a decade of storage space expansion, or simply trying to have a cheap laugh? 72TB is not "much" in this day and age. Also, fsck only deals with metadata; if you are worried about what happened to your data, the file system at hand is not adequate to your needs anyways.