The Lies Disks and Their Drivers Tell
davecb writes "Pity the poor filesystem designer: they just want to know when their data is safe, but the disks and drivers try so hard to make I/O 'easy' that it ends up being stupidly hard. Marshall Kirk McKusick writes about the difficulties in making the systems work nicely together: 'In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call. Both of these techniques lead to noticeable performance degradation, so they are often disabled, putting file systems at risk if the power fails. Systems for which both speed and reliability are important should not use ATA disks. Rather, they should use drives that implement Fibre Channel, SCSI, or SATA with support for NCQ.'"
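For readers who want to see what "issue a flush after every update or fsync" looks like from the application side, here is a minimal sketch, assuming a POSIX system. The helper name `durable_write` is made up for illustration. Note that `os.fsync` only asks the kernel to push data to the device; whether the drive then honors the flush is exactly what the article says cannot be taken for granted.

```python
import os
import tempfile


def durable_write(path, data):
    """Hypothetical helper: replace `path` with `data` as durably as the OS allows."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # drain Python's userspace buffer
            os.fsync(f.fileno())   # ask the kernel to flush to the device
        os.replace(tmp_path, path)  # atomically swap in the new contents
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself reaches disk (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the drive's write cache acknowledges the flush without actually committing the data, none of this helps, which is the summary's point.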
But you lost me the moment you mentioned ATA drives.
Cheap, fast and reliable.
Pick any two.
I haven't seen a drive in at least a couple of years that didn't support NCQ. Is this really an issue? It sounds blown out of proportion.
One can't have one's cake and eat it. It's speed or reliability; there should be more differentiation and more clarity in the specs. I want my backup disk to be very reliable, and I want my boot disk to be fast. Best performance for both, but under different circumstances.
We're talking about ATA drives?
As in non-SATA drives?
Who has those anymore?
While the article is good for publication in an academic journal like ACM, it's useless for the real world.
For that, the author should tell us whether most drives on the market already have NCQ or not, especially popular drives like the WD Green and Seagate's various lines.
Otherwise, saying "$A is useless without $Y" is pointless.
We shouldn't even be writing for ATA drives anymore. And the name-brand manufacturers that you would trust (on a mediocre level), such as WD and Seagate, all support NCQ.
I put my important files (pr0n, etc.) on my zfs mirror file server and scrub each week. The really important stuff (tax returns, etc.) I put in a safe deposit box at the bank.
Don't assume that "enterprise" disks do this correctly either.
Many have options to make them behave properly but out of the box have write back caches and ignore FUA or similar, leading to the same problems.
I never recommended ATA drives for servers. Really old stuff that used MFM and RLL drives was back in the era where the just anything else. I used ATA drives for my home stuff and lab where it wasn't expected to be very reliable, and SCSI was all I used for a very long time. Even today I recommend against SATA though it seems tolerable, but SCSI drives are still my standard.
Mostly I thought SCSI drives were also made better, but Seagate and WD convinced me otherwise.
And yes, MFM drives in a Novell DCB setup were among my first servers. Making NW 2.15c mount a 4 GB volume just so you can say you did it would not be fun today, but back then it was work, and clients paid for it. I'm glad it wasn't a VINES server.
Systems for which both speed and reliability are important should not use ATA disks.
Ok, I'll keep that in mind next time I buy ATA disks.
Implementing the NCQ specification in nonhierarchical file system can easily be accomplished by passing an FMGH array through an EMH converter, while maintaining the NCQ specification via a THGN override. All NCQ specification still conform to the YTUR standard established in 1987 at the CMSD conference in Barcelona. If that helps at all.
2) The article's point on NCQ is that many consumer drives do not implement it correctly, so the system has to disable the write cache on the disk or issue cache-flush requests to stay safe. Both hurt performance, so they are often turned off, leading to possible file-system failures if there is a power outage.
I think this article is saying that for the enterprise, buy enterprise drives, not consumer drives. Most consumers use laptops now, so power failure doesn't fit in, and consumers prefer speed over reliability, which is why I've always been stuck using laptops lacking ECC RAM.
They might SAY they support it, but HOW CAN YOU REALLY TELL?
We all know that hardware LIES all the time about its ACTUAL capabilities, just READ the article!
Native Command Queueing
Because not everybody knows everything™
They do despite the people parroting his words without being able to back up the statements beyond a fallacious appeal to authority.
The people who make hardware RAID know all about the lying drives, they get good information from the manufacturer on how to make the drives play nice with the RAID controller.
Just read the compatibility charts for your RAID controller, many drives have footnotes with minimum drive firmware requirements and other odd behavior.
You can avoid the need for NCQ if you use a log-structure and protect references with strong checksums. In that way you will know after a crash if say a child tree node referenced is what the referencing parent thinks it should be, and you can use double-buffering or logging to roll back to a known good state. I believe ZFS does this, as does the experimental Lithium distributed file system developed by VMware. Don't bother with NCQ.
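A toy sketch of the idea in the comment above: append-only blocks, parent references that carry a strong checksum of the child, and a rollback to the last root that still verifies. This is not how ZFS or Lithium is actually implemented; `BlockStore` and `recover` are hypothetical names, and the "disk" is just a dict.

```python
import hashlib


class BlockStore:
    """Toy in-memory 'disk' (hypothetical): block address -> payload bytes."""

    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def append(self, payload):
        """Log-structured write: never overwrite, always use a fresh address."""
        addr = self.next_addr
        self.blocks[addr] = payload
        self.next_addr += 1
        # The caller stores this digest next to the address in the parent.
        return addr, hashlib.sha256(payload).hexdigest()

    def read_verified(self, addr, expected_digest):
        """Return the block only if it matches what the parent expects."""
        payload = self.blocks.get(addr)
        if payload is None:
            raise IOError("block %d missing" % addr)
        if hashlib.sha256(payload).hexdigest() != expected_digest:
            raise IOError("block %d failed checksum" % addr)
        return payload


def recover(store, roots):
    """Try root references newest-first; fall back to an older known-good one.

    `roots` is a list of (generation, addr, digest) tuples.
    """
    for gen, addr, digest in sorted(roots, key=lambda r: r[0], reverse=True):
        try:
            return gen, store.read_verified(addr, digest)
        except IOError:
            continue
    raise IOError("no valid root found")
```

After a crash that mangles the newest block, `recover` skips it because the digest no longer matches and returns the previous generation instead.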
That would test and identify a drive for NCQ and cache disable/enable operation correctness that would report the model/serial and result to a central website
Except on the many Linux versions where O_DIRECT doesn't work properly. I have kernels where it works as expected; ones where it quietly fails to sync to disk; and ones where using it causes a PANIC. It's never been a priority for that API to function correctly given that Linus thinks direct IO is totally braindamaged.
I think this is quite interesting.
http://yarchive.net/comp/linux/drive_caches.html
While I've often gotten the impression that the write cache opens up a large "write hole", Linus says that data is cached only for milliseconds, not held in the cache for several seconds. Still, I'd like to see battery backed caches in regular drives and/or controllers.
Would be nice to hear from some drive firmware writers.
hey GN, don't loose your cool when you see someone play lose with grammer.
Put some flash ram on the HD with its own on-board battery backup ...
ATA? Does anyone use that anymore? Hasn't the world gone to SATA, FC, or SCSI-? This seems a lot of ado about nothing...
The article is total crap, every disk supports NCQ as half the world's population has pointed out in the comments.
The problems are elsewhere: When a disk suddenly loses power while it is writing, there is a risk of various interesting errors. The disk may a) write nulls instead of the correct data, b) write garbage instead of the correct data, c) fail in the middle of a Read-Modify-Write operation and therefore destroy data in files which weren't written to at all, d) write good data to the wrong place on the disk, e) write garbage to a random spot on the disk. Sometimes you are lucky and the errors result in bad hardware checksums so you know you have lost data, at other times the wrong data gets the correct checksum.
In practice, very few desktop/notebook/whatever users will see these problems. No reviews test for these types of errors, so you cannot try to buy drives which fail in less harmful ways. If you care enough, you will use file systems with checksumming designed to catch all the above errors and more (Btrfs and ZFS come to mind). They will at least notify you that it happened, and depending on the redundancy settings they may be able to rescue the destroyed data.
Given that we are talking about Kirk McKusick, an appeal to authority is entirely fair. Just because he didn't have a bunch of citations or references listed at the bottom of the article does not mean they do not exist somewhere. For you to say it is a "fallacious" appeal to authority is unfair - it has not been proven to be fallacious.
It's usually up to the one who makes a claim to back it up with evidence, not for others to disprove it - and they can't either, because there's no falsifiability here. If I show that my drive has NCQ that works, that still doesn't falsify his claim. I can't bloody well test every drive on the planet, so there's no way to disprove him.
So yes, this is an appeal to authority, and what you are doing is putting the onus on those who disagree to prove a negative.
He may be right, and he's certainly renowned, but to jump from there to "therefore he is right" is bunk. Even Einstein and Feynman made wrong claims. No one is immune. So some evidence would be welcome.
Probably every half decent controller card on the market for the last decade gets around this problem with a bit of memory and a battery to keep it alive. If you have a lot of disks on one system you'd probably have a controller like that anyway just to get enough SATA/SAS connections.
I can see how it's a big deal with workstations/desktops/laptops but that's really only a small chunk of storage in general.
Only? What about the advantage of a lot more SATA/SAS connections than you get on your motherboard? Also ZFS is limited in the number of platforms it is available for and BTRFS is not ready so it's a bit of a red herring throwing those in and saying that hardware RAID is not required because those exist.
This is a lot of noise for nothing. For kids and amateurs, here's a quick summary...
fsync used to be the go-to, but that was decades ago, when IDE was in full swing. Back then, there was a big hubbub about drives lying. Since then, it's been common knowledge and the status quo that fsync is not trustworthy, end of story.
Today, we have WRITE BARRIERS, and they work great. Ever since, say, the advent of 60GB IDE drives, I've never found a drive that doesn't support write barriers, and in my conversations with Theodore Tso (maintainer of EXT3/EXT4), he said as much as well. I was surprised when I started up DRBD on a test system and found the system complaining the old 40GB drive I was using for testing didn't support write barriers, so that's how long ago we're talking about this having LAST been an issue.
There's still some issue with non-journaled file systems. If you're a BSDer, you really need to disable disk cache to prevent risks of corruption with soft updates. The XFS guys recommend disabling disk cache as well, but I suspect that's just because larger RAID arrays may have entire large files cached, resulting in some individual file loss after power-outages.
Any RAID controllers will have such an option... Write-through and write-back... with advice to be sure your RAID cache's backup battery is working fine before enabling write caching.
Shitty consumer oriented hardware not suitable for enterprise class data integrity and retention.
If you need data integrity and cache, you need a battery backed up IO controller and UPS for a start. If you're relying on the fact that turning cache off on the drive is going to ensure that your writes complete before the power goes out to the drive, you've already set sail for fail.
Way to go, Kirk McKusick!
And does this mean I have a shot at being a US Marshall?
Almost as big a one as being a U.S. Martial.
I see even classic Slashdot is now pretty much unusable on dial up anymore.
That my 1TB / 32MB cache drive for under $100 is not up to the task of being both reliable and fast?
BtrFS is ready for serious use, there are just additional goodies planned for it.
ZFS exists in some form for all relevant server platforms -- on Linux, the kernel module is indistributable except as source[1], but installing dkms doesn't even require knowing what a compiler is. Unlike BtrFS, I wouldn't use it for production use yet (on non-Solaris non-BSD) because the kernel module is quite new, but it's there.
Both of them can do RAID better than the traditional models as they know the filesystem's layout. Also, they can store some files as JBOD and some as RAID on the same filesystem.
[1]. Its license was designed to be incompatible.
It's not the licence that's the entire problem on Linux. The great big missing chunks of functionality are the problem, and they leave ZFS a long way from general-purpose use on that platform. That is a pity, because I'm very impressed with ZFS and am considering setting up my next file server on BSD or some type of Solaris to use it (I've got to see what can handle the LSI stuff - probably all). BtrFS is no more ready for serious use than reiserfs ever was - the "additional goodies" are needed to cope with hiccups from hardware even if the file system is supposed to be perfect.
The OP AC confused the name "Marshall" with the law enforcement officer "Marshal".
There was no viable joke in their post.
That is, NCQ (for SATA) does not have enough command slots available. Only ~32 or so per port. The SAS stuff works a bit better but I think still limits you to ~32 slots per target device (instead of ~32 per port on the driver side).
So what happens if you try to use the so-called media completion feature is that write operations eat up all your tags and crowd out the far more important read operations. This makes the feature almost worthless in my view. Not to mention that there is no way to determine, with the SATA spec, that the drive actually honors the bit. Just like the old ATA stuff, drive vendors play fast and loose with the AHCI/SATA specs in order to try to force people to use the far more expensive SAS drives, even though the actual hardware is the same.
What we do in DragonFly is split available tags into read-dedicated tags and write-dedicated tags, approximately 3:1. Writes only last as long as it takes the device to copy the data to its internal ram caches, so there's no real need to reserve more than a few tags for writing. This leaves the remaining tags available for reading.
If you don't do this what happens is that you can stall your read I/O by saturating all available tags with write IOs... the writes get retired instantly to the drive's ram cache UNTIL that cache is full, then suddenly the newly issued write tags stall and sit there until the drive can flush some data out. If you don't control how many tags you use for pending writes, you can completely lock out read activity. A simple 'dd' to a file... hell, a simple file copy, for a large file, is big enough to exhibit this behavior.
This leaves even fewer tags available for writing. At best, if you want to maintain read performance, you can't really use more than ~8 or so tags for writing.
For an SSD, maximum performance can be achieved with even fewer tags, since in this case all you are doing is soaking up the command overhead by pipelining the IOs. That means 2 write tags are sufficient, 3 to be safe.
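To make the read/write tag reservation concrete, here is a rough sketch of the bookkeeping, assuming 32 tags split 3:1 (24 for reads, 8 for writes). This is not DragonFly's actual driver code; `TagPool` and its methods are hypothetical names used only to show why capping write tags keeps reads from being locked out.

```python
class TagPool:
    """Hypothetical per-device tag accounting with a hard cap on write tags."""

    def __init__(self, total_tags=32, write_tags=8):
        if not 0 < write_tags < total_tags:
            raise ValueError("write_tags must be between 1 and total_tags - 1")
        # Roughly 3:1 read:write with the defaults.
        self.read_free = total_tags - write_tags
        self.write_free = write_tags

    def acquire(self, is_write):
        """Return True if a tag was reserved, False if the caller must queue."""
        if is_write:
            if self.write_free == 0:
                return False  # writes wait; read tags are never borrowed
            self.write_free -= 1
        else:
            if self.read_free == 0:
                return False
            self.read_free -= 1
        return True

    def release(self, is_write):
        """Give a tag back to its own pool when the command completes."""
        if is_write:
            self.write_free += 1
        else:
            self.read_free += 1
```

With this split, a large file copy can stall at most the 8 write tags once the drive's RAM cache fills, and the 24 read tags stay available.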
All of this pretty much precludes being able to use the media completion bit with AHCI/SATA and still have good performance. To really be able to make use of such a bit one needs to support ~256 tags per target... that's to the actual physical device, NOT ~256 tags to the driver or ~256 tags to the SATA/SAS controller on the host.
In DragonFly, for HAMMER1, there were numerous ordering constraints that required at least two DISK FLUSH commands per volume header sync. For HAMMER2 there are essentially no ordering constraints except for the volume header write itself, so only one DISK FLUSH is required to create a recovery point. In both cases all the writes leading up to the required demarcation point can complete in any order. When combined with read:write tag reservation, performance remains good even though the DISK FLUSH itself cannot be issued as an NCQ command.
For SATA, only read and write commands can use NCQ. All other commands require serialization and cannot run concurrently with NCQ commands, including, unfortunately, the DISK FLUSH command. You can blame Intel for this bit of stupidity... they intentionally broke the AHCI/SATA spec in order to artificially differentiate between SATA and SAS, so drive manufacturers could pump up the prices for SAS drives (even though both the hardware and the physical attachment are exactly the same). Intel broke the AHCI chipset spec in other ways to differentiate it, particularly when it comes to error recovery. They'll tell people that it was to 'maintain compatibility with the ATA command set' but IMHO they are lying. Some of the things Intel did in the AHCI spec were just phenomenally stupid. It is still much, much better than the ATA stuff, but they had a chance to make something really robust and blew it.
For example, with AHCI/SATA error recovery requires serialization, which means that if you use a port multiplier and one drive is having problems you have to stall out I/O to ALL OTHER DRIVES while you deal with the one that is having problems.
Let's clear up some things:
* First, on NCQ. *ALL* modern SATA hard drives implement NCQ and have ~31 tags.
* Bridge chips. *NO* modern motherboard uses a bridge chip any more. Bridge chips used on devices are another matter. Some devices still use bridge chips. Many DVDs and CDs used bridge chips (which is why early SATA DVDs and CDs were so broken), though I think that is finally dying out. The most famous was one of the OCZ SSD models which used a bridge chip to tie two controllers together. The controllers could handle ~31 tags, the bridge chip could not, so the host probe would indicate no NCQ support. Also, multi-physical-interface devices such as netbooks (hard drives in a 'book' with SATA, USB, and Firewire interfaces)... those generally use bridge chips that often don't support NCQ.
* BIOS 'RAID'. So-called soft-raid. This is fake-raid. It isn't real. Don't expect it to actually work properly in a failure case. It's still talking to the AHCI controller, it's just hiding the fact from the OS. BIOS soft-raid 'controllers' are usually pretty horrible, avoidance is best.
* On data loss from caches. It isn't the caches that you need to battery-back (unless you are REALLY dependent on fsync() times in e.g. a database application). I think the trend is more towards off-host cache redundancy these days because it gets you to approximately the same place without the need for expensive gear. A large percentage of modern filesystems use write barriers and have no problem handling drive cache loss.
That isn't the problem. It's the physical power to the device being dropped while the write IO is in progress that is the problem. Devices, particularly SSDs but also many HDDs, cannot retire meta-data (for a SSD) or even the current sector (for a HDD) if a sudden loss of power occurs. In addition, a sudden power loss on a HDD can cause UNRELATED sectors to fail depending on how the HDD is writing (whether it is doing a full-track write or not). This can lead to serious corruption of the drive, even outright destruction. I've had quantum drives go through sudden power loss during a write with HUNDREDS of sectors lost instead of just one or two. That was a while ago, but it was still in the SATA-era... they were modern drives.
So for UPS/power concerns the only thing that really, really matters is that the drive remain powered for at least a second or two. Even that is no guarantee. Barring that, you want redundant storage on separate UPSes so someone kicking a plug out or crowbarring the UPS's output doesn't take you out.
* Super-caps... e.g. as Intel advertises on newer SSDs. These are primarily to retire meta-data so the SSD doesn't brick when you power it back up. Intel SSDs have very tiny ram caches so it might be able to retire those too, but most other SSDs have larger ram caches and no super-cap has enough suds to retire the entire cache. The idea is to not end up with a partially corrupt sector here, not necessarily to be able to retire the entire ram cache. Also, SSDs often do background cleanup when idle so not having any pending writes to an SSD doesn't make it safe, necessarily. This is what the super-cap idea primarily addresses.
Battery-backed ram comes under the same category. Well, in this case perhaps super-cap-backed ram (good for maybe ~a week to ~a month with low power static ram). Lots of options here that don't cost an arm and a leg, but again what matters the most is that the drive be able to retire whatever it is currently writing and not necessarily whatever is currently in its caches.
* SAS vs SATA. There is lots of talk about this all the time. I've never noticed any real difference in reliability, probably because the only real difference between the two is firmware. Drive vendors will talk-up using more robust parts but I believe that about as much as I believe that the moon is made out of cheese. There are so few components in HDDs that it is fairly difficult to differentiate consumer from enterprise these days.