The Lies Disks and Their Drivers Tell
davecb writes "Pity the poor filesystem designer: they just want to know when their data is safe, but the disks and drivers try so hard to make I/O 'easy' that it ends up being stupidly hard. Marshall Kirk McKusick writes about the difficulties in making the systems work nicely together: 'In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call. Both of these techniques lead to noticeable performance degradation, so they are often disabled, putting file systems at risk if the power fails. Systems for which both speed and reliability are important should not use ATA disks. Rather, they should use drives that implement Fibre Channel, SCSI, or SATA with support for NCQ.'"
But you lost me the moment you mentioned ATA drives.
Cheap, fast and reliable.
Pick any two.
I haven't seen a drive in at least a couple of years that didn't support NCQ. Is this really an issue? It sounds blown out of proportion.
One can't have one's cake and eat it too. It's speed or reliability; there should be more differentiation and more clarity in the specs. I want my backup disk to be very reliable, and I want my boot disk to be fast. Best performance for both, but under different circumstances.
We're talking about ATA drives?
As in non-SATA drives?
Who has those anymore?
While the article is good for publication in an academic journal like ACM, it's useless for the real world.
For that, the author should tell us whether most drives on the market have NCQ already or not. Popular drives like WD Green and Seagate's various lines.
Otherwise, saying "$A is useless without $Y" is pointless.
We shouldn't even be writing for ATA drives anymore. And any name-brand manufacturer that you would trust (on a mediocre level), such as WD or Seagate, supports NCQ.
I put my important files (pr0n, etc.) on my zfs mirror file server and scrub each week. The really important stuff (tax returns, etc.) I put in a safe deposit box at the bank.
Don't assume that "enterprise" disks do this correctly either.
Many have options to make them behave properly but out of the box have write back caches and ignore FUA or similar, leading to the same problems.
I never recommended ATA drives for servers. Really old stuff that used MFM and RLL drives was back in the era where the just anything else. I used ATA drives for my home stuff and lab where it wasn't expected to be very reliable, and SCSI was all I used for a very long time. Even today I recommend against SATA though it seems tolerable, but SCSI drives are still my standard.
Mostly I thought SCSI drives were also made better, but Seagate and WD convinced me otherwise.
And yes, MFM drives in a Novell DCB setup were among my first servers. Making NW 2.15c mount a 4 GB volume just so you can say you did it would not be fun today, but back then it was work, and clients paid for it. I'm glad it wasn't a VINES server.
deleting the extra space after periods so i can stay relevant, yeah.
Opening a file with the O_DIRECT flag seems to work quite nicely in bypassing the page cache, and getting the data to the disk in a timely fashion.
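For anyone wanting to try this from user space: O_DIRECT on its own only skips the OS cache and does not by itself ask the drive to empty its write cache, so the more common pattern is a plain write followed by fsync. Below is a minimal sketch of that pattern, assuming a POSIX system; the helper name `durable_write` is made up for illustration, and whether the drive actually honors the resulting flush is exactly what the article is complaining about.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to push it to stable storage.

    Hypothetical helper for illustration. fsync() requests a flush; a drive
    that lies about its write cache can still lose the data.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush file contents and metadata
    finally:
        os.close(fd)

    # Also fsync the containing directory so the new file name survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The directory fsync is the step people tend to forget: without it the file contents may be on disk while the directory entry pointing at them is not.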
Systems for which both speed and reliability are important should not use ATA disks.
Ok, I'll keep that in mind next time I buy ATA disks.
Implementing the NCQ specification in nonhierarchical file system can easily be accomplished by passing an FMGH array through an EMH converter, while maintaining the NCQ specification via a THGN override. All NCQ specification still conform to the YTUR standard established in 1987 at the CMSD conference in Barcelona. If that helps at all.
2) The article's point on NCQ is that many consumer drives do not implement it, so the system has to either disable the write cache on the disk or issue cache-flush requests to stay reliable. Both cost performance, so they are often turned off, which can lead to file-system failures if there is a power outage.
I think this article is saying that for the enterprise, buy enterprise drives, not consumer drives. Most consumers use laptops now, so power failure doesn't fit in, and consumers prefer speed over reliability, which is why I've always been stuck using laptops lacking ECC RAM.
The thrust of the article seems to be that desktop-market SATA drives don't support native command queuing and that means filesystems can't guarantee integrity right before a power failure. That sounds a little out-of-date to me, I thought most SATA drives supported NCQ these days. A quick unscientific skim through the top three desktop drive manufacturers suggests this is true:
Seagate website:
"Since late 2004, most new SATA drive families have supported NCQ"
Western Digital's website does not make a similar statement, but it appears that at least the "Green" and "Black" lines of desktop drives support NCQ, meaning most if not all of their popular drives.
Hitachi does not make statements on their website, but searching product descriptions shows that at least their most popular "Deskstar" line supports NCQ.
Which would suggest that only a very small population of old or ultracheap hard drives are affected.
Native Command Queueing
Because not everybody knows everything(TM)
The people who make hardware RAID know all about the lying drives, they get good information from the manufacturer on how to make the drives play nice with the RAID controller.
Just read the compatibility charts for your RAID controller, many drives have footnotes with minimum drive firmware requirements and other odd behavior.
You can avoid the need for NCQ if you use a log-structure and protect references with strong checksums. In that way you will know after a crash if say a child tree node referenced is what the referencing parent thinks it should be, and you can use double-buffering or logging to roll back to a known good state. I believe ZFS does this, as does the experimental Lithium distributed file system developed by VMware. Don't bother with NCQ.
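To make the "protect references with strong checksums" idea concrete, here is a toy sketch. It is not how ZFS or Lithium actually lay things out; the `Ref` type, the dict standing in for a disk, and the choice of SHA-256 are all assumptions made for illustration. The point is only that the parent stores a digest of the child it expects, so a torn or misplaced write shows up on the next read.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class Ref:
    """A parent's pointer to a child block: where it lives and what it should hash to."""
    block_no: int
    digest: bytes

def write_block(disk: Dict[int, bytes], block_no: int, payload: bytes) -> Ref:
    """Store payload and return the checksummed reference the parent should keep."""
    disk[block_no] = payload
    return Ref(block_no, hashlib.sha256(payload).digest())

def read_verified(disk: Dict[int, bytes], ref: Ref) -> Optional[bytes]:
    """Return the child block only if it matches what the parent recorded."""
    payload = disk.get(ref.block_no)
    if payload is None or hashlib.sha256(payload).digest() != ref.digest:
        return None  # caller falls back to the previous known-good root or log entry
    return payload

# Toy usage: the dict stands in for the disk.
disk: Dict[int, bytes] = {}
leaf_ref = write_block(disk, 7, b"leaf v1")
disk[7] = b"leaf v\x00"  # simulate a torn write after power loss
damaged = read_verified(disk, leaf_ref) is None  # the mismatch is detected
```

When verification fails, the rollback described above kicks in: you walk back to an older root whose whole tree still verifies.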
That would test and identify a drive for NCQ and cache disable/enable operation correctness that would report the model/serial and result to a central website
I think this is quite interesting.
http://yarchive.net/comp/linux/drive_caches.html
While I've often gotten the impression that the write cache opens up a large "write hole", Linus says that data is cached only for milliseconds, not held in the cache for several seconds. Still, I'd like to see battery backed caches in regular drives and/or controllers.
Would be nice to hear from some drive firmware writers.
Are you even real nerds? What's up with you Slashdot??
hey GN, don't loose your cool when you see someone play lose with grammer.
Way to go, Kirk McKusick!
And does this mean I have a shot at being a US Marshall?
Put some flash ram on the HD with its own on-board battery backup ...
ATA? Does anyone use that anymore? Hasn't the world gone to SATA, FC, or SCSI-? This seems a lot of ado about nothing...
Those web sites reviewing disk hardware never include any details about reliability. In a few cases they comment on the reliability figures in the specification, but they never actually test it. All they test is performance; some go further and test both sequential and random reads and writes. That's the best kind of review you get. No one is testing power loss.
If a disk vendor creates a reliable consumer drive it will not sell, because it will get bad reviews.
I have an Intel 330 SSD and had to manually enable NCQ. The worst part is that I've not noticed any performance changes at all.
The article is total crap, every disk supports NCQ as half the world's population has pointed out in the comments.
The problems are elsewhere: When a disk suddenly loses power while it is writing, there is a risk of various interesting errors. The disk may a) write nulls instead of the correct data, b) write garbage instead of the correct data, c) fail in the middle of a Read-Modify-Write operation and therefore destroy data in files which weren't written to at all, d) write good data to the wrong place on the disk, e) write garbage to a random spot on the disk. Sometimes you are lucky and the errors result in bad hardware checksums so you know you have lost data, at other times the wrong data gets the correct checksum.
In practice, very few desktop/notebook/whatever users will see these problems. No reviews test for these types of errors, so you cannot try to buy drives which fail in less harmful ways. If you care enough, you will use file systems with checksumming designed to catch all the above errors and more (Btrfs and ZFS come to mind). They will at least notify you that it happened, and depending on the redundancy settings they may be able to rescue the destroyed data.
Given that we are talking about Kirk McKusick an appeal to authority is entirely fair. Just because he didn't have a bunch of citations or references listed at the bottom of the article does not mean they do not exist somewhere. For you to say it is a "fallacious" appeal to authority is unfair - it has not been proven as fallacious
It's usually up to the one who makes a claim to back it up with evidence, not for others to disprove it - and they can't either, because there's no falsifiability here. If I show that my drive has NCQ that works, that still doesn't falsify his claim. I can't bloody well test every drive on the planet, so there's no way to disprove him.
So yes, this is an appeal to authority, and what you are doing is putting the onus on those who disagree to prove a negative.
He may be right, and he's certainly renowned, but to jump from there to "therefore he is right" is bunk. Even Einstein and Feynman made wrong claims. No one is immune. So some evidence would be welcome.
Probably every half decent controller card on the market for the last decade gets around this problem with a bit of memory and a battery to keep it alive. If you have a lot of disks on one system you'd probably have a controller like that anyway just to get enough SATA/SAS connections.
I can see how it's a big deal with workstations/desktops/laptops but that's really only a small chunk of storage in general.
Only? What about the advantage of a lot more SATA/SAS connections than you get on your motherboard? Also ZFS is limited in the number of platforms it is available for and BTRFS is not ready so it's a bit of a red herring throwing those in and saying that hardware RAID is not required because those exist.
This is a lot of noise for nothing. For kids and amateurs, here's a quick summary...
fsync used to be the go-to, but that was decades ago, when IDE was in full swing. Back then, there was a big hubbub about drives lying. Since then, it's been common knowledge and the status quo that fsync is not trustworthy, end of story.
Today, we have WRITE BARRIERS, and they work great. Ever since, say, the advent of 60GB IDE drives, I've never found a drive that doesn't support write barriers, and in my conversations with Theodore Tso (maintainer of EXT3/EXT4), he said as much as well. I was surprised when I started up DRBD on a test system and found the system complaining the old 40GB drive I was using for testing didn't support write barriers, so that's how long ago we're talking about this having LAST been an issue.
There's still some issue with non-journaled file systems. If you're a BSDer, you really need to disable disk cache to prevent risks of corruption with soft updates. The XFS guys recommend disabling disk cache as well, but I suspect that's just because larger RAID arrays may have entire large files cached, resulting in some individual file loss after power-outages.
Any RAID controllers will have such an option... Write-through and write-back... with advice to be sure your RAID cache's backup battery is working fine before enabling write caching.
I don't know how this is news. I've been developing RAID systems for 15 years (started with parallel SCSI-10 & 20), and none of this is new. The problem with disk write cache is you will lose your data if there's a power loss or the drive bugchecks and resets itself. This could also happen if you pull a hot-swap drive. With write cache enabled, the host has already been told that the data is saved. With cache disabled and tagging or NCQ, you still use the cache and get the advantages of optimized ordering and consolidated writes, but you don't get notification of write completion until the data actually hits the media. The latency is lower for write caching, but the maximum steady-state throughput of uncached tagged writes is about the same since the drive cache is always full and needs to be dumped to media. Some SATA drives literally used the same code for NCQ and caching with the notification time being the only difference.
Full-feature RAID systems can provide safe write caching that is protected against power loss by batteries, supercaps, and flash (or some combination of these). Enterprise-class systems also have multiple controllers and mirror the cache between them in case one controller or memory system dies. A drive could provide the same functionality by including enough flash to copy the cache into and a capacitor (or a system to convert latent rotational momentum into electricity) that will keep the electronics alive long enough to do the copying. However, this will probably never happen since SATA drives are designed and marketed for mass markets, and adding even a few cents of per-unit cost is hard to justify. For the same reasons, the mechanical parts of the SATA drives are also cheap. If you want reliability, you need to spend the extra money to get either enterprise-class SATA, SAS, SCSI, or Fibre Channel.
SATA drives also suffer from poor error detection/correction algorithms because they minimize the amount of redundant metadata to increase user data space (typically 8 bytes of CRC vs. 40 bytes of ECC). The rate of undetected errors is about the same as the rate of falsely detected errors (on the order of 1 per petabyte). If you're using SATA for nearline storage, there's a good chance of inducing errors when you save and restore the data if your sets are big enough.
Shitty consumer oriented hardware not suitable for enterprise class data integrity and retention.
If you need data integrity and cache, you need a battery backed up IO controller and UPS for a start. If you're relying on the fact that turning cache off on the drive is going to ensure that your writes complete before the power goes out to the drive, you've already set sail for fail.
Way to go, Kirk McKusick!
And does this mean I have a shot at being a US Marshall?
Almost as big a one as being a U.S. Martial.
I see even classic Slashdot is now pretty much unusable on dial up anymore.
That my 1TB / 32MB cache drive for under $100 is not up to the task of being both reliable and fast?
BtrFS is ready for serious use, there are just additional goodies planned for it.
ZFS exists in some form for all relevant server platforms -- on Linux, the kernel module is indistributable except as source[1], but installing dkms doesn't even require knowing what a compiler is. Unlike BtrFS, I wouldn't use it for production use yet (on non-Solaris non-BSD) because the kernel module is quite new, but it's there.
Both of them can do RAID better than the traditional models as they know the filesystem's layout. Also, they can store some files as JBD and some as RAID on the same filesystem.
[1]. Its license was designed to be incompatible.
The fuck is a U.S. "Martial"? That's not even a viable "Marital" aid joke.
It's not the licence that's the entire problem on linux, the great big missing chunks of functionality are the problem which makes ZFS a long way from general purpose use on that platform, which is a pity because I'm very impressed with ZFS and considering setting up my next file server as BSD or some type of solaris to use it (got to see what can handle the LSI stuff - probably all). BtrFS is no more ready for serious use than reiserfs ever was - the "additional goodies" are needed to cope with hiccups from hardware even if the file system is supposed to be perfect.
The OP AC confused the name "Marshall" with the law enforcement officer "Marshal".
There was no viable joke in their post.
That is, NCQ (for SATA) does not have enough command slots available. Only ~32 or so per port. The SAS stuff works a bit better but I think still limits you to ~32 slots per target device (instead of ~32 per port on the driver side).
So what happens if you try to use the so-called media completion feature is that write operations eat up all your tags and crowd out the far more important read operations. This makes the feature almost worthless in my view. Not to mention that there is no way to determine, with the SATA spec, that the drive actually honors the bit. Just like the old ATA stuff, drive vendors play fast and loose with the AHCI/SATA specs in order to try to force people to use the far more expensive SAS drives, even though the actual hardware is the same.
What we do in DragonFly is split available tags into read-dedicated tags and write-dedicated tags, approximately 3:1. Writes only last as long as it takes the device to copy the data to its internal ram caches, so there's no real need to reserve more than a few tags for writing. This leaves the remaining tags available for reading.
If you don't do this what happens is that you can stall your read I/O by saturating all available tags with write IOs... the writes get retired instantly to the drive's ram cache UNTIL that cache is full, then suddenly the newly issued write tags stall and sit there until the drive can flush some data out. If you don't control how many tags you use for pending writes, you can completely lock out read activity. A simple 'dd' to a file... hell, a simple file copy, for a large file, is big enough to exhibit this behavior.
This leaves even fewer tags available for writing. At best, if you want to maintain read performance, you can't really use more than ~8 or so tags for writing.
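For readers who want to see the shape of the read/write tag split, here is a rough sketch. This is not DragonFly's driver code; the `TagPool` class and its method names are invented for illustration, and the 24/8 split is just the roughly 3:1 ratio and the ~32 tags mentioned above plugged in as defaults.

```python
from typing import Dict, List, Optional

class TagPool:
    """Toy model of a per-port NCQ tag pool with separate read and write budgets."""

    def __init__(self, read_tags: int = 24, write_tags: int = 8) -> None:
        self.limits: Dict[str, int] = {"read": read_tags, "write": write_tags}
        self.in_flight: Dict[str, int] = {"read": 0, "write": 0}
        self.free: List[int] = list(range(read_tags + write_tags))
        self.owner: Dict[int, str] = {}

    def acquire(self, kind: str) -> Optional[int]:
        """Return a tag for a 'read' or 'write', or None if that budget is used up."""
        if self.in_flight[kind] >= self.limits[kind] or not self.free:
            return None  # caller queues the I/O until a tag of this kind is released
        tag = self.free.pop()
        self.owner[tag] = kind
        self.in_flight[kind] += 1
        return tag

    def release(self, tag: int) -> None:
        """Hand a tag back when the drive completes the command."""
        kind = self.owner.pop(tag)
        self.in_flight[kind] -= 1
        self.free.append(tag)

# A large sequential write can grab at most 8 tags, so reads still get through.
pool = TagPool()
write_tags = [pool.acquire("write") for _ in range(10)]  # the last two come back None
read_tag = pool.acquire("read")                          # still succeeds
```

The design choice is that a stalled write (drive cache full) only ever blocks the write budget, never the tags reserved for reads.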
For a SSD maximum performance can be achieved with even fewer tags since in this case all you are doing is soaking up the command overhead by pipe-lining the IOs. Meaning that 2 write tags is sufficient, 3 to be safe.
All of this pretty much precludes being able to use the media completion bit with AHCI/SATA and still have good performance. To really be able to make use of such a bit one needs to support ~256 tags per target... that's to the actual physical device, NOT ~256 tags to the driver or ~256 tags to the SATA/SAS controller on the host.
In DragonFly, for HAMMER1, there were numerous ordering constraints that required at least two DISK FLUSH commands per volume header sync. For HAMMER2 there are essentially no ordering constraints except for the volume header write itself, so only one DISK FLUSH is required to create a recovery point. In both cases all the writes leading up to the required demarcation point could complete in any order. When combined with read:write tag reservation, performance remains good even though the DISK FLUSH doesn't operate under NCQ.
For SATA only read and write commands can use NCQ. All other commands require serialization and cannot run concurrently with NCQ commands, including unfortunately the DISK FLUSH command. You can blame Intel for this bit of stupidity... they intentionally broke the AHCI/SATA spec in order to artificially differentiate between SATA and SAS, so drive manufacturers could pump up the prices for SAS drives (even though both the hardware and the physical attachment is exactly the same). Intel broke the AHCI chipset spec in other ways to differentiate it, particularly when it comes to error recovery. They'll tell people that it was to 'maintain compatibility with the ATA command set' but IMHO they are lying. Some of the things Intel did in the AHCI spec were just phenomenally stupid. It is still much, much better than the ATA stuff, but they had a chance to make something really robust and blew it.
For example, with AHCI/SATA error recovery requires serialization, which means that if you use a port multiplier and one drive is having problems you have to stall out I/O to ALL OTHER DRIVES while you deal with the one that is having problems.
Let's clear up some things:
* First, on NCQ. *ALL* modern SATA hard drives implement NCQ and have ~31 tags.
* Bridge chips. *NO* modern motherboard uses a bridge chip any more. Bridge chips used on devices are another matter. Some devices still use bridge chips. Many DVDs and CDs used bridge chips (which is why early SATA DVDs and CDs were so broken), though I think that is finally dying out. The most famous was one of the OCZ SSD models which used a bridge chip to tie two controllers together. The controllers could handle ~31 tags, the bridge chip could not, so the host probe would indicate no NCQ support. Also, multi-physical-interface devices such as netbooks (hard drives in a 'book' with SATA, USB, and Firewire interfaces)... those generally use bridge chips that often don't support NCQ.
* BIOS 'RAID'. So-called soft-raid. This is fake-raid. It isn't real. Don't expect it to actually work properly in a failure case. It's still talking to the AHCI controller, it's just hiding the fact from the OS. BIOS soft-raid 'controllers' are usually pretty horrible, avoidance is best.
* On data loss from caches. It isn't the caches that you need to battery-back (unless you are REALLY dependent on fsync() times in e.g. a database application). I think the trend is more towards off-host cache redundancy these days because it gets you to approximately the same place without the need for expensive gear. A large percentage of modern filesystems use write barriers and have no problem handling drive cache loss.
That isn't the problem. It's the physical power to the device being dropped while the write IO is in progress that is the problem. Devices, particularly SSDs but also many HDDs, cannot retire meta-data (for a SSD) or even the current sector (for a HDD) if a sudden loss of power occurs. In addition, a sudden power loss on a HDD can cause UNRELATED sectors to fail depending on how the HDD is writing (whether it is doing a full-track write or not). This can lead to serious corruption of the drive, even outright destruction. I've had quantum drives go through sudden power loss during a write with HUNDREDS of sectors lost instead of just one or two. That was a while ago, but it was still in the SATA-era... they were modern drives.
So for UPS/power concerns the only thing that really, really matters is that the drive remain powered for at least a second or two. Even that is no guarantee. Barring that, you want redundant storage on separate UPSes so someone kicking a plug out or crowbarring the UPS's output doesn't take you out.
* Super-caps... e.g. as Intel advertises on newer SSDs. These are primarily to retire meta-data so the SSD doesn't brick when you power it back up. Intel SSDs have very tiny ram caches so it might be able to retire those too, but most other SSDs have larger ram caches and no super-cap has enough suds to retire the entire cache. The idea is to not end up with a partially corrupt sector here, not necessarily to be able to retire the entire ram cache. Also, SSDs often do background cleanup when idle so not having any pending writes to an SSD doesn't make it safe, necessarily. This is what the super-cap idea primarily addresses.
Battery-backed ram comes under the same category. Well, in this case perhaps super-cap-backed ram (good for maybe ~a week to ~a month with low power static ram). Lots of options here that don't cost an arm and a leg, but again what matters the most is that the drive be able to retire whatever it is currently writing and not necessarily whatever is currently in its caches.
* SAS vs SATA. There is lots of talk about this all the time. I've never noticed any real difference in reliability, probably because the only real difference between the two is firmware. Drive vendors will talk-up using more robust parts but I believe that about as much as I believe that the moon is made out of cheese. There are so few components in HDDs that it is fairly difficult to differentiate consumer from enterprise these days.
http://en.wikipedia.org/wiki/Elevator_algorithm
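For anyone who hasn't followed the link, the elevator idea fits in a few lines. This is a simplified sketch, not any particular drive's or kernel's scheduler: the function name and the example block numbers are made up, and it only sweeps as far as the last pending request (the LOOK variant) rather than to the edge of the disk.

```python
from typing import Iterable, List

def elevator_order(pending: Iterable[int], head: int, direction: int = 1) -> List[int]:
    """Return the order in which an elevator-style sweep would service requests.

    pending   -- block numbers waiting to be serviced
    head      -- current head position
    direction -- +1 to sweep toward higher blocks first, -1 toward lower
    """
    blocks = sorted(pending)
    if direction > 0:
        ahead = [b for b in blocks if b >= head]          # served on the way up
        behind = [b for b in blocks if b < head][::-1]    # served on the way back down
    else:
        ahead = [b for b in blocks if b <= head][::-1]    # served on the way down
        behind = [b for b in blocks if b > head]          # served on the way back up
    return ahead + behind

# Head at block 53, sweeping upward first.
order = elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53)
# order == [65, 67, 98, 122, 124, 183, 37, 14]
```

Compared with serving requests in arrival order, the head crosses the platter once in each direction instead of bouncing back and forth.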
You must account for the hardware-side: It's a constraint over theoreticals @ the logical filesystem level ONLY, vs. the APPLIED thought above (which works @ the actual physical level rather well reducing head movements - I note others below in caching & others) to compensate for the physical machine level world of actual physical movement vs. signals only travelling @ say, 67% of the speed of light via Coax minus attenuation degradations? Well, the last of it currently is in HDD's that include mechanicals for I/O to CPU other than tracking signals such as for fanspeeds!
(However, I appreciate yours!)
However - Current journalling filesystems are adequate, such as NTFS & its Binary Seek methods @ the applied level, & within the bounds of current circular track driven filesystems @ the logical (and physical with HDD's, still predominant)...
Additionally - For both speed + tolerance related purposes!
(You sound as if you may have read up some on the only bolded portion of this below from the sounds of it)
Anyhow/anyways:
* It is applied techniques of that nature in any art &/or sciences (hopefully both in combination) that makes me realize the human race still has hope because we are capable of building some cool things that involve that level of thought, & others considerably more complex...
APK
P.S.=> That's the best part of disks lately in terms of intelligent design, imo, instead of just more or wider lanes of transfer with added signal bits... That, my friends, is APPLIED THOUGHT above... lol, it truly "elevates" the human condition!
That's a lot different, you must admit, than just throwing "more" @ a problem instead of conquering it thru more intelligent + efficient methods & designs...
Put it this way:
Lotus got 1,400 hp out of a 4 cylinder (the type of car motor we all should have)
So, imo @ least?
All "things disk" have YET to peak!
One day, We'll all have:
---
1.) Non-Flash main "True SSD" disks
2.) With a Flash backup in realtime via mirroring - to maintain state (that type of tech ought to)
3.) Using filesystems DESIGNED FOR SSD, not circular disks (excellent read on that is searching IRON FILESYSTEMS online, albeit applied to ramdisks/ramdrives be they software OR hardware)
4.) Memory path circuits all based on whatever the current state-of-the-art DDRam or whatever is mainstream + on the most maxed-out bus @ the time!
---
(Now, that's what disks ought to be... see above!)
"Hyper-Performance!"
Hope I'm around for it when (if) that design happens!
I emulate it now with 10k rpm WD Velociraptors 16mb cache buffered, 128mb ECC Promise Ex-8350 SATA II RAID Caching Controller, & driver software OS kernelmode system caching + a 4gb TRUE SSD (DDR2 Ram) Gigabyte IRAM offloading TONS of things most folks burden their disks with slowing them down (thank goodness for the elevator algorithm designer) to it (pagefile, logging, temp/tmp ops from all things, & more) on an NTFS compressed partition (like doubling RAM, except the pagefile.sys doesn't get it) offloading my WD Velociraptors cached-to-the-max)!
(Patent pending APK - let /. be my documenting the design architecture of the disk of the future for a long time in the distance, because it is doable, now - Total "haul A$$" drives on all levels, no shortcomings that outweigh the benefits (except costs mostly currently). "The Future IS Now", absolutely (only not yet implemented exactly as above))...
... apk