Slashdot Mirror


The Lies Disks and Their Drivers Tell

davecb writes "Pity the poor filesystem designer: they just want to know when their data is safe, but the disks and drivers try so hard to make I/O 'easy' that it ends up being stupidly hard. Marshall Kirk McKusick writes about the difficulties in making the systems work nicely together: 'In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call. Both of these techniques lead to noticeable performance degradation, so they are often disabled, putting file systems at risk if the power fails. Systems for which both speed and reliability are important should not use ATA disks. Rather, they should use drives that implement Fibre Channel, SCSI, or SATA with support for NCQ.'"

33 of 192 comments

  1. almost clicked the link... by adturner · · Score: 5, Funny

    But you lost me the moment you mentioned ATA drives.

    1. Re:almost clicked the link... by Lunix+Nutcase · · Score: 5, Insightful

      And yet fails to name any. Looking at Seagate's site about NCQ, pretty much every consumer model since 2004 has NCQ. This seems overblown.

    2. Re:almost clicked the link... by h4rr4r · · Score: 3, Interesting

      I still bet those drives if you pull power on them will lose the data in their onboard caches.

      Which means they are lying about fsync.

    3. Re:almost clicked the link... by Lunix+Nutcase · · Score: 2, Insightful

      As weighty an argument as your bet might seem to you, I'd prefer actual evidence.

    4. Re:almost clicked the link... by MikeBabcock · · Score: 2

      If so, the article should link a proper study or basic attempt at surveying drives and how well they survive such behaviour instead of surmising.

      --
      - Michael T. Babcock (Yes, I blog)
    5. Re:almost clicked the link... by Lunix+Nutcase · · Score: 2

      Ok. All my drives, which are all at least 4-5 years old, support it and they are all the same models that Seagate lists support for. So once again, this sounds like overinflated sensationalism. If it was really such a problem he could have listed a few models to support his claim instead of nebulous handwaving, no?

    6. Re:almost clicked the link... by Lunix+Nutcase · · Score: 2

      Fallacious appeal to authority. I know who he is, yet if it was as common as he claims he could do better than nebulous handwaving.

    7. Re:almost clicked the link... by TheGratefulNet · · Score: 4, Informative

      yeah, well, I have quite a bit of experience with samsung (not seagate branded but the older samsungs) drives.

      they REPORTED having ncq but you always had to disable it.

      I got so that I do this at bootup:

      # a queue depth of 1 effectively turns NCQ off for sda (needs root)
      if [ -w /sys/block/sda/device/queue_depth ] ; then
            echo 1 > /sys/block/sda/device/queue_depth && echo " sda NCQ now off"
      fi

      and so on.
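
      A minimal sketch of the "and so on" part, assuming the drives show up as sd* under /sys/block. The function name disable_ncq and its optional directory argument are hypothetical, added only so the loop can be tried against a scratch directory first; writing to the real files needs root.

```shell
# disable_ncq [SYSFS_BLOCK_DIR]
# Hypothetical helper: sets queue_depth to 1 for every sd* device found
# under the given directory (default: /sys/block).
disable_ncq() {
    root="${1:-/sys/block}"
    for qd in "$root"/sd*/device/queue_depth; do
        # Skip an unmatched glob and files we are not allowed to write.
        [ -w "$qd" ] || continue
        dev="${qd#"$root"/}"      # strip the leading directory
        dev="${dev%%/*}"          # keep only the device name, e.g. sda
        if echo 1 > "$qd"; then
            echo " $dev NCQ now off"
        fi
    done
}

# Usage (as root, on a real system):
#   disable_ncq
```

      Looping over the glob avoids repeating the same block once per drive, and devices that do not expose queue_depth are simply skipped.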

      performance does not suffer (that I would care about) BUT the data reliability more than made up for it. no more timeouts, no more syslog 'scaries'.

      vendors really do fuck up the protocol implementations. seagate is 'strange' in ways, so is WD, so is hitachi and ibm (I know they are not even in the biz anymore, at least for consumer drives).

      windows has a 'blacklist' of what things to not use when talking to drives and so does linux. it's a fact of life.

      drive vendors are borderline idiots. sad but true ;(

      --
      "It is now safe to switch off your computer."
    8. Re:almost clicked the link... by TheGratefulNet · · Score: 3, Informative

      you'll see it in syslog!

      timeouts, retries, even exiting the bus and doing full bus resets (which are slow and you'll NOT miss them).

      as I posted before, older (5yr) samsungs were notorious for SAYING they support ncq but you would be foolish to let it just negotiate it and use it.

      this was how things were in the very early days of 10/100 ethernet and full/half duplex. yes, the early models 'negotiated' duplex but many of them got it wrong and you'd have to manually set this on hubs/switches since you knew better than the equipment. there were even early NIC chips that worked better at 10meg ethernet than 100baseT! we would do ftp transfer tests and quite often a GOOD 10baseT was more reliable (over time) than 100baseT. the same happened to gig-e, too, in the early years.

      --
      "It is now safe to switch off your computer."
    9. Re:almost clicked the link... by Eponymous+Hero · · Score: 5, Informative
      you didn't bother to RTFA, good for you. it says quite plainly that (only part of) the problem is not drives that don't support ncq, but those drives that have it and disable it. and that was a relatively small portion of TFA. here's how the disks lie:

      File systems need to be aware of the change to the underlying media and ensure that they adapt by always writing in multiples of the larger sector size. Historically, file systems were organized to store files smaller than 512 bytes in a single sector. With the change in disk technology, most file systems have avoided the slowdown of 512-byte writes by making 4,096 bytes the smallest allocation size. Thus, a file smaller than 512 bytes is now placed in a 4,096-byte block. The result of this change is that it takes up to eight times as much space to store a file system with predominantly small files. Since the average file size has been growing over the years, for a typical file system the switch to making 4,096 bytes the minimum allocation size has resulted in a 10- to 15-percent increase in required storage.
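
      The arithmetic behind "up to eight times as much space" can be sketched in shell. The function name allocated_bytes is made up for this illustration, and the model only rounds file data up to whole blocks; it ignores inodes and any other per-file metadata.

```shell
# allocated_bytes FILE_SIZE BLOCK_SIZE
# Rounds a file size up to a whole number of allocation blocks.
allocated_bytes() {
    size=$1
    block=$2
    echo $(( (size + block - 1) / block * block ))
}

# A 300-byte file under the old and the new minimum allocation size.
old=$(allocated_bytes 300 512)
new=$(allocated_bytes 300 4096)
echo "512-byte sectors: $old bytes"
echo "4096-byte blocks: $new bytes ($(( new / old ))x)"
```

      The factor of eight is the worst case and only applies to files of 512 bytes or less; a file just over 4,096 bytes wastes almost a whole second block but far less than eight times its size.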

      just to clarify what the author's point was:

      The conclusion is that file systems need to be aware of the disk technology on which they are running to ensure that they can reliably deliver the semantics that they have promised. Users need to be aware of the constraints that different disk technology places on file systems and select a technology that will not result in poor performance for the type of file-system workload they will be using. Perhaps going forward they should just eschew those lying disks and switch to using flash-memory technology—unless, of course, the flash storage starts using the same cost-cutting tricks.

      if you want to argue that, great, go nuts. nobody who actually RTFA thinks the argument is really about ncq. the ac you responded to said

      the way I interpret TFA, the problem also applies to SATA drives which do not implement the NCQ specification.

      well, here's what TFA actually said:

      Luckily, SATA (serial ATA) has a new definition called NCQ (Native Command Queueing) that has a bit in the write command that tells the drive if it should report completion when media has been written or when cache has been hit. If the driver correctly sets this bit, then the disk will display the correct behavior.

      In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call. Both of these techniques lead to noticeable performance degradation, so they are often disabled, putting file systems at risk if the power fails. Systems for which both speed and reliability are important should not use ATA disks. Rather, they should use drives that implement Fibre Channel, SCSI, or SATA with support for NCQ
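
      For readers who have not met the "flush after every update" pattern, here is a rough userland sketch. The helper name durable_write is hypothetical, and it assumes a dd that supports conv=fsync (GNU dd does). It only asks the kernel to flush; whether the drive then really commits its cache to the media is the very thing TFA is questioning.

```shell
# durable_write DEST
# Hypothetical helper: copies stdin to DEST and requests a flush before
# returning, using the write-temp-file-then-rename pattern.
durable_write() {
    dest=$1
    tmp="$dest.tmp.$$"
    # conv=fsync makes dd call fsync() on the output file before it exits.
    dd of="$tmp" conv=fsync 2>/dev/null || return 1
    # Rename over the destination so readers never see a half-written file.
    mv "$tmp" "$dest" || return 1
    # Flush the directory entry as well. With GNU coreutils 8.24 or later
    # this syncs just that directory; older versions do a full sync.
    sync "$(dirname "$dest")"
}

# Usage:
#   printf 'journal entry\n' | durable_write /var/tmp/journal.log
```

      Every call pays for at least one flush, which is exactly the performance cost the quoted paragraph says tempts people into switching the protection off.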

      i hope it's painfully obvious by now that the point about ncq is not that some drives don't have it; it's that some don't use it -- mostly so you don't go giving their drives bad reviews for being slow but unnoticeably reliable. if it's disabled, you can enable it. what sata drives don't have ncq? i asked wikipedia:

      SATA revision 1.0 (SATA 1.5 Gbit/s) .... During the initial period after SATA 1.5 Gbit/s finalization, adapter and drive manufacturers used a "bridge chip" to convert existing PATA designs for use with the SATA interface. Bridged drives have a SATA connector, may include either or both kinds of power connectors, and, in general, perform identically to their PATA equivalents. Most lack support for some SATA-specific features such as NCQ. Native SATA products quickly eclipsed bridged products with the introduction of the second generation of SATA drives.

      so yeah, probably not a whole lot of these drives being sold new, but there are lots of shops that buy used gear because it's cheap. these older sata drives haven't all just disappeared when revision 2.0 came out.

      --
      insensitive clod overlords obligatory xkcd car analogy russian reversals whoosh pedant fanbois ftfy in 3...2...1..PROFIT
    10. Re:almost clicked the link... by greg1104 · · Score: 2

      Intel's early SSDs such as the Intel X25-E were the last time I really got screwed by SATA drives that screwed this up very badly. See the PostgreSQL page on Reliable Writes for a lot more details on this subject.

    11. Re:almost clicked the link... by hoggoth · · Score: 5, Insightful

      LOSE LOSE LOSE LOSE! YOU WILL LOSE DATA!

      Sorry... I'm usually a calm rational person. I almost never become a grammar-nazi, spelling nazi, or troll. It's just that I see this so often I'm afraid one day Webster will just give up and switch the definitions of Lose and Loose.

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    12. Re:almost clicked the link... by anomaly256 · · Score: 4, Informative

      Green drives from Seagate do not appear to have NCQ. As per below, I have 1 normal and 4 greens in this box:

      ~$ cat /sys/block/sd?/device/queue_depth
      31
      1
      1
      1
      1

      ~$ cat /sys/block/sd?/device/queue_type
      simple
      none
      none
      none
      none
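
      The same check can be wrapped in a small loop that labels each device. This is only a sketch: ncq_report is a made-up name and the optional directory argument exists just for testing. A depth of 1 does not by itself prove that the drive lacks NCQ; a controller running in legacy IDE mode instead of AHCI, or a depth that was lowered by hand, reads the same way.

```shell
# ncq_report [SYSFS_BLOCK_DIR]
# Hypothetical helper: prints one line per sd* device with its queue depth.
ncq_report() {
    root="${1:-/sys/block}"
    for qd in "$root"/sd*/device/queue_depth; do
        [ -r "$qd" ] || continue          # glob did not match anything
        dev="${qd#"$root"/}"
        dev="${dev%%/*}"
        depth=$(cat "$qd")
        if [ "$depth" -gt 1 ]; then
            echo "$dev: queue_depth=$depth (commands are being queued)"
        else
            echo "$dev: queue_depth=$depth (no queueing in effect)"
        fi
    done
}

ncq_report
```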

    13. Re:almost clicked the link... by ak3ldama · · Score: 2

      Given that we are talking about Kirk McKusick an appeal to authority is entirely fair. Just because he didn't have a bunch of citations or references listed at the bottom of the article does not mean they do not exist somewhere. For you to say it is a "fallacious" appeal to authority is unfair - it has not been proven as fallacious. (You assert it to be fallacious due to a lack of reference... the culture created by Wikipedia and all the "[Citation Needed]" slackers never fails to impress me.) Surely there exist blacklists in the source of Linux/FreeBSD/other publicly viewable code; I also will not hold your hand and show you where.

      I have personally seen these kinds of issues (with writes not happening soon enough and fsync calls introduced for data integrity) with flash media which is something mentioned in the beginning of article. I would like to further comment that the article talked about other things such as sector size side effects and the impact on useful space. ++Great article. Does anyone else remember how he (Kirk McK.) used to sell shirts and pc stickers? I still have the bsd daemon logo sticker on the case of my first pc.

      --
      "but money is the God of Algiers & Mahomet their prophet." - Rich. O'Bryen June 8th 1786
    14. Re:almost clicked the link... by hoggoth · · Score: 4, Funny

      "and then..."

      and then all hell will break lose, obviously.

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    15. Re:almost clicked the link... by causality · · Score: 2

      Intel's early SSDs such as the Intel X25-E were the last time I really got screwed by SATA drives that screwed this up very badly. See the PostgreSQL page on Reliable Writes for a lot more details on this subject.

      This is why I am never an early adopter. If there were some tremendous emergency that only an early SSD could solve, and life-and-limb were on the line, I suppose I would take my chances. But I've never had that much of a need for an SSD.

      I suppose I have pioneers like you to thank, however, for helping to identify and work out the problems so that people like me who wait a little while have such a good experience. It's like volunteer work, except of course that you had to pay in order to do it.

      --
      It is a miracle that curiosity survives formal education. - Einstein
  2. 2 out of 3 by ardmhacha · · Score: 4, Insightful

    Cheap, fast and reliable.

    Pick any two.

  3. Sorry, what? by Compaqt · · Score: 3, Insightful

    We're talking about ATA drives?

    As in non-SATA drives?

    Who has those anymore?

    While the article is good for publication in an academic journal like ACM, it's useless for the real world.

    For that, the author should tell us whether most drives on the market have NCQ already or not. Popular drives like WD Green and Seagate's various lines.

    Otherwise, saying "$A is useless without $Y" is pointless.

    --
    I'm not a lawyer, but I play one on the Internet. Blog
    1. Re:Sorry, what? by MikeBabcock · · Score: 2

      An SATA drive is a subset of ATA drives. You're thinking of PATA or IDE drives.

      http://en.wikipedia.org/wiki/Serial_ATA

      In other words, when someone says "ATA drives" they aren't exclusively talking about non-SATA drives.

      --
      - Michael T. Babcock (Yes, I blog)
  4. ATA drives...? WTF by poet · · Score: 3

    We shouldn't even be writing for ATA drives anymore. And any name brand manufacturer that you would trust (on a mediocre level) WD, Seagate etc... all support NCQ.

    --
    Get your PostgreSQL here: http://www.commandprompt.com/
    1. Re:ATA drives...? WTF by FranTaylor · · Score: 2

      Are you saying we should cast the ATA driver out of the kernel and dispose of all our ATA hardware?

      Even though it's not in new hardware any more, we still need to support it in existing hardware. The driver still needs work when the kernel APIs change.

  5. I work in the storage industry. by Anonymous Coward · · Score: 3, Informative

    Don't assume that "enterprise" disks do this correctly either.

    Many have options to make them behave properly but out of the box have write back caches and ignore FUA or similar, leading to the same problems.

  6. Duh by rickb928 · · Score: 2

    I never recommended ATA drives for servers. Really old stuff that used MFM and RLL drives was back in the era where there just wasn't anything else. I used ATA drives for my home stuff and lab where it wasn't expected to be very reliable, and SCSI was all I used for a very long time. Even today I recommend against SATA though it seems tolerable, but SCSI drives are still my standard.

    Mostly I thought SCSI drives were also made better, but Seagate and WD convinced me otherwise.

    And yes, MFM drives in a Novell DCB setup were among my first servers. Making NW 2.15c mount a 4 GB volume just so you can say you did it would not be fun today, but back then it was work, and clients paid for it. I'm glad it wasn't a VINES server.

    --
    deleting the extra space after periods so i can stay relevant, yeah.
    1. Re:Duh by petermgreen · · Score: 2

      And google is not your average company.

      Google has a LOT of servers running much the same workloads. As such it makes sense for them to put in the software engineering effort to achieve higher level redundancy. They engineer things so they don't have to care if a server dies.

      Most companies have a relatively small number of servers each with a particular task. If one of those servers fails it's a much bigger deal that can mean significant downtime and/or data loss. IIRC restoring a big database from backup and then replaying logs onto it is not a fast process.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  7. Not about ATA, about enterprise data storage by MSTCrow5429 · · Score: 4, Informative
    1) This article isn't about ATA, ignore it.

    2) The article's point on NCQ is that many consumer drives do not implement it, so the system has to disable the write cache on the disk or issue cache-flush requests to stay reliable. Those workarounds cost performance, so they are often switched off, leading to possible file-system failures if there is a power outage.

    I think this article is saying that for the enterprise, buy enterprise drives, not consumer drives. Most consumers use laptops now, so power failure doesn't fit in, and consumers prefer speed over reliability, which is why I've always been stuck using laptops lacking ECC RAM.

    --
    Slashdot: Playing Favorites Since 1997
    1. Re:Not about ATA, about enterprise data storage by MSTCrow5429 · · Score: 2

      In Windows 7's Device Manager, there is a Policies tab, allowing you to "Enable write caching on the device" and additionally to "Turn off Windows write-cache buffer flushing on the device." The former warns "a power outage or equipment failure might result in data loss or corruption." The latter states "do not select this check box unless the device has a separate power supply that allows the device to flush its buffer in case of power failure." In Windows 7, by default, write caching is on and the "turn off buffer flushing" box is unchecked. It does note that not all drives allow you to change these settings, possibly indicating that the article's author recommends any modern drive that allows one to manually choose reliability over performance. The major issue with both is that data may reside in primary memory without having been written to the drive; if there's a power failure, your data disappears.

      --
      Slashdot: Playing Favorites Since 1997
    2. Re:Not about ATA, about enterprise data storage by ChumpusRex2003 · · Score: 5, Informative

      The "Turn off Windows write-cache buffer flushing on the device" option activates an ancient windows bug, and should never be used.

      When Windows 3.11 was released, MS accidentally introduced a bug, whereby a call to "sync" (or whatever the windows equivalent was called) would usually be silently dropped. At the time, a few programmers noticed that their file I/O appeared to have improved, and attributed this to MS's much marketed new 32-bit I/O layer. What a lot of naive developers didn't notice was that the reason their I/O appeared to be faster was that the OS was handling file streams in an aggressive write-back mode, and then calls to "sync" were being ignored by the OS.

      Because of this, there was a profusion of office software, in particular, accounting software, which would "sync" frequently - some packages would call "sync" on every keypress, or every time enter was pressed, or the cursor moved to the next data entry field. As this call was effectively a NOP on 3.11, a lot of packages made it onto client machines, and because it was fast, no one noticed.

      With Win95, MS fixed the bug. Suddenly, corporate offices around the world had their accounting software reduced to glacial speed, and tech support departments at software vendors rapidly went into panic mode. Customers were blaming MS, Win95 was getting slated, lawyers were starting to drool, etc. Developers were calling senators and planning anti-trust actions. The whole thing was getting totally out of hand.

      In the end, MS decided the only way to deal with this bad PR was to put an option into windows, where the bug could be reproduced for software which depended upon it. The option to activate the bug was hidden away reasonably well, in order to stop most people from turning it on, and running their file-system in a grossly unstable mode. However, in Win95 - Vista, it had a rather cryptic name "Advanced performance", which meant that a lot of hardware enthusiasts would switch it on, in order to improve performance, without any clear idea of what it did. At least in Win7 it now has a clear name, even though it still doesn't make clear that it should only be used with defective software.

  8. NCQ - Native Command Queueing by wonkey_monkey · · Score: 4, Informative

    Native Command Queueing

    Because not everybody knows everything™

    --
    systemd is Roko's Basilisk.
  9. Get Hardware RAID by FranTaylor · · Score: 4, Insightful

    The people who make hardware RAID know all about the lying drives, they get good information from the manufacturer on how to make the drives play nice with the RAID controller.

    Just read the compatibility charts for your RAID controller, many drives have footnotes with minimum drive firmware requirements and other odd behavior.

    1. Re:Get Hardware RAID by randallman · · Score: 3, Interesting

      The only real advantage to "Hardware RAID" is the battery backed cache. Hardware RAID comes with the disadvantage of a whole other operating system "firmware" with its own bugs and often proprietary disk layout. Parity calculations are nothing for current CPUs, so the onboard processor is not so useful. Advanced filesystems such as ZFS or BTRFS need direct access to the disks. I'd like to see drives and/or controllers with battery backed cache. Until then, I rely on my UPS.

  10. Linus's Input on Write Cache by randallman · · Score: 3, Interesting

    I think this is quite interesting.

    http://yarchive.net/comp/linux/drive_caches.html

    While I've often gotten the impression that the write cache opens up a large "write hole", Linus says that data is cached only for milliseconds, not held in the cache for several seconds. Still, I'd like to see battery backed caches in regular drives and/or controllers.

    Would be nice to hear from some drive firmware writers.

  11. The real problems are entirely different by amorsen · · Score: 2

    The article is total crap, every disk supports NCQ as half the world's population has pointed out in the comments.

    The problems are elsewhere: When a disk suddenly loses power while it is writing, there is a risk of various interesting errors. The disk may a) write nulls instead of the correct data, b) write garbage instead of the correct data, c) fail in the middle of a Read-Modify-Write operation and therefore destroy data in files which weren't written to at all, d) write good data to the wrong place on the disk, e) write garbage to a random spot on the disk. Sometimes you are lucky and the errors result in bad hardware checksums so you know you have lost data, at other times the wrong data gets the correct checksum.

    In practice, very few desktop/notebook/whatever users will see these problems. No reviews test for these types of errors, so you cannot try to buy drives which fail in less harmful ways. If you care enough, you will use file systems with checksumming designed to catch all the above errors and more (Btrfs and ZFS come to mind). They will at least notify you that it happened, and depending on the redundancy settings they may be able to rescue the destroyed data.
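
    To make the checksumming idea concrete, here is a toy userland version at whole-file granularity. It is not how Btrfs or ZFS implement it (they checksum blocks inside the file system), and the helper names are invented for this sketch; cksum is the standard POSIX CRC utility.

```shell
# record_checksum FILE  -> stores the checksum in FILE.cksum
# verify_checksum FILE  -> returns 0 only if FILE still matches FILE.cksum
# Both helper names are hypothetical.
record_checksum() {
    # Reading from stdin keeps the file name out of cksum's output.
    cksum < "$1" > "$1.cksum"
}

verify_checksum() {
    [ "$(cksum < "$1")" = "$(cat "$1.cksum")" ]
}

# Usage:
#   record_checksum data.db
#   verify_checksum data.db || echo "data.db no longer matches its checksum"
```

    Like the file systems mentioned above when they run without redundancy, this can only report that something changed; getting the data back needs a second copy.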

    --
    Finally! A year of moderation! Ready for 2019?
  12. Can he turn water to wine too? by arth1 · · Score: 2

    Given that we are talking about Kirk McKusick an appeal to authority is entirely fair. Just because he didn't have a bunch of citations or references listed at the bottom of the article does not mean they do not exist somewhere. For you to say it is a "fallacious" appeal to authority is unfair - it has not been proven as fallacious

    It's usually up to the one who makes a claim to back it up with evidence, not for others to disprove it - and they can't either, because there's no falsifiability here. If I show that my drive has NCQ that works, that still doesn't falsify his claim. I can't bloody well test every drive on the planet, so there's no way to disprove him.
    So yes, this is appeal to authority and what you do is putting the onus on those who disagree to prove a negative.

    He may be right, and he's certainly renowned, but to jump from there to "therefore he is right" is bunk. Even Einstein and Feynman made wrong claims. No one is immune. So some evidence would be welcome.