Your Hard Drive Lies to You

Posted by CowboyNeal on Thursday May 12, 2005 @07:18PM from the say-it-ain't-so dept.

fenderdb writes "Brad Fitzpatrick of LiveJournal fame has written a utility and a quick article on how all hard drives from the consumer level to the highest level 'enterprise' grade SCSI and SATA drives do not obey the fsync() function. Manufacturers are blatantly sacrificing integrity in favor of scoring higher on 'pure speed' performance benchmarking."

Re:Err... "lying" is the default setting. RTFM. by ewhac · 2005-05-12 19:35 · Score: 5, Informative

Yes, except there is a 'sync' command packet that is supposed to make the drive commit outstanding buffers to the platters, and not signal completion until those writes are done. It would appear, at first blush, that the drives are mis-handling this command when write-caching is enabled.
There is historical precedent for this. There were recorded incidents of drives corrupting themselves when the OS, during shutdown, tried to flush buffers to the disk just before killing power. The drive said, "I'm done," when it really wasn't, and the OS said Okay, and killed power. This was relatively common on systems with older, slower disks that had been retrofitted with faster CPUs.

However, once these incidents started occurring, the issue was supposed to have been fixed. Clearly, closer study is needed here to discover what's really going on.

Schwab

--
Editor, A1-AAA AmeriCaptions
Author lied when implied that DRIVES are the issue by Anonymous Coward · 2005-05-12 19:42 · Score: 5, Informative

The author lied when he implied that DRIVES are the issue.

ATA-IDE, SCSI, and S-ATA drives from all major manufacturers will accept commands to flush the write buffer including track cache buffer completely.

These commands are critical before cutting power and "sleeping" in machines that can perform a complete "deep sleep" (no power at all whatsoever sent to the ATA-IDE drive).

Such OSes include Apple's OS 9 on a G4 tower, and some versions of OSX on machines not supplied with certain naughty video cards.

Laptops, for example need to flush drives... AND THEY do.

All drives conform.

As for DRIVER AUTHORS not heeding the special calls sent to them.... he is correct.

Many driver writers (other than me) are loser shits that do not follow standards.

As for LSI RAID cards, he is right, and other RAID cards... that is because the products are defective. But the drives are not and the drivers COULD be written to honor a true flush.

As for his "discovery" of sync not working.... DUH!!!!!

The REAL sync is usually a privileged operation, sent from the OS, and not highly documented.

For example on a Mac the REAL sync in OS9 is a jhook trap and not the documented normal OS call which has a governor on it.

Mainframes such as PRIMOS and other old mainframes including even unix typically faked the sync command and ONLY allowed it if the user was at the actual physical systems console and furthermore logged in as a root or backup operator.

This cheating always sickened me, but all OSes do this because so many people who think they know what they are doing try to sync all the time for idiotic self-rolled journalling file systems and journalled databases.

But DRIVES, except a couple S-ATA seagates from 2004 with bad firmware, ALWAYS will flush.

This author should have explained that it's not the hard drives.

They perform as documented.

Admittedly Linux used to corrupt and not flush several years ago... but it was not the IDE drives. They never got the commands.

It's all a mess... but setting a DRIVE to not cache is NOT the solution! It's retarded to do so, and all the comments in this thread talking of setting the cache off are foolish.

As for caching device topics, there are many options.

1> SCSI WCE permanent option

2> ATA Seagate Set Features command 82h Disable write cache

3> ATA config commands sent over SCSI (RAID card) device using a SCSI CDB in passthrough. It uses 16 byte CDB with 8h, or 12 byte CDB with Ah for sending the tunneled command.

4> ATA ATAPI commands for WCE bit, as if it was SCSI

Fibre Channel drives of course honor SCSI commands.

As for mere flushing, a variety of low level calls all have the same desired effect and are documented in respective standards manuals.
Re:Corporate Integrity by Dorsai65 · 2005-05-12 19:42 · Score: 4, Informative

What the article is saying is that the drive (or sometimes the RAID card and/or OS) is lying (with fsync) when it answers that it wrote the data: it didn't; so when you lose power, the data that was in cache (and should have been written) gets lost. It isn't a question of whether caching is turned on or not, but the drive truthfully saying whether or not the data was actually written.
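One rough way to see whether something in the stack is answering early is to time repeated write-plus-fsync() cycles against the same block. The sketch below is a heuristic only, not the utility from the article: the helper names `fsync_rate` and `max_rotational_rate` are made up for illustration, and the 7200 RPM figure is an assumption about the disk. A spinning disk that really commits each overwrite of the same block cannot do much better than one commit per rotation, so a rate far above that suggests a cache answered instead of the platters. SSDs and battery-backed controllers will legitimately exceed it.

```python
import os
import tempfile
import time


def fsync_rate(directory, rounds=50):
    """Return write+fsync cycles per second for a small file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for i in range(rounds):
            os.lseek(fd, 0, os.SEEK_SET)   # overwrite the same block each time
            os.write(fd, b"%08d" % i)
            os.fsync(fd)                    # ask the OS to push it to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return rounds / max(elapsed, 1e-9)


def max_rotational_rate(rpm=7200):
    """One commit per rotation: rotations per second for a spinning disk."""
    return rpm / 60.0


if __name__ == "__main__":
    rate = fsync_rate(".")
    limit = max_rotational_rate()
    print(f"{rate:.0f} fsyncs/s (one-per-rotation limit at 7200 RPM: {limit:.0f})")
```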

--
--- Asking inconvenient questions for over 30 years...
Here's how by Moraelin · 2005-05-12 19:44 · Score: 4, Informative

For example, don't think "home user losing the last porn pic", think for example "corporate databases using XA transactions".

The semantics of XA transactions say that at the end of the "prepare" step, the data is already on the disc (or whatever other medium), just not yet made visible. That is, basically everything that could possibly fail has already had its chance to fail. And if you got an OK, then it didn't.

Introducing a time window (likely extending not just past "prepare", but also past "commit") where the data is still in some cache and God knows when it'll actually get flushed, throws those whole semantics out the window. If, say, power fails (e.g., PSU blows a fuse) or shit otherwise hits the fan in that time window, you have fucked up the data.

The whole idea of transactions is ACID: Atomicity, Consistency, Isolation, and Durability:

- Atomicity - The entire sequence of actions must be either completed or aborted. The transaction cannot be partially successful.

- Consistency - The transaction takes the resources from one consistent state to another.

- Isolation - A transaction's effect is not visible to other transactions until the transaction is committed.

- Durability - Changes made by the committed transaction are permanent and must survive system failure.

That time window we introduced makes it at least possible to screw 3 out of 4 there. An update that involves more than one hard drive may not be Atomically executed in that case: only one change was really persisted. (E.g., if you booked a flight online, maybe the money got taken from your account, but not given to the airline.) It hasn't left the data in a Consistent state. (In the above example some money has disappeared into nowhere.) And it's all because it wasn't Durable. (An update we thought we committed hasn't, in fact, survived a system failure.)
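The flight-booking example can be played out with a toy model. This is a simulation under stated assumptions, not how any real drive or XA resource manager works: the `Disk` class, its `honest` flag, and the balances are all hypothetical, and a "lying" disk is modelled as one that acknowledges a sync without moving anything out of its volatile cache.

```python
class Disk:
    """Toy disk: writes land in a volatile cache until flushed to 'platters'."""

    def __init__(self, honest):
        self.honest = honest
        self.cache = {}
        self.platters = {}

    def write(self, key, value):
        self.cache[key] = value

    def sync(self):
        # An honest disk commits before answering; a lying one just says "done".
        if self.honest:
            self.platters.update(self.cache)
            self.cache.clear()
        return True

    def power_fail(self):
        self.cache.clear()  # whatever was only in cache is gone


def transfer(bank, airline, amount):
    """Debit one disk and credit the other, syncing each before reporting success."""
    bank.write("balance", bank.platters["balance"] - amount)
    airline.write("balance", airline.platters["balance"] + amount)
    return bank.sync() and airline.sync()


def run(airline_honest):
    """Return (transfer reported committed, total money left after a power failure)."""
    bank, airline = Disk(honest=True), Disk(honest=airline_honest)
    bank.platters["balance"] = 500
    airline.platters["balance"] = 0
    committed = transfer(bank, airline, 200)
    bank.power_fail()
    airline.power_fail()
    total = bank.platters["balance"] + airline.platters["balance"]
    return committed, total
```

With both disks honest, `run(True)` reports the transfer committed and the total is still 500. With the airline's disk lying, `run(False)` also reports it committed, but only 300 is left: the debit survived and the credit did not.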

--
A polar bear is a cartesian bear after a coordinate transform.
He misunderstands fsync() by Dahan · 2005-05-12 20:07 · Score: 4, Informative

According to SUSv3:
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.

If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure.
(Emphasis added). If you don't want your hard drive to cache writes, send it a command to turn off the write cache. Don't rely on fsync(). Either that, or hack your kernel so that fsync() will send a SYNCHRONIZE CACHE command to the drive. That'll sync the entire drive cache though, not just the blocks associated with the file descriptor you passed to fsync().
Re:What's this? by thsths · 2005-05-12 20:10 · Score: 5, Informative

> 1,000,000,000 bytes != 1 Gigabyte

Actually, it is. The standard was updated in 1998 to avoid confusion (Standard IEC 60027-2). Giga is 10^9, and it is constant, which means it does not change just because you use it for hard disks or memory.

If you mean 2^30, then you have to say gigabinary, abbreviated as gibi or Gi. Having different names for different things can avoid an awful lot of confusion, so I would very much recommend using them.

And now please put the following events into the correct order: America goes metric, hell freezes over, people use Gibi correctly.
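For anyone who wants the arithmetic spelled out, a short sketch follows. The 200 GB drive size is just an illustrative number, not something from the thread.

```python
GIGA = 10**9   # SI prefix: bytes in 1 GB
GIBI = 2**30   # IEC binary prefix: bytes in 1 GiB


def gb_to_gib(gigabytes):
    """Convert decimal gigabytes to binary gibibytes."""
    return gigabytes * GIGA / GIBI


if __name__ == "__main__":
    print(GIBI - GIGA)               # bytes of difference per unit: 73741824
    print(round(gb_to_gib(200), 2))  # a "200 GB" drive in GiB: 186.26
```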
Re:Of course it does! by cowbutt · 2005-05-12 21:00 · Score: 4, Informative

Sort of, yes:
# smartctl -a /dev/hde | grep 'Reallocated_Sector_Ct'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
This indicates that /dev/hde is far from exhausting its supply of reserved blocks (the first 100) and never has been (the second 100, which is 'worst'). When it crosses the threshold (36) (or the threshold of any of the other 'Pre-fail' attributes for that matter), failure is imminent.
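The reading of that row can be sketched in a few lines. The function names here are hypothetical, and the column positions are assumed from the sample line above rather than from any formal smartctl output specification.

```python
def parse_smart_attribute(line):
    """Split one smartctl attribute row into named fields (layout assumed from the sample)."""
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),      # current normalized value (the first 100)
        "worst": int(fields[4]),      # lowest value seen so far (the second 100)
        "threshold": int(fields[5]),  # failure threshold (036)
        "type": fields[6],
        "raw": fields[9],
    }


def is_failing(attr):
    """A Pre-fail attribute is in trouble once its value reaches the threshold."""
    return attr["type"] == "Pre-fail" and attr["value"] <= attr["threshold"]


row = "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0"
attr = parse_smart_attribute(row)

if __name__ == "__main__":
    print(attr["name"], attr["value"], attr["threshold"], is_failing(attr))
```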
Re:Err... "lying" is the default setting. RTFM. by Everleet · 2005-05-12 21:13 · Score: 5, Informative

fsync() is pretty clearly documented to cause a flush of the kernel buffers, not the disk buffers. This shouldn't come as a surprise to anyone.
From Mac OS X --

DESCRIPTION

Fsync() causes all modified data and attributes of fd to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.

Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence. Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present while earlier writes are not. This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of their data, MacOS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications such as databases that require a strict ordering of writes should use F_FULLFSYNC to ensure their data is written in the order they expect. Please see fcntl(2) for more detail.

From Linux --

NOTES

In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.

From FreeBSD's tuning(7) --

IDE WRITE CACHING

FreeBSD 4.3 flirted with turning off IDE write caching. This reduced write bandwidth to IDE disks but was considered necessary due to serious data consistency issues introduced by hard drive vendors. Basically the problem is that IDE drives lie about when a write completes. With IDE write caching turned on, IDE hard drives will not only write data to disk out of order, they will sometimes delay some of the blocks indefinitely under heavy disk load. A crash or power failure can result in serious file system corruption. So our default was changed to be safe. Unfortunately, the result was such a huge loss in performance that we caved in and changed the default back to on after the release. You should check the default on your system by observing the hw.ata.wc sysctl variable. If IDE write caching is turned off, you can turn it back on by setting the hw.ata.wc loader tunable to 1. More information on tuning the ATA driver system may be found in the ata(4) man page.

There is a new experimental feature for IDE hard drives called hw.ata.tags (you also set this in the boot loader) which allows write caching to be safely turned on. This brings SCSI tagging features to IDE drives. As of this writing only IBM DPTA and DTLA drives support the feature. Warning! These drives apparently have quality control problems and I do not recommend purchasing them at this time. If you need performance, go with SCSI.
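Going by the Mac OS X text quoted above, an application that cares can ask for F_FULLFSYNC where the platform offers it and settle for plain fsync() elsewhere. A minimal sketch, assuming a Unix-like system with Python's `fcntl` module; the helper names `full_sync` and `write_durably` are made up here, and on platforms without F_FULLFSYNC this gives no more than whatever that OS's fsync() promises.

```python
import fcntl
import os


def full_sync(fd):
    """Use F_FULLFSYNC where available, otherwise plain fsync().

    Returns the name of the mechanism that was used.
    """
    full = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS builds
    if full is not None:
        try:
            fcntl.fcntl(fd, full)               # asks the drive itself to flush
            return "F_FULLFSYNC"
        except OSError:
            pass                                # not supported here; fall back
    os.fsync(fd)  # kernel buffers flushed; the drive cache is up to the OS
    return "fsync"


def write_durably(path, data):
    """Write `data` to `path` and sync before returning the mechanism used."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)  # small writes only; a real tool would loop on short writes
        return full_sync(fd)
    finally:
        os.close(fd)
```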

--
It's tragic. Laugh.
Re:Of course it does!-Perfect world. by cowbutt · 2005-05-12 21:15 · Score: 4, Informative

> Part of me wonders if this explains the anecdotal stories that SCSI disks are more reliable than their cheaper ATA counterparts - even when they use the same physical hardware. Perhaps (and this is blind speculation) the drives with fewer errors get sold to the customers willing to pay more.

Sort of. According to this paper from Seagate, the main differences between SCSI and ATA are:

- SCSI drives are individually tested, rather than tested in batch

- SCSI drives typically have a 5 year warranty, rather than 1 year for ATA (note that Seagate's ATA drives also have 5 years, and WD's Special Edition -JB ATA drives have 3 years).

- SCSI drives usually have higher rotational speeds (i.e. 10K or 15K RPM vs. 7200RPM)

- SCSI drives usually make use of the latest technology. ATA uses whatever older technology has been cost-engineered to a suitable price-point

- The physical and programming interface

I also suspect that SCSI drives have a larger number of reserved blocks for remapping, and that they remap blocks on read operations when the ECC indicates that a block has crossed some threshold of near-unreadability. This would account for a) SCSI drives' lower capacities and b) a report I had from a SCSI-using friend running BSD who reports that a 'remapping' message turned up in his syslog without needing any special action to invoke.

By contrast, in my experience, ATA drives only remap failed blocks on write operations. Lots of people think that when a drive returns a read error on a file, it's only fit for the bin, but I've forced the remapping to take place by writing to the affected blocks (either by zeroing the entire partition or drive using dd or badblocks -w, or by removing the affected file then creating a large file that fills all unallocated space in a partition, then removing it to reclaim the space).
Re:Err... "lying" is the default setting. RTFM. by frinkazoid · 2005-05-12 23:25 · Score: 4, Informative

This is true. Installing a fresh Windows 98 SE on a fairly new PC and then doing Windows Update, there is an update with this description:

The Windows IDE Hard Drive Cache Package provides a workaround to a recently identified issue with computers that have the combination of Integrated Drive Electronics (IDE) hard disk drives with large caches and newer/faster processors. Computers with this combination may risk losing data if the hard disk shuts down before it can preserve the data in its cache.

This update introduces a slight delay in the shutdown process. The delay of two seconds allows the hard drive's onboard cache to write any data to the hard drive.

I found it nice to see how M$ worked around it, just waiting 2 seconds, how ingenious !
link to the M$ update site: http://www.microsoft.com/windows98/downloads/contents/WUCritical/q273017/Default.asp
Re:drive write caching _is unsafe_. by putaro · 2005-05-13 01:37 · Score: 4, Informative

Let's try a reply with a bit less flame attached.

A journaling file system will know when it needs to get everything committed to disk in order to have a consistent state. At that point it will issue a sync to the drive to flush the drive's write cache. However, not every write has to get to the disk for the filesystem to be in a consistent state.

Now, you're yelling BS, BS, BS...hold on and listen for a minute. I write file systems for a living and have done so for over 15 years.

What is the commitment that a journaling file system makes to you? It makes the commitment that it will not be in an inconsistent state. It doesn't make the commitment that every last write will make it to disk. For example, ext3 in its default journaling mode only journals metadata transactions. Any data writes that you make are not guaranteed at all, unless you make the proper sync call. As someone pointed out above, fsync is not the proper call on many OSes.

The way that we have settled on to make filesystems and databases work is to create atomic transactions and move from transaction to transaction. If a transaction fails (for any reason, but let's just assume it's because the system crashed), all of the data that was written as part of it is discarded when you restart. If the partial data was not discarded then the filesystem would be in an inconsistent state AND the data that you were writing (if you care about consistency) would be in an inconsistent state. So, forcing every write to immediately go to disk is pointless as if the transaction you're doing is interrupted you'll be discarding the data anyway. It's only when you are finishing the transaction that you need to make sure that everything is on disk. By that time it might be already, especially if that transaction was large.

Let's take a simple situation. Say that you have a filesystem that guarantees that every time you do a write() call, when the call returns that data will be on disk and available for you the next time, and that if the write() errors or does not return, the file will be as it was before the write() was called. Now, you do a write of 100MB with a single call. The filesystem may scatter that data all over the disk depending on how fragmented it is. Forcing each write to disk in order will bang the head a lot and reduce your performance. By letting the write cache do its job and reorder writes as necessary your performance will be much better. (We used to do this in the driver and file system cache. However, modern disk drives provide such an abstract interface that it's nearly impossible for the OS to micromanage write ordering. In the old days the OS knew where the head was because it told the damn drive where to put it. Now, you can sort of guess and you're usually wrong.) Cache on ATA drives tops out at around 16MB so you will definitely flush most of the data out of the cache in the course of writing anyway. Finally, at the end, before returning, the FS would sync the drive's cache to the disks and mark the transaction as closed. Were the system to crash in the middle of the write, then when the system restarts it would need to discard any data that might have been written, and it wouldn't matter which data had been written or not written. (Important note: Journaling file systems and databases have a recovery process after a crash. It's just a lot less involved than running fsck or CHKDSK over the whole disk.)
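The "sync only when the transaction closes, discard anything unfinished on restart" idea can be sketched with a tiny append-only journal. This is an illustration under loud assumptions, not how ext3 or any real database lays out its log: the JSON record format, the function names, and the `crash_before_commit` switch are all invented here, and torn (half-written) records are ignored entirely.

```python
import json
import os


def append_transaction(journal_path, txn_id, writes, crash_before_commit=False):
    """Append one transaction's records; sync only once, at commit time."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        for key, value in writes.items():
            journal.write(json.dumps({"txn": txn_id, "key": key, "value": value}) + "\n")
        if crash_before_commit:
            return  # simulate a crash: data records exist, commit marker does not
        journal.write(json.dumps({"txn": txn_id, "commit": True}) + "\n")
        journal.flush()
        os.fsync(journal.fileno())  # the one point where durability is required


def recover(journal_path):
    """Rebuild state from the journal, discarding uncommitted transactions."""
    pending, state = {}, {}
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            record = json.loads(line)
            if record.get("commit"):
                # Commit marker seen: this transaction's writes become visible.
                state.update(pending.pop(record["txn"], {}))
            else:
                pending.setdefault(record["txn"], {})[record["key"]] = record["value"]
    return state  # anything still in `pending` never committed and is dropped
```

If transaction 1 commits `{"a": 1}` and transaction 2 crashes after writing `{"a": 2, "b": 3}`, recovery yields `{"a": 1}`: which of transaction 2's records reached the disk makes no difference, as described above. The sketch also leans on the same caveat as the rest of the thread, namely that the single fsync() at commit has to actually reach the platters.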

So, write caching is valuable and widely used. In order to avoid data corruption it's not necessary to turn off caching but it is necessary for the cache to do what it is told, when it is told (all of the write caches too, not just the disk's). Were the disks truly lying to the OS it would be bad. More likely, this guy's Perl script is just not OS specific enough to get the OS to really do what he thinks he is asking it to do. There's a reason why serious data management apps need to be ported and certified on an OS. Getting everything to do its job right is tough.
Much ado about nothing by jgarzik · 2005-05-13 01:50 · Score: 4, Informative

All it would have taken is ten minutes of searching on Google to discover what is going on.

You need a vaguely recent 2.6.x kernel to support fsync(2) and fdatasync(2) flushing your disk's write cache. Previous 2.4.x and 2.6.x kernels would only flush the write cache upon reboot, or if you used a custom app to issue the 'flush cache' command directly to your disk.

Very recent 2.6.x kernels include write barrier support, which flushes the write cache when the ext3 journal gets flushed to disk.

If your kernel doesn't flush the write cache, then obviously there is a window where you can lose data. Welcome to the world of write-back caching, circa 1990.

If you are stuck without a kernel that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) command, it is trivial to write a userspace utility that issues the command.
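As a rough idea of what such a utility looks like, here is a sketch for Linux and ATA disks only. It assumes the HDIO_DRIVE_CMD ioctl (0x031F in linux/hdreg.h) and the ATA FLUSH CACHE opcode E7h, and it needs root. The function names and the device path are hypothetical, and whether a given driver passes the command through is not something this sketch can promise.

```python
import fcntl
import os

HDIO_DRIVE_CMD = 0x031F   # ioctl number from linux/hdreg.h
ATA_FLUSH_CACHE = 0xE7    # ATA FLUSH CACHE command opcode


def flush_cache_args():
    """Build the 4-byte buffer for HDIO_DRIVE_CMD: the command byte, then three zero bytes."""
    return bytearray([ATA_FLUSH_CACHE, 0, 0, 0])


def flush_drive_cache(device):
    """Send FLUSH CACHE to an ATA disk, e.g. /dev/hda (hypothetical path; needs root)."""
    fd = os.open(device, os.O_RDONLY | os.O_NONBLOCK)
    try:
        fcntl.ioctl(fd, HDIO_DRIVE_CMD, flush_cache_args())
    finally:
        os.close(fd)


if __name__ == "__main__":
    import sys
    flush_drive_cache(sys.argv[1])
```

SCSI disks would need SYNCHRONIZE CACHE sent through a different interface, which this sketch does not attempt.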

Jeff, the Linux SATA driver guy