Slashdot Mirror


MIT's New File System Won't Lose Data During Crashes

jan_jes sends news that MIT researchers will soon present a file system they say is mathematically guaranteed not to lose data during a crash. While building it, they wrote and rewrote the file system over and over, finding that the majority of their development time was spent defining the system components and the relationships between them. "With all these logics and proofs, there are so many ways to write them down, and each one of them has subtle implications down the line that we didn’t really understand." The file system is slow compared to other modern examples, but the researchers say their formal verification can also work with faster designs. Associate professor Nickolai Zeldovich said, "Making sure that the file system can recover from a crash at any point is tricky because there are so many different places that you could crash. You literally have to consider every instruction or every disk operation and think, ‘Well, what if I crash now? What now? What now?’ And so empirically, people have found lots of bugs in file systems that have to do with crash recovery, and they keep finding them, even in very well tested file systems, because it’s just so hard to do.”

27 of 168 comments (clear)

  1. But is it useful? by Junta · · Score: 4, Interesting

    slow compared to other modern examples, but the researchers say their formal verification can also work with faster designs

    If we can accept 'slow', it's not that difficult to build an always consistent filesystem. While they can formally verify a faster design should one exist, there remains the question of whether it's possible to be formally proven to be resilient to data loss while taking some of the utterly required performance shortcuts for acceptable performance. I suspect the answer is that some essential performance 'shortcuts' will fail that verification.

    --
    XML is like violence. If it doesn't solve the problem, use more.
    1. Re:But is it useful? by Anonymous Coward · · Score: 5, Funny

      A sufficiently slow writing filesystem is indistinguishable from a read-only filesystem. Read-only filesystems are consistent. Therefore, a slow enough writing filesystem is consistent.

      For more serious analysis of crash-resistant write sequences, look elsewhere in this discussion.

  2. Formally verified system fails by Anonymous Coward · · Score: 3, Funny

    MIT's new "crash-proof file system" crashed today amid accusations of bugs in the formal proof verification software used to formally verify it.

    MIT are now working on a formal verification of the formal verification system, in order to avoid similar formal-verification-related problems in the future.

  3. Linux File Systems by Anonymous Coward · · Score: 5, Interesting

    I find some of the current file systems to be adequately reliable. Even their performance is acceptable. But, the Linux systems are lacking.

    Is there a reliable Linux file system such as EXT4 that has an easy to use copy on write(CoW) feature to allow instant recovery of any file changed at any time?

    rm ./test
    restore --last test ./

    dd if=/dev/random of=./test bs=1M count=10
    restore --datetime test ./

    Novell Netware FS did all this and more in 1995 FFS! Fifteen years later and Linux doesn't seem to be able to do this. NTFS doesn't seem to be able to do this either. Yet Novell is dead?

    1. Re:Linux File Systems by Anonymous Coward · · Score: 2, Informative

      Well, ex-Googler Kent Overstreet recently announced a COW filesystem on lkml:

      https://lkml.org/lkml/2015/8/21/22

      Not ready for production I would say, but looks interesting.

    2. Re:Linux File Systems by swb · · Score: 4, Interesting

      I still think Netware's filesystem permission model was better than anything out there now, at least for filesharing.

      The feature I miss the most is allowing traversal through a directory hierarchy a user has no explicit permissions for to get to a folder they do have permissions for. I find the workarounds for this in other filesystems to be extremely ugly.

      I think NDS was better in a lot of ways than AD, although it would have been nice if there had been something better than bindery mode for eliminating the need for users to know their fully qualified NDS name.

      I also kind of wish TCP/IP had used the network:mac numbering scheme that IPX used. The rest of IPX/SPX I don't need, but there'd be no talk of address exhaustion of IPv4 if that scheme had been adopted, little need for DHCP address assignment and the addressing scheme would scale to the larger broadcast domains enabled by modern switching (avoiding the need to renumber legacy segments completely when they exhausted a /24 space and expansion via mask reduction wasn't possible due to linear numbering on adjacent segments).

    3. Re:Linux File Systems by Christian+Smith · · Score: 3, Interesting

      I find some of the current file systems to be adequately reliable. Even their performance is acceptable. But, the Linux systems are lacking.

      Is there a reliable Linux file system such as EXT4 that has an easy to use copy on write(CoW) feature to allow instant recovery of any file changed at any time?

      NILFS2 provides continuous point in time snapshots, which can be selectively mounted and data recovered. Not quite as instant recovery as your use case examples, but it's only a few commands/wrapper scripts away.

    4. Re:Linux File Systems by nbvb · · Score: 3, Interesting

      Bah, forget NetWare, VINES and StreetTalk did everything you ask for and then some way before NDS was even a thought.

      VINES' ACL's were beautifully granular ...

    5. Re:Linux File Systems by swb · · Score: 2

      It's a kludge, though, in NTFS, in Netware it just worked.

      IPX/SPX as a layer 3 protocol isn't what I wanted, I wanted TCPv4 with a network prefix and the MAC as the node address. Clients could derive their address automatically from network traffic by picking up the network address from the wire and they already knew their MAC address.

      Although to be honest, IPX/SPX even as a secondary protocol wasn't that bad to support in a mixed environment. We had no issues with TCP, IPX *and* CrappleTalk on 512k frame relay links. We retired internetworking on IPX when the last Netware servers went away but were stuck with AppleTalk (circa 2003).

    6. Re:Linux File Systems by LDAPMAN · · Score: 2

      It still is: https://dl.netiq.com/Download?...

      I build multi-million user IAM systems on it for a living.

    7. Re:Linux File Systems by Bengie · · Score: 2

      A true versioning file system would crumble under an IO workload with many small updates. The problem with btrfs is you can't snapshot two subvolumes in sync with eachother. Snapshots are per subvolume. In ZFS, snapshots are at the pool level, allowing volumes to be in perfect sync.

  4. It's a matter of distance by aglider · · Score: 2

    Not a matter of proof. The distance between perfect design and buggy implementation. IMHO.

    --
    Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
  5. Is it really THAT hard? by ledow · · Score: 4, Interesting

    Write zero to a flag.
    Write data to temporary area.
    Calculate checksum and keep with temporary area.
    When write is complete, signal application.
    Copy data from temporary area when convenient.
    Check checksum from temporary to permanent is the same.
    Mark flag when finished.

    If you crash before you write the zero, you don't have anything to write anyway.
    If you crash mid-write, you've not signalled the application that you've done anything anyway. And you can checksum to see if you crashed JUST BEFORE the end, or half-way through.
    If you crash mid-copy, your next restart should spot the temporary area being full with a zero-flag (meaning you haven't properly written it yet). Resume from the copy stage. Checksum will double-check this for you.
    If you crash post-copy, pre-flagging, you end up doing the copy twice, big deal.
    If you crash post-flagging, your filesystem is consistent.

    I'm sure that things like error-handling are much more complex (what if you have space for the initial copy but not the full copy? What if the device goes read-only mid-way through?) but in terms of consistency is it really all that hard?

    The problem is that somewhere, somehow, applications are waiting for you to confirm the write, and you can either delay (which slows everything down), or lie (which breaks consistency). Past that, it doesn't really matter. And if you get cut-off before you can confirm the write, data will be lost EVEN ON A PERFECT FILESYSTEM. You might be filesystem-consistent, but it won't reflect everything that was written.

    Journalling doesn't need to be mathematically-proven, just logically thought through. But fast journalling filesystems are damn hard, as these guys have found out.

    1. Re:Is it really THAT hard? by Guspaz · · Score: 4, Insightful

      > When write is complete, signal application.

      How do you know the write was complete? Most storage hardware lies about completing the write. The ZFS folks found this out the hard way: their filesystem was supposed to survive arbitrary power failures, and on a limited set of hardware that was true. In reality, most drives/controllers say they've committed the write to disk when it's still in their cache.

      Any filesystem that claims to survive crashes needs to take into account that any write confirmation could be a lie, and that any data it has written in the past may still be in a volatile cache.

    2. Re:Is it really THAT hard? by fnj · · Score: 2

      You can't expect software to paper over BUSTED HARDWARE. If a disk drive flat out lies about status, expose the goddam manufacturer and sue him out of existence. If you think anyone can paper over the scenario you just outlined, then what about this - what about a disk drive that NEVER WRITES ANYTHING but lies and says everything is going hunky dory? Pretty damn sure there's nothing you can do in that scenario.

      I've heard that story about the "drives that lie about about write-to-physical-media-complete" many times. As far as I am concerned it is apocryphal. Never once have I seen the manufacturers and model numbers specified. I think it probably came from a long long time ago (in technology time). Finally I sure as hell have no idea how "examples exist" turned into "most storage hardware" as the tale is spun and respun over the years.

  6. Not exactly new or news... by bradgoodman · · Score: 2

    No specifics. This has been done a million times with journaling filesystem (and block layers) - no idea why this is better or different. But what about disk failure? But what about data loss? But what about (undetected) data corruption (at the disk)? What about unexpected power hits that could drop a disk or tear a write? Not even going to get into snapshotting, disaster recovery, etc. There's a lot more to this than "surviving a crash".

  7. Code not available, will it ever be? by badger.foo · · Score: 2

    It's now August, the conference where they'll be presenting their work is in October, and the article is a tad short on specifics. They've done a formally verified formal verification of a filesystem. if it works, that's excellent news of course, but I'd wait until we have seen the thing work and with actual code to examine before making any comments or bets on how useful this is going to be. And this being an open source-oriented site, we should be asking whether the code will indeed be available under any kind of usable open source license.

    --
    -- That grumpy BSD guy - http://bsdly.blogspot.com/
  8. MIT researchers live in ivory towers by alexhs · · Score: 4, Insightful

    Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald Knuth

    If the hard drive firmware is not proven, this FS won't be any better than ZFS and others.
    Writing safe file systems is the easy part (even trivial using synchronous writes, when you consider their design is "slow").
    The impossible part is dealing with firmwares that are known to lie (including pretending than a write is synchronous): how could you not lose data if the drive didn't ever write the data to the platters in the first place ?

    --
    I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
  9. chmod 751 some_directory by tepples · · Score: 4, Informative

    The feature I miss the most is allowing traversal through a directory hierarchy a user has no explicit permissions for to get to a folder they do have permissions for. I find the workarounds for this in other filesystems to be extremely ugly.

    In POSIX, that's represented in a directory's mode bits with octal digit 1 (4: list files, 2: create or delete files, 1: traverse). What do you find ugly about mode 751 (owner create or delete, owner and group list, world traverse)?

    I also kind of wish TCP/IP had used the network:mac numbering scheme that IPX used.

    It does now. An IPv6 address is divided into a 48- or 56-bit network, a 16- or 8-bit subnet, and a 64-bit machine identifier commonly derived from the MAC.

  10. Re:Proving Coq itself by TheRaven64 · · Score: 4, Interesting

    If you actually want an answer, I suggest that you go and read the Coq papers. You might learn something.

    --
    I am TheRaven on Soylent News
  11. Yeah, right. by dargaud · · Score: 4, Interesting
    In the words of Knuth the Great: "Beware of bugs in the above code; I have only proved it correct, not tried it."

    It reminds me of a story from the late 80s (?) at a tech conference. The makers of a real-time OS with real-time snapshots would periodically pull the plug on their systems, plug it back in and it would resume exactly what it was doing, to the delight and amazement of all the techies in the assistance. In the much larger and much more expensive booth in front of them was a richer vendor. The techies started coaxing them to do the same. After much hand wringing, they did, and after a very long rebuild time the system came back as a mess. Conclusion: the 1st vendor went out of business, the 2nd one is still very big.

    --
    Non-Linux Penguins ?
  12. Re:Tough environments by mlts · · Score: 3, Interesting

    I personally encountered a drive array driver cause an entire array to get overwritten by garbage. I was quite glad that I had tape backups of the computers and the shared array, so a recovery was fairly easy (especially with IBM sysback.)

    Filesystems are one piece of a puzzle, but an important one. If that array decided to just write some garbage in a few sectors, almost no filesystem would notice that, allowing propagating of corrupted data to backups. The only two that might notice it would be a background ZFS task doing a scrub and noticing a 64 bit checksum is off, or ReFS doing something similar. Without RAID-Z2, the damage can't be repaired... but it can be found and relevant people notified.

  13. If it was easy wouldn't it already be done? by sjbe · · Score: 2

    Why am I hearing Jeremy Clarkson asking "how hard can it be?" just before utterly screwing something up?

    Perhaps there is just a tad more difficulty to it than you are considering?

  14. Key Logic's KeyKOS by Sits · · Score: 2

    I've just spent the past few minutes trying to (re)find this story. Is it the KeyKOS recovery speed story from 1990?

  15. Re:Define "crash". by ChumpusRex2003 · · Score: 3, Insightful

    However, considering how many firmware bugs there are in modern SSDs and HDs which cause out-of-order writes, shorn writes, and various other random corruptions and bad behaviour during a power failure or system reset, what would be more interesting is how the file system could cope with one of those.

  16. Re:Define "crash". by gweihir · · Score: 2

    As the Linux FS developers found some time ago, many modern disks also blatantly lie about having flushed data to disk. With software, you can then only do heuristic things to reduce the impact, not more. This "magic" MIT FS cannot do any better, and hence is entirely superfluous.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  17. Re:Define "crash". by Darinbob · · Score: 3, Insightful

    Many consumer devices generally conform to an unwritten standard, not a written one. That unwritten standard is "do whatever is needed to make it work on Windows". This means that mandatory commands might not be fully tested, or sometimes not even implemented. And because these are consumer devices, the manufacturers generally aren't concerned about rare end-user problems, since whoever bought the flash thumb drive on sale isn't going to complain, and the margins are way too small to waste time on reliability. All that matters is getting it to work for most people most of the time.