Slashdot Mirror


Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4

cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.

830 comments

  1. Not a bug by casualsax3 · · Score: 5, Informative
    It's a consequence of not writing software properly. Relevant links later in the same comment thread for those who might otherwise miss them:

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54

    1. Re:Not a bug by mbkennel · · Score: 5, Insightful

      I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.

      Quoting Ts'o:

      "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."

      In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.

      I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

      File systems are nice. That's what Unix is about.

      I don't think programmers ought to be required to treat them like a pouty flake: "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work."

    2. Re:Not a bug by Conley+Index · · Score: 1

      I already wondered about the heise.de title blaming the file system. Now Slashdot repeats it.

      I have seen the same on FreeBSD using UFS (with soft updates).

      KDE4 is supposed to be portable enough to run on file systems that have no data journaling or a guarantee for operations on different files to be written in a certain order without issuing a sync.

    3. Re:Not a bug by idontgno · · Score: 3, Insightful

      lol.

      It's a consequence of a filesystem that makes bad assumptions about file size.

      I suppose in your world, you open a single file the size of the entire filesystem and just do seek()s within it?

      It's a bug. A filesystem which does not responsibly handle any file of any size between 0 bytes and MAXFILESIZE is bugged.

      Deal with it and join the rest of us in reality.

      --
      Welcome to the Panopticon. Used to be a prison, now it's your home.
    4. Re:Not a bug by jgarra23 · · Score: 2, Interesting

      Talk about doublespeak! Not a bug vs. It's a consequence of not writing software properly. reminds me of that FG episode where Stewie says, "it's not that I want to kill Lois... it's that I don't... want... her... to... live... anymore."

    5. Re:Not a bug by virtue3 · · Score: 1

      ... and ultimately who is to say the database won't eventually be flawed because whoever programmed THAT has a workaround for the whole bloody filesystem?

      The filesystem should definitely be abstracted to the point where the software does not need to do anything super special (telling the OS to manually flush the cached writes is insane).

      Mind you, this is pretty heavy OS code, so, YMMV. Bottom line, even these guys shouldn't fucking care if you're using ext4 or fat32(just an example! and yes, there are exceptions, but general software case, you shouldn't need to) at the end of the day.

    6. Re:Not a bug by dedazo · · Score: 1, Troll

      You'll have to excuse me for chuckling a bit here, but if NTFS or the filesystem for OS X (whatever that is) had this problem and someone suggested that it's an "application problem" they'd be stoned to death.

      As an application developer, the last thing I want to worry about is whether or not the fraking filesystem is going to persist my data to disk. That's why I write applications, and other people write file systems and kernels.

      You can talk to me about good practices when doing I/O on any given platform, but please don't insult my intelligence by claiming the FS layer's failure to do something is due to my saving lots of little files, or lots of big ones, or anything in between. That's just stupid.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    7. Re:Not a bug by gweihir · · Score: 1

      Well, everybody should by now have noticed that fsync and fdatasync are not the same anymore (they were in Linux 2.2). Still, both should get your data reliably to disk (unless the disk does write buffering). Not using either and then expecting your data to be on disk is indeed an implementation problem on the application side.

      I was unable to find a system call named "fdatawrite". It seems that one does not exist or at least is an experimental and very new feature.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    8. Re:Not a bug by Qzukk · · Score: 5, Interesting

      I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

      It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that their data is safe until the drive catches up.

      Fortunately his patches will include an option to turn the magic computer fairy off.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    9. Re:Not a bug by TerranFury · · Score: 4, Insightful

      Ummm... it deals correctly with files of any size. It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk. That's the case for pretty much any filesystem; it's just a matter of degree, and how "recent" is recent.

    10. Re:Not a bug by Hatta · · Score: 3, Insightful

      The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries

      Translation: "Our filesystem is so fucked up, even SQL is better."

      WTF is this guy thinking? UNIX has used hundreds of tiny dotfiles for configuration for years and it's always worked well. If this filesystem can't handle it, it's not ready for production. Why not just keep ALL your files in an SQL database and cut out the filesystem entirely?

      --
      Give me Classic Slashdot or give me death!
    11. Re:Not a bug by icebike · · Score: 1

      It's worse than a KDE problem. It goes to the heart of Linux/Unix which
      have always been dependent on a multitude of small text files.

      Anytime you suggest users re-write their entire code base to get around
      a bug you've created your professional pride should well up, grab you by
      the wattles and slap you till you spit.

      --
      Sig Battery depleted. Reverting to safe mode.
    12. Re:Not a bug by malkir · · Score: 1

      "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work." Sounds a lot like ActionScript...

    13. Re:Not a bug by Anonymous Coward · · Score: 5, Informative

      Quoting Ts'o:

      "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, ...

      Linux reinvents windows registry?
      Who knows what they will come up with next.

    14. Re:Not a bug by fireman+sam · · Score: 4, Insightful

      The benefit of journaling file systems is that after the crash you still have a file system that works. How many folks remember when Windows would crash, resulting in a HDD that was so corrupted the OS wouldn't start? Same with ext2.

      If these folks don't like asynchronous writes, they can edit their fstab (or whatever) to have the sync option so all their writes will be synchronous and the world will be a happy place.

      Note that they will also have to suffer a slower system, and a possibly shortened lifetime of their HDD, but at least their configuration files will be safe.

      --
      it is only after a long journey that you know the strength of the horse.
    15. Re:Not a bug by GigsVT · · Score: 3, Insightful

      Instead, the answer is to use a proper small database like sqllite for application registries

      Yeah, linux should totally put in a Windows style registry. What the fuck is this guy on.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    16. Re:Not a bug by jgostling · · Score: 1

      Quoting Ts'o:

      "...use a proper small database like sqllite for application registries..."

      Is it just me or does this sound an awful lot like the dreaded Windows registry...

    17. Re:Not a bug by Logic+and+Reason · · Score: 5, Insightful

      I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

      To paraphrase https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 : You certainly can use tons of tiny files, but if you want to guarantee your data will still be there after a crash, you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.
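
      For anyone wondering what "use fsync" looks like when replacing a config file, here is a minimal sketch of the usual write-to-temp, fsync, rename sequence. The helper name save_file_durably and the ".tmp" suffix are made up for illustration, and the sketch assumes a POSIX system; it does not retry on EINTR and does not fsync the containing directory, which a stricter version might also do.

      ```c
      #include <assert.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      /* Hypothetical helper: replace `path` with `len` bytes from `buf` so that a
       * crash leaves either the old contents or the new ones, not an empty file.
       * Returns 0 on success, -1 on any error. */
      int save_file_durably(const char *path, const void *buf, size_t len)
      {
          assert(path != NULL && (buf != NULL || len == 0));

          char tmp[4096];
          int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
          if (n < 0 || (size_t)n >= sizeof tmp)
              return -1;                      /* path too long for this sketch */

          int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0)
              return -1;

          const char *p = buf;
          size_t done = 0;
          while (done < len) {                /* write() may be partial */
              ssize_t r = write(fd, p + done, len - done);
              if (r < 0) { close(fd); unlink(tmp); return -1; }
              done += (size_t)r;
          }

          /* Ask for the new data to reach the disk BEFORE the name is switched. */
          if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }
          if (close(fd) < 0) { unlink(tmp); return -1; }

          /* rename() atomically replaces the old file with the complete new one. */
          if (rename(tmp, path) < 0) { unlink(tmp); return -1; }
          return 0;
      }
      ```

      A caller would do something like save_file_durably("app.conf", text, strlen(text)) and treat -1 as "the old file is still the valid one".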

    18. Re:Not a bug by PinchDuck · · Score: 1

      The point of having a rock-solid filesystem is to have a rock-solid filesystem. Any filesystem that crashes and loses data is bad. What is the point of a journal again? To enforce someone's idea of how an API should be coded to, or to reduce data loss?

    19. Re:Not a bug by msuarezalvarez · · Score: 5, Insightful

      As an application developer, the last thing I want to worry about is whether or not the fraking filesystem is going to persist my data to disk.

      As an application developer, you are expected to know what the API does, in order to use it correctly. What Ext4 is doing is 100% respectful of the spec.

    20. Re:Not a bug by Slumdog · · Score: 1

      Talk about doublespeak! Not a bug vs. It's a consequence of not writing software properly. reminds me of that FG episode where Stewie says, "it's not that I want to kill Lois... it's that I don't... want... her... to... live... anymore."

      I think you're confusing lies with mistakes/misunderstandings. Bugs are usually unknown...unintentional. Writing code improperly is an intentional act, possibly with unknown consequences. Windows Vista isn't a "Bug" (although I expect some /. smartass to assert that it is...), Vista is simply badly designed.

    21. Re:Not a bug by davecb · · Score: 5, Insightful

      It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.

      Generally when one is trying to maintain correctness one allocates space, places the data into it and only then links the space into place (paraphrased from Barry Dwyer's "One more time - how to update a master file", Communications of the ACM, January 1981).

      I'd be inclined to delay the metadata update until after the data was written, as Mr. Ts'o notes was done in ext3. That's certainly what I did back in the days of CP/M, writing DSA-formatted floppies (;-))

      --dave

      --
      davecb@spamcop.net
    22. Re:Not a bug by lilo_booter · · Score: 1

      Have to agree - suggesting a db to replace 'hundreds of small files' is an appalling attitude. Doesn't even make sense that a developer who's ever used a source code repo would think that was reasonable.

      Unlikely scenario, but say there's a source code repo running on ext4, and as a developer, I want to make changes to hundreds of files in the repo - I check in and the server goes bang - what's the repo state likely to be? How/why would you use a db to implement the repo? Why should the repo be patched to run specifically on that file system?

      Bizarre...

    23. Re:Not a bug by davecb · · Score: 4, Informative

      Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))

      --dave

      --
      davecb@spamcop.net
    24. Re:Not a bug by somenickname · · Score: 1, Insightful

      Beyond that, he's essentially advocating the Windows Registry. He's a very smart person, but Unix is about dot files. If you take them away, you take away the "Unixness" of the machine. I don't care if a filesystem isn't pleased by hundreds or thousands of tiny config files. That's how the machine works. Make your filesystem handle it.

      Cordially,

      An ext4 user.

    25. Re:Not a bug by OeLeWaPpErKe · · Score: 5, Informative

      Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.

      Oh dear God, someone inform the president ! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.

      You should not write your application to depend on filesystem-specific behavior. You should write it to the standard, and that means fsync(). No call to fsync, no guarantee; look it up in the documentation (man 2 write).

      The rest of what Ted Ts'o is saying is optimization, speeding up the boot time for gnome/kde; it is not necessary for correct operation.

      Please don't FUD.

      You know what, I'll look up the docs for you:

      (quote from man 2 write)

      NOTES
                    A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee
                    that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.

                    If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has
                    been written, the call succeeds, and returns the number of bytes written.

      That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)

      So the normal case for a "reliable write" would be this code :

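      // needs <unistd.h>; assumes data is a byte array, as in the one-liner below
      const char *p = (const char *)data;
      size_t written = 0;
      while (written < sizeof(data)) {
              // write only the part that is still outstanding
              ssize_t r = write(fd, p + written, sizeof(data) - written);
              if (r < 0) { // error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets
                      break;
              }
              written += r;
      }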

      and *NOT*

      write(fd, data, sizeof(data)); // will probably work

      Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)

      Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.

    26. Re:Not a bug by Jurily · · Score: 1

      In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.

      In other words, if the programmer took on the burden to use a proper database interface for their database, it could have been optimized as such.

      I do agree, however, that data loss is inexcusable under any circumstances. Isn't that why we have journals in the first place?

    27. Re:Not a bug by CyprusBlue113 · · Score: 2, Insightful

      Unless you have an explicit sync there, YOUR ASSUMPTION IS BUGGED. This is completely reasonable behavior of a write caching system.

      --
      a handful of selfish greedy people are no match for millions of selfish, greedy people -u4ya
    28. Re:Not a bug by gweihir · · Score: 1

      The problem is not the tiny files. The problem is opening lots of files and rewriting them without keeping backups or doing syncs. That is inherently non-robust and should never be done on critical files.

      The database is just an example of how to do it robustly and fast. fdatasyncing a lot of small files is bound to be slow. Databases have better performance on a lot of small updates.

      Still, the blame is on the KDE people that use a very risky update pattern in a place where it is completely inappropriate.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    29. Re:Not a bug by Anonymous Coward · · Score: 0

      Guess what, writing software is hard!

      Writing good software is even harder!

      If you are lazy go take a job at McDonalds so that you don't have to think when you work ...

    30. Re:Not a bug by caerwyn · · Score: 4, Insightful

      No. It's not.

      If what you say is true there would be no need for the fsync() function (and related ones).

      Read the standards if you want. The filesystem is only bugged if it loses recent data under conditions where the application has asked it to guarantee that the data is safe. If the app hasn't asked for any such guarantee by calling fsync() or the like, the filesystem is free to do as it likes.

      --
      The ringing of the division bell has begun... -PF
    31. Re:Not a bug by Anonymous Coward · · Score: 2, Insightful

      You're wrong, and so are most comments here.

      When you open() a file in the filesystem, write() one byte to it, and close() that file, you haven't really guaranteed crap on any normal filesystem, unless you're using a very strange filesystem or you're using non-standard mount options to force every action to happen synchronously.

      If a crash happens between close() and the filesystem flushing data to disk, you will lose data. If you want to prevent this happening, you must either use calls like fsync() or fdatasync() (or many other mechanisms that act similarly), or use mount options that make all calls synchronous.

      The only reason this has become a big blow-up issue with ext4 is that while other filesystems generally would sync the data shortly anyway, ext4 does not. Everyone has been relying on bad assumptions about filesystem behavior and getting by on the fact that "usually" the situation was resolved "somewhat quickly". ext4 does not resolve these things quickly, in the name of efficiency and performance. There was never a guarantee under any filesystem of things getting done (to disk) quickly unless you explicitly ask for it.

    32. Re:Not a bug by ickpoo · · Score: 0, Redundant

      Clearly I won't be using Ext4 for a long while. The attitude of Ts'o indicates that he doesn't really know the purpose of a file system. It doesn't matter how capable this guy is, he is an idiot.

      Suggesting that this is the domain of the application is crazy. Writing a bunch of small files is par for the course for many applications; suddenly all these apps need to be reworked? The app wrote the file, probably received no error messages indicating that something might be wrong, and closed the file, and yet the file system is losing data. The file system is suspect.

      --
      I am not a script! .Sig?
    33. Re:Not a bug by Profane+MuthaFucka · · Score: 5, Funny

      That would be smart, but only if the SQL database is encrypted too. It's theoretically possible to read a registry with an editor, and we can't have that. Also, we need a checksum on the registry. If the checksum is bad, we have to overwrite the registry with zeroes. Registries are monolithic, and we have to make sure that either it's good data, or NONE of it is good data. Otherwise the user would get confused.

      I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.

      --
      Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    34. Re:Not a bug by Jurily · · Score: 4, Informative

      It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk.

      No, that's the bug. It loses ALL data. You get 0 byte files on reboot.

    35. Re:Not a bug by avalys · · Score: 1

      Sorry, but you're quite wrong here. Most filesystems can be configured at mount-time to behave in the manner you describe, but by default, they may defer writes to the disk for upwards of several seconds.

      This improves performance tremendously, and the resulting unreliability is simply a tradeoff that is required to deal with what are fundamentally very slow devices.

      http://www.eecs.umich.edu/~enightin/syncio.ps

      You do not want the filesystem striving to dump all data to disk as fast as possible, all the time - for instance, it doesn't really matter if you lose some items from your browser cache during a crash. So, the filesystem can defer writing new files in your cache until the disk is idle in between some more important operations, and the only effect you'll notice is vastly improved performance.

      --
      This space intentionally left blank.
    36. Re:Not a bug by PIBM · · Score: 2, Interesting

      That's your filesystem definition. Even there, I can guarantee you it can't be built, thus, from your point of view, no file system will ever not be bugged.

      How come ?

      I open a file
      I write one byte
      I close the file

      Data is not on disk BECAUSE IT WAS FULL and you failed to plan for intercepting errors / warnings.

      Filesystems need to be used according to their specifications, not the way you'd want them to work.

    37. Re:Not a bug by Anonymous Coward · · Score: 0

      File editors do use fsync, so do decent VCSs.

      The problem here is that KDE did not use fsync() when writing tons of small files, and the reason is that fsync() is expensive and slows down the system.
      Couple it with crazy people writing hundreds of small files at once and you get why they don't like fsync.

      It's ok if an app has 1 or 2 config files, it's unlikely it will write back that often so fsync() will not be a huge penalty considering the benefits for data integrity.
      But if an application uses hundreds of tiny files instead of a proper database then something's wrong in that application. Nobody is going to manually touch those files anyway so it really makes no sense to waste all that space for a perceived evilness of more efficient binary file formats.
      Consider that each tiny file also eats up a full block (typically 4k), plus an inode and a directory entry, so each file also eats up a lot of space for nothing.

    38. Re:Not a bug by dedazo · · Score: 1

      What Ext4 is doing is 100% respectful of the spec.

      The idiocy of the filesystem venting your data to space because it's following a "spec" does not strike you as odd, I guess?

      There are many situations where I'm supposed to follow an API or risk dire consequences. "Don't write too many little files" coming from the operating system is probably not one of them. Especially on any Unix.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    39. Re:Not a bug by Omnifarious · · Score: 1, Insightful

      Filesystems that cannot handle thousands of tiny files efficiently are completely broken. I think the Linux filesystem people have been complete idiots for years for not considering this use case to be worth it. Too many big iron database vendors whispering in their ears apparently.

      I want to be able to use the filesystem to appropriately name and reference my data. I do not want to have to rely on some completely different set of tools to actually see what data I have stored on my filesystem. If that's the case, I'll just use LVM for my 'filesystem' and use something vaguely decent to actually hold my data and use those tools instead of the Unix filesystem tools.

      Now, those applications that are broken because they are written incorrectly should be re-written so they are correct and coincidentally god-awful slow on ext4. Then maybe the designers of ext4 will get a clue and actually write a filesystem instead of a glorified version of LVM with a fancy hierarchical namespace for partitions instead of the flat one LVM has.

    40. Re:Not a bug by caerwyn · · Score: 2, Insightful

      I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.

      No. Writing software properly means calling fsync() if you need a data guarantee.

      Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.

      And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.

      --
      The ringing of the division bell has begun... -PF
    41. Re:Not a bug by caerwyn · · Score: 4, Informative

      You're right. The correct thing to do is to *always* call fsync() when you need a data guarantee, *regardless* of which FS you're on. The fact that not doing it hasn't caused problems in the past is beside the point: those calls are the correct way of handling things.

      --
      The ringing of the division bell has begun... -PF
    42. Re:Not a bug by gnasher719 · · Score: 1, Insightful

      It's very simple. Lots of application software is badly written. When this badly written software uses ext4, Bad Things Happen. As a user you have two choices: Don't use that software, or don't use ext4. To me, the decision is clear. There are other file systems that don't lead to these problems, so any reasonable user will avoid ext4 like the plague.

      So the way I understand these comments, the file system has been written to be very fast by delaying certain operations, and it succeeds, except that in case of a crash your hard drive is in a very undesirable state. Programmers can do something about this, but the consequence is that performance drops down through the floor. So the file system is fast with unsafe applications, and dead slow for safe operations. Nice.

    43. Re:Not a bug by dmiller · · Score: 4, Informative

      You are doing it wrong; permanently failing on recoverable EINTR and EAGAIN errors. See here for how to do it right.

    44. Re:Not a bug by xenocide2 · · Score: 4, Insightful

      UNIX filesystems have used tiny files for years and they've had data loss under certain conditions. My favorite example is the XFS that would journal just enough to give you a consistent filesystem full of binary nulls on power failure. This behavior was even documented in their FAQ with the reply "If it hurts, don't do it."

      Filesystems are a balancing act. If you want high performance, you want write caching to allow the system to flush writes in parallel while you go on computing, or make another overlapping write that could be merged. If you want high data security, you call fsync and the OS does its best possible job to write to disk before returning (modulo hard drives that lie to you). Or you open the damn file with O_SYNC.

      What he's suggesting is that the POSIX API allows either option to programmers, who often don't know there's even a choice to be had. So he recommends that the few people who do know the API inside and out focus on system libraries like libsqllite, and have dumbass programmers use that instead. You and he may not be so far apart, except his solution still allows hard-nosed engineers access to low level syscalls, at the price of shooting their foot off.
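
      The two choices described above (sync every write via O_SYNC, or write through the cache and call fsync when it matters) might be sketched like this. The function names open_synchronous and append_then_flush are hypothetical, and treating a short write as a failure is a simplification made for this sketch.

      ```c
      #include <assert.h>
      #include <fcntl.h>
      #include <unistd.h>

      /* Option 1: with O_SYNC, each write() on the descriptor returns only after
       * the data has been pushed to the device (modulo drives that lie). */
      int open_synchronous(const char *path)
      {
          assert(path != NULL);
          return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
      }

      /* Option 2: let write() go through the cache, then flush explicitly at the
       * point where the data must survive a crash. Returns 0 or -1. */
      int append_then_flush(int fd, const void *buf, size_t len)
      {
          ssize_t r = write(fd, buf, len);
          if (r < 0 || (size_t)r != len)
              return -1;          /* short writes count as failure in this sketch */
          return fsync(fd);       /* blocks until the kernel reports the flush done */
      }
      ```

      Option 1 pays the cost on every write; option 2 lets the application batch several writes and pay once.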

      --
      I Browse at +4 Flamebait

      Open Source Sysadmin

    45. Re:Not a bug by gnasher719 · · Score: 0

      As an application developer, you are expected to know what the API does, in order to use it correctly. What Ext4 is doing is 100% respectful of the spec.

      Respectful my arse.

      So: Use ext4, and stuff tends to disappear. Use something else, and stuff doesn't tend to disappear. It doesn't matter what excuses ext4 has to put the blame somewhere else, fact is that stuff disappears. If people don't want stuff to disappear, they'll have to stay away from ext4.

    46. Re:Not a bug by Kaboom13 · · Score: 2, Informative

      The point of a journal is to allow the file system to return to a defined state in the case the unexpected happens. This keeps the whole file system from being fucked by a crash or sudden data loss. It's better to know you lost some data than to have the filesystem in a state where some data is corrupt but you have no way to tell where or what it is. The situation here is that ext4 has increased the timeframe between commits. This increases performance at the cost of losing more data if a crash happens. Total crashes are pretty rare these days (unless you run some really shitty code) and UPS's are inexpensive. Hell my XP system has Blue Screened once over the last two years, and it was directly related to a beta nvidia driver.

      If your system is likely to crash or lose power, don't use ext4.

    47. Re:Not a bug by QuasiEvil · · Score: 3, Insightful

      In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.

      I couldn't agree more. A filesystem *is* a database, people. It's a sort of hierarchical one, but a database nonetheless.

      It shouldn't care if there's some mini-SQL thing app sitting on top providing another speed hit and layer of complexity or just a bunch of apps making hundreds of f{read|write|open|close|sync}() calls against hundreds of files. Hundreds of files, while cluttered, is very simple and easily debugged/fixed when something gets trashed. Some sort of obfuscated database cannot be fixed with mere vi. (Emacs, maybe, but only because it probably has 17 database repair modules built in, right next to the 87 kitchen sinks that are also included.)

      I do rather agree that it's not a bug. An unclean shutdown is an unclean shutdown, and Ts'o is right - there's not a defined behaviour. Ext4 is better at speed, but less safe in an unstable environment. Ext3 is safer, but less speedy. It's all just trade-offs, folks. Pick one appropriate to your use. (Which is why, when I install Jaunty, I'll be using Ext3.)

    48. Re:Not a bug by OeLeWaPpErKe · · Score: 2, Informative

      You're only partially right. EAGAIN cannot occur unless I asked for it first (and modified my error catching accordingly).

      But you're right about EINTR causing unwarranted disruption. I should ignore that one in the while loop.

    49. Re:Not a bug by lilo_booter · · Score: 1

      Suppose that's fair enough - can't imagine many reasons for having hundreds of tiny config files for a single app by default, but I guess if there's no attempt to consolidate the various components' state, then it could happen and I would agree with you that it's overkill.

      Will take your word on the prevalent use of fsync in VCS's - never looked tbh, just used it as a common example of something which (fairly critically) tends to be made of many, many files... :-).

    50. Re:Not a bug by pyrrhonist · · Score: 1

      I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.

      I've already started implementing your registry idea using Clojure. I've written it so that the registry API uses a GA to breed the best configuration settings. They may not actually be the best settings for you, but they are most definitely the fittest. I don't think we'll need the checksum.

      --
      Show me on the doll where his noodly appendage touched you.
    51. Re:Not a bug by PhilHibbs · · Score: 2, Informative

      It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.

      But you never create and write to a file as a single operation, there's always one function call to create the file and return a handle to it, and then another function call to write the data using the handle. The first operation writes data to the directory, which is itself a file that already exists, the second allocates some space for the file, writes to it, and updates the directory. Having the file system spot what your application is trying to do and reversing the order of the operations would be... tricky.
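
      A minimal sketch of that two-step sequence, in Python because its os module wraps the POSIX calls directly. The helper name and the file used are made up for illustration. It only shows that between the truncating open() and the write() the file is observably zero bytes long, which is the window a crash can land in.

```python
import os
import tempfile


def truncate_then_write(path, new_data):
    """Rewrite `path` in place and report its size at each step.

    Returns (size_after_open, size_after_write). Hypothetical helper,
    used only to illustrate the O_TRUNC window.
    """
    # Step 1: open with O_TRUNC. The old contents are gone as soon as
    # this call returns, before any new data has been written.
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        size_after_open = os.fstat(fd).st_size
        # Step 2: a separate call supplies the new contents.
        os.write(fd, new_data)
        size_after_write = os.fstat(fd).st_size
    finally:
        os.close(fd)
    return size_after_open, size_after_write


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "config")
        with open(p, "wb") as f:
            f.write(b"old settings\n")
        print(truncate_then_write(p, b"new settings\n"))
```

      With delayed allocation the second step can reach the disk much later than the first, so the empty state is the one that survives a crash.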

    52. Re:Not a bug by Anonymous Coward · · Score: 0

      your write may return ETIMEDOUT on a network socket (activate KEEPALIVE and then interrupt the network link whilst idle), even though that's not listed in write(2). you need to read the POSIX specs, not just scroll to the bottom of a man page.
      the code doesn't take into account the sizes of int/size_t either.

    53. Re:Not a bug by Spaham · · Score: 1

      that's what the BeOS filesystem was all about, from what the tech guys told me at the time.
      Sounded really exciting ! (I mean, really)

    54. Re:Not a bug by yttrstein · · Score: 1

      "That's what Unix is about"

      Exactly.

    55. Re:Not a bug by Anonymous Coward · · Score: 0

      I agree.

      I don't know where tytso got the idea that applications should somehow cluster the data they get into a big file on their own. Isn't the whole point of a file system that it does that for you?

      Take any UN*X-like system and check the average file sizes. Most of them will be tiny.

      Strange disconnect from reality...

      As for truncating existing files to "edit" them (KDE seems to do that), that has been a stupid thing to do for 40 years now and it's still stupid. Just create a new file.

      Danny

    56. Re:Not a bug by TerranFury · · Score: 1, Offtopic

      Ack! (I've read through the bug reports more closely, and am alarmed by what I'm reading. TFA made it sound much more innocuous.) I hereby retract my previous posts that pooh-poohed this.

      (Mod this guy up, Informative.)

    57. Re:Not a bug by gweihir · · Score: 2, Insightful

      Indeed. And that is what the suggestion about using a database was all about. You still can use all the tiny files. And there are better options than syncing for reliability. For example, rename the file to backup and then write a new file. The backup will still be there and can be used for automated recovery. Come to think of it, any decent text editor does it that way.

      Truncating critical files without backup is just incredibly bad design.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    58. Re:Not a bug by Bronster · · Score: 1

      Ah slashdot, where any ignoramus can have a strongly held opinion. Shit was written that took advantage of something that was specifically documented NOT TO BE THE RIGHT WAY OF DOING THINGS and just happened to be more safe due to conservative defaults in an earlier system. More modern filesystems came along tuned for more performance (guess what, you probably want that too) and suddenly the assumptions were less right.

      Go mount with -o sync if you're so sure you know the purpose of a filesystem. Slow and safe boys, slow and safe. Enjoy your super reliable computing experience (especially KDE boot with lots of little file writes)

    59. Re:Not a bug by kelnos · · Score: 1

      -1 Ignorant

      --
      Xfce: Lighter than some, heavier than others. Just right.
    60. Re:Not a bug by Dog-Cow · · Score: 1

      He may not know the purpose, but he certainly understands POSIX filesystems. On the other hand, you are completely clueless.

    61. Re:Not a bug by Hurricane78 · · Score: 0

      Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it.

      YES!! That is EXACTLY what I expect every modern file system to do. So stop talking about it in a ridiculing way. Because the only one that is ridiculed, is you.

      A file system should take my data buffer, and after saying "Ok, I got it", *guarantee*, that I can turn off the system in that very moment, without losing data or corrupting the file system in any way.
      If it does not, it failed its purpose, and I could also directly write to a /dev block file.

      All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.

      Letting every programmer re-create such an obviously generalizable functionality is not only spaghetti-style programming, it is also an insult to every real, educated programmer out there.

      Now get off my lawn, or I will put you on thedailywtf.com!

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    62. Re:Not a bug by gweihir · · Score: 2, Informative

      The point of having a rock-solid filesystem is to have a rock-solid filesystem. Any filesystem that crashes and loses data is bad. What is the point of a journal again? To enforce someone's idea of how an API should be coded to, or to reduce data loss?

      ext4 did not crash. Ext4 also did not lose any data it claimed to have gotten to disk. However, unless you want the filesystem slower by a factor of 10x....100x, you have to delay writes. And that means your data is only reliably on disk after an fsync. Any good developer knows that.

      Incidentally, the journal serves to avoid filesystem corruption on crash, nothing else. And no other claim was ever made by the developers.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    63. Re:Not a bug by Bronster · · Score: 3, Insightful

      You're welcome to write lots of little files. It will just be slow if you sync them all, or unsafe if you don't.

      Same way a database will tell you to wrap lots of actions in a single transaction if you don't want the cost of a full commit after each action.

      Except the filesystem API doesn't have any way to say "commit these 500 little files in a single transaction", unfortunately.

      Annoyingly, it also doesn't have "unlink this directory and the files inside it in a single transaction", because unlink performance blows goats.

    64. Re:Not a bug by EkriirkE · · Score: 1

      You're also not updating the data pointer to account for bytes written
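
      The code being criticised isn't shown here, so this is only a sketch of what a write loop that advances past the bytes already written might look like. The write_all name and the injectable write parameter are hypothetical; the parameter is there purely so the loop can be exercised with a fake that does short writes.

```python
import os


def write_all(fd, data, write=os.write):
    """Write every byte of `data` (bytes) to `fd`, tolerating short writes.

    `write` defaults to os.write and is injectable only for testing.
    Returns the total number of bytes written.
    """
    view = memoryview(data)
    offset = 0
    while offset < len(view):
        try:
            # write() may accept fewer bytes than requested.
            written = write(fd, view[offset:])
        except InterruptedError:
            # EINTR: nothing was written, so retry. (Python 3.5+ already
            # retries os.write itself; this covers other writers.)
            continue
        if written == 0:
            raise OSError("write returned 0 bytes")
        # Advance past the bytes that were accepted.
        offset += written
    return offset
```

      Using an offset into a memoryview avoids copying the remaining data on every iteration, and Python's unbounded integers sidestep the int/size_t mismatch a C version has to watch for.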

      --
      from 09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
      to 45 2F 6E 40 3C DF 10 71 4E 41 DF AA 25 7D 31 3F
    65. Re:Not a bug by Dog-Cow · · Score: 1

      The idiocy is in expecting the FS to do something it was never asked to do. There is one way to commit data to disk in Posix systems. That function has existed for well over 20 years. It's probably going on 35 years now, but I don't know my Unix history well enough to be sure.

      The idiocy is in expecting a deterministic system to conform to desires instead of commands.

    66. Re:Not a bug by Dog-Cow · · Score: 1

      Please kill yourself and all your friends (just to be sure).

    67. Re:Not a bug by bluefoxlucid · · Score: 1

      The problem here is that KDE did not use fsync() when writing tons of small files, and the reason is that fsync() is expensive and slows down the system. Couple it with crazy people writing hundreds of small files at once and you get why they don't like fsync.

      So, if KDE sync's the files on write, it slows the system down. The solution is to have the file system sync the files on write automatically, so that KDE isn't the slow component, and can blame the kernel for being slow instead of for losing data?

    68. Re:Not a bug by bluefoxlucid · · Score: 2, Funny

      No, we have journals so the file system doesn't get a gaping hole in it that starts cross-linking shit and damaging more files after the initial data loss, and then implode and fuck your mom.

    69. Re:Not a bug by Dog-Cow · · Score: 1

      If performance drops through the floor, the application is poorly designed. Opening/rewriting/closing/syncing hundreds of files (no matter the size) is going to be slow. Period. It doesn't matter which filesystem is in use. The only difference is that ext3 would sync automatically after a few seconds, so the app got a free ride. But that's an implementation detail and should NEVER be relied upon.

      One of the key benefits of a well-defined API is implementation independence. The app developers screwed up.

    70. Re:Not a bug by dedazo · · Score: 1

      The idiocy is in expecting the FS to do something it was never asked to do.

      The idiocy is really breaking existing applications that used to work, simply because you're following a spec. As usual the whole point of computing is lost on many people here. Users expect their data to be safe, and they use applications, not file systems. If some distro "upgrades" me to Ext4 and I start losing data because the power goes out (for example, I live in Latin America where the grid sucks), do you think I'm going to shrug and praise the lord because they cleverly decided to ship it with a default that better expresses the super goodness of POSIX, or just be pissed that "the computer" is failing where it used to succeed?

      If KDE used to work fine under Ext3 and is suddenly broken under Ext4, then as far as I'm concerned that's a problem with Ext4, and you can be sure I'm never going to switch to it, no matter how much POSIX bliss it happens to embody or how many howling monkeys are telling me that this is "the way it's supposed to be".

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    71. Re:Not a bug by drew · · Score: 2, Insightful

      The whole bit you quoted about SQLite was about optimization, not correctness.

      the KDE and Gnome developers would be OK using the current file structure to save data so long as they had bothered to call fsync().

      What emacs (and very sophisticated, careful application writers) will do is this:

      3.a) open and read file ~/.kde/foo/bar/baz
      3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
      3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
      3.d) fsync(fd) --- and check the error return from the fsync
      3.e) close(fd)
      3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
      3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

      The fact that series (1) and (2) works at all is an accident. Ext3 in its default configuration happens to have the property that 5 seconds after (1) and (2) completes, the data is safely on disk. (3) is the ***only*** thing which is guaranteed not to lose data. For example, if you are using laptop mode, the 5 seconds is extended to 30 seconds.

      The problem is that the KDE developers were skipping step "d", presumably because they felt it slowed down the application too much. Fortunately(?) for them, with ext3 in its default configuration, it happened to not matter too much that they were skipping an important step.
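
      For what it's worth, steps 3.b to 3.g translate almost one-to-one into Python, whose os module exposes the same calls. This is a sketch rather than anything KDE or emacs actually ships; the atomic_replace name and the keep_backup flag are made up, and it assumes a POSIX system.

```python
import os


def atomic_replace(path, data, keep_backup=False):
    """Replace `path` with `data` (bytes) via write-new, fsync, rename."""
    tmp = path + ".new"
    # 3.b) create the new file next to the old one
    fd = os.open(tmp, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o644)
    try:
        # 3.c) write the new contents, looping in case of short writes
        view = memoryview(data)
        while view:
            view = view[os.write(fd, view):]
        # 3.d) force the data to disk; os.fsync raises OSError on failure,
        # which is the "check the error return" part
        os.fsync(fd)
    finally:
        # 3.e) close the descriptor
        os.close(fd)
    if keep_backup and os.path.exists(path):
        # 3.f) optional: keep the previous version as path~
        os.rename(path, path + "~")
    # 3.g) move the new file into place in one step
    os.rename(tmp, path)
```

      If the write or fsync fails, the old file is left untouched and a stray .new file stays behind. Some careful applications also fsync the containing directory afterwards so the rename itself is durable; that goes beyond the steps quoted above.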

      The part you quoted was merely discussing a potential way to store lots of isolated bits of data without the overhead of calling fsync() constantly.

      --
      If I don't put anything here, will anyone recognize me anymore?
    72. Re:Not a bug by gweihir · · Score: 3, Insightful

      The idiocy is in expecting the FS to do something it was never asked to do. There is one way to commit data to disk in Posix systems. That function has existed for well over 20 years. It's probably going on 35 years now, but I don't know my Unix history well enough to be sure.

      I think the problem is with more and more people believing themselves to be good programmers, when they really do not understand what they are doing. Truncating and then writing critical files is a very bad idea to begin with. The way you do it is to rename the old file to backup and write to a new file. Also have a procedure in place to recover from backup if the main file is broken. Maybe even do checksums on the main file. In addition, only write if you have to. That is robust design, not the amateur-level truncate the KDE folks seem to be doing routinely.
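
      A rough sketch of the recovery half of that, assuming a made-up on-disk layout where the first line of each file is a SHA-256 hex digest of the rest. The pack, unpack and load_with_recovery names and the "~" backup suffix are illustrative, not anyone's real format.

```python
import hashlib


def pack(payload):
    """Prefix `payload` (bytes) with a line holding its SHA-256 hex digest."""
    return hashlib.sha256(payload).hexdigest().encode("ascii") + b"\n" + payload


def unpack(blob):
    """Return the payload if its checksum matches, else None."""
    digest, sep, payload = blob.partition(b"\n")
    if not sep:
        # No checksum line at all, e.g. a zero-length file.
        return None
    if hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        return None
    return payload


def load_with_recovery(path, backup_suffix="~"):
    """Read `path`; fall back to the backup copy if the main file is
    missing, empty, or fails its checksum. Returns None if both fail."""
    for candidate in (path, path + backup_suffix):
        try:
            with open(candidate, "rb") as f:
                payload = unpack(f.read())
        except FileNotFoundError:
            continue
        if payload is not None:
            return payload
    return None
```

      A zero-length main file, which is what the bug report describes, fails the check and the reader quietly falls back to the backup.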

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    73. Re:Not a bug by harry666t · · Score: 1

      > Why not just keep ALL your files in an SQL database and cut out the filesystem entirely?

      I've had a similar idea for a while. FUSE + MySQL as a backend. Hell, I'm too lazy to research this.

    74. Re:Not a bug by davolfman · · Score: 1

      Not writing software properly on all fronts to my mind.

      A program writing a crapload of small files is just asking for major performance problems. KDE should not have been written like that.

      A minute or more is an INSANE amount of time to delay a write, in failure it just makes things that much worse. The writers of EXT4 should have been giving this serious thought.

      And the highest heresy: The 'nix attitude of "Let's just put it in a little text file" for every stinkin thing is a great deal to blame for both of these. If almost all configuration wasn't in dozens of little text files the startup process wouldn't be anywhere near that complex for KDE. If programs didn't go and write tons of teeny-tiny text files all the time in 'nix systems there wouldn't have been such a drastic speedup to motivate the EXT4 developers to make such a large buffer.

    75. Re:Not a bug by icebike · · Score: 1

      >The only reason this has become a big blow-up issue with ext4 is that while other filesystems
      >generally would sync the data shortly anyways, ext4 does not. Everyone has been relying on bad
      >assumptions about filesystem behavior and getting by on the fact that "usually" the situation was
      >resolved "somewhat quickly".

      If everyone has been relying on the FS to work a certain way, why should you be surprised when everyone objects when it no longer functions that way?

      Further, why should you defend the introduction of that much more risk (3000%) into a system that has always trumpeted reliability over speed?

      The speed gained will be minimal. Saving 5 seconds of writes up is dramatically better than saving 150 seconds up.

      Absolutely no gain will be achieved by this change. All the speed gains claimed for EXT4 were based on systems designed for EXT3, and as soon as every programmer runs out and puts in sync calls everywhere any speed gains in EXT4 will disappear.

      Further, disk activity will increase, as sync requests come willy-nilly from every application running on the box.

      It is utterly stupid to foist what everyone expects to be a low-level function of a FS onto end-user software. (I've tried to avoid inflammatory words, and Stupid is the nicest I could come up with.)

      --
      Sig Battery depleted. Reverting to safe mode.
    76. Re:Not a bug by Anonymous Coward · · Score: 0

      Then mount your drive with the sync option, or did you want all the performance enhancements as well?

    77. Re:Not a bug by Anonymous Coward · · Score: 0

      The idiocy is in expecting the FS to do something it was never asked to do.

      The idiocy is really breaking existing applications that used to work, simply because you're following a spec. ...

      The idiocy is conflating "used to work" with "correct".

      Because of that flaw in your "reasoning", everything else you posted is garbage.

    78. Re:Not a bug by batkiwi · · Score: 1

      NTFS has this issue. Next?

    79. Re:Not a bug by vadim_t · · Score: 1

      If you want that guarantee, use the fsync call, which is there precisely for that reason.

      Though it doesn't necessarily guarantee it will get on the disk platter anyway, since some hard disks lie and use a write cache even if you ask them not to. Probably because it looks better on benchmarks.

    80. Re:Not a bug by raynet · · Score: 1

      I would assume that EXT4 actually is very good with thousands of tiny files as it doesn't have to hit the disks after 5 secs. And about the god-awful slow rewritten apps, you don't have to call sync after every write, every 5 secs after writing some data would be good enough to match EXT3 behaviour.

      --
      - Raynet --> .
    81. Re:Not a bug by Ed+Avis · · Score: 3, Informative

      YES!! That is EXACTLY what I expect every modern file system to do.

      Your expectation is quite reasonable. When the application writes something to disk, it should be there on disk, right? The way the article is presented makes it sound like a horrible bug in ext4 that it doesn't do this. But believe it or not, almost no filesystem provides this guarantee by default. ext3 doesn't (in the default mode), nor does ext2, nor a typical implementation of FAT or NTFS or the Minix filesystem or whatever.

      For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later, giving better disk performance at the expense of losing data if there is a crash. Losing file data is bad but losing metadata is even worse, since corrupt filesystem metadata can trash the contents of many files and requires a lengthy fsck on startup. So journalling filesystems, as typically configured, keep a journal for metadata so it's not corrupted even if the power gets cut at the most inconvenient moment. But they don't extend the same care to file contents, because it would be too slow. You can enable it by setting the data=journal parameter in ext3 (and I guess ext4 too) but this isn't the default.

      It is certainly a bit unfair that the filesystem takes such pains with its own bookkeeping information but doesn't bother to be so careful about user data. But as I said, it's a known tradeoff to get better performance. If you want to be sure your file has reached disk you need to fsync(). This sucks, but it's the Unix way, and has been so for like, forever. So it's not a bug in ext4 - just bad luck and perhaps a misunderstanding between kernel and userspace about what guarantees the filesystem provides.

      As SSDs replace rotating storage, there is less need to buffer writes (certainly the need to minimize seek time goes away, and that's the biggest reason), so we might see this whole situation resolved within a few years. Perhaps in 2015, when the system call returns, you can be sure that the data is written. Until that longed-for day, bear in mind that your filesystem is permitted to temporarily lie to you about what has been written, and call fsync() if you are paranoid.

      --
      -- Ed Avis ed@membled.com
    82. Re:Not a bug by Anonymous Coward · · Score: 0

      Hey fuckwad, man 2 fsync. Now go check your shutdown scripts and see if one of them calls sync.

    83. Re:Not a bug by Anonymous Coward · · Score: 0

      It's easy to just call the parent an idiot; however, unless you're going to go around and fix thousands of apps with this problem there's not much that can be done.

      What are you going to do in this situation? Not run EXT4 because it fucks up your system. So the parent is right. You can be as idealistic about it as you want but the realists here are the ones who are correct.

    84. Re:Not a bug by Anonymous Coward · · Score: 0

      Oh, this really should be used for /etc/ as well. Doing so would surely guarantee flawless upgrades.

    85. Re:Not a bug by dedazo · · Score: 1

      Like any other OS, Windows also buffers. But the buffering on NTFS is far less aggressive or done differently than Ext4, apparently.

      The other small difference between Ext3 and NTFS is the fact that the FlushFileBuffers API flushes the data for a specific file, whereas Ext3 with the ordered data config on (the default as far as I can see) flushes all the data in the cache. For all the applications on the box. That's brilliant! It's probably why the KDE devs didn't even want to call fsync() to begin with. I think that was a huge problem with the performance of Firefox's new SQLite subsystem on Linux as well.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    86. Re:Not a bug by cbrocious · · Score: 1

      There's no reason you can't have your cake and eat it too. We have virtual filesystems like proc, why not a 'registry' filesystem? People will still be able to modify config files to their heart's content, with all the benefits of a 'registry'.

      --
      Disconnect and self-destruct, one bullet at a time.
    87. Re:Not a bug by Anonymous Coward · · Score: 0

      "this guy"? Do you actually know who he is? And what's he done? slashdot tards....

    88. Re:Not a bug by Qzukk · · Score: 4, Informative

      A file system should take my data buffer, and after saying "Ok, I got it"

      There's your problem, you didn't even bother to ask if it got it, you just threw a ton of data into the file descriptor and closed it, now didn't you. And you want me on thedailywtf?

      But let's back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it".

      All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.

      The filesystem does exactly what you tell it to do. If you don't want it to make a zero byte file, then DON'T USE O_TRUNC OR *truncate() TO EMPTY YOUR FILE. Make a new file, fill it up, rename it over the other file. Don't assume that in just a few instructions, you're going to be filling it back up with new data, because those instructions may never arrive.

      You don't like it? Try and convince people that (open file, erase all the data in it, do some stuff, write some data, do some more stuff, write some more data, write data to disk, close file) should be an uninterruptable atomic operation. You want a versioning filesystem? Take your pick.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    89. Re:Not a bug by Anonymous Coward · · Score: 0

      Yes, heaven forbid that loads of indifferently able programmers (who may happen to be brilliant at something else) decide to link their enthusiasm for programming with their expertise in other areas, thus producing software that lots of subject specialists just LOVE to use.

      That would be a disaster. Because they probably didn't read the bit in the manual that says 'btw. fsync() is like, really important. Because the filesystem behaviour could change whenever we get around to changing it.'

      So they should just fuck right off, those gosh-darn amateur enthusiasts. They think they're so special. But they just don't understand. All those people who don't understand why an encyclopaedic knowledge of the full glory that is the POSIX spec is vastly more important than just wanting to write cool software should fuck off, too. Right?

      </sarcasm>

      In all seriousness, design almost always represents a compromise, so how about finding a functional compromise between the 'POSIX says it's OK so screw you and your zero-length files' crowd, and the 'wtf??' crowd? Also, keeping the snobbery to a dull roar might make this whole discussion a good deal more pleasant.

    90. Re:Not a bug by Anonymous Coward · · Score: 0

      Duh, that's what fsync does! If you didn't bother calling it (and CHECKING the result), then you *didn't* ask it if it "got it" at all, and its loss is your fault.

    91. Re:Not a bug by shutdown+-p+now · · Score: 1

      While it's all well and good, the comments you link to are rather trollish in places:

      "... certain crappy applications that apparently write huge numbers of small files in users' home directories. This appears to be the case for both GNOME and KDE."

      "This solves the most common cases where some crappy desktop framework is constantly rewriting large number of files in ~/.gnome or ~/.kde"

      "if sqllite had been properly written ... "

    92. Re:Not a bug by shutdown+-p+now · · Score: 1

      Why is this modded Funny? It is indeed the obvious conclusion from that statement: a "small database" is precisely what Windows registry is.

    93. Re:Not a bug by shutdown+-p+now · · Score: 2, Informative

      The problem with that is that you have to use fsync() for each and every file descriptor you have, and for lots of small files, this is very slow (because if you're syncing after every 10-byte write, you might as well have no caching). What's needed is a way to write those files in a batch, close them all, and then say "now sync all of that".
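
      The closest approximation available with plain POSIX calls is probably to issue all the writes first and only then fsync each descriptor, so the kernel gets to see the whole batch before anyone blocks. A sketch, assuming a POSIX system; write_batch and its arguments are made-up names, and this is emphatically not a transaction.

```python
import os


def write_batch(directory, files):
    """Write several small files, then sync them in one pass.

    `files` maps file name -> bytes. Not atomic: a crash midway can
    leave some files updated and others not. Returns the file count.
    """
    fds = []
    try:
        # Pass 1: queue all the writes without waiting for the disk.
        for name, data in files.items():
            fd = os.open(os.path.join(directory, name),
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append(fd)
            view = memoryview(data)
            while view:
                view = view[os.write(fd, view):]
        # Pass 2: only now block on durability, once per descriptor.
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)
    # Also sync the directory so newly created names are durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return len(fds)
```

      This keeps every descriptor open until the end, so a really large batch would run into the per-process descriptor limit. The blunt alternative is os.sync(), which asks the kernel to flush everything system-wide, not just your files.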

      To the best of my knowledge, though, Windows has the same problem - its fsync analog, FlushFileBuffers, also applies to a single file handle only (you can flush all writes for the volume, but only if you're an admin).

    94. Re:Not a bug by dbIII · · Score: 4, Insightful

      Linux reinvents windows registry?

      It's called "gconf", and it's worse than that. It's no longer abandonware lurking at the heart of gnome but it's still a nightmare.

    95. Re:Not a bug by Anonymous Coward · · Score: 1, Insightful

      To be fair, the idea of KDE using a consolidated database is quite different than the idea of every single program on the system using the same consolidated database.

    96. Re:Not a bug by shutdown+-p+now · · Score: 1

      Even so, you were almost correct.

      The problem isn't that ext4 doesn't handle small files - it does. The problem with lots of small files is fsync() itself. You have to call it once for each file you change, and it is required to block you until the changes are physically stored on the drive. In practice, you usually do not want that - you want to write a lot of those files, and then, once you're done with the last one, you want to flush all the buffers then, just once. Thus, in practice, when you have to do fsync() to guarantee the safety of the data, you effectively cannot use a large database consisting of lots of very small files, as e.g. GConf does.

    97. Re:Not a bug by shutdown+-p+now · · Score: 1

      You'll have to excuse me for chuckling a bit here, but if NTFS or the filesystem for OS X (whatever that is) had this problem and someone suggested that it's an "application problem" they'd be stoned to death.

      As a Win32 developer, I can tell you that NTFS and Win32 APIs behave in precisely the same way. If you call WriteFile and then immediately do a hard reset, the data won't be written to disk because it's buffered by default. You have to call FlushFileBuffers if you want changes to be definitely persisted to the media. Alternatively, you can open a file in non-buffering mode - using FILE_FLAG_NO_BUFFERING in a call to CreateFile - but that's strictly opt-in for performance reasons. If I remember correctly, CloseHandle doesn't flush the buffers either.

      The last OS that I saw which did uncached writes by default was DOS without SmartDrive (and it's no surprise that on all preconfigured machines SmartDrive was always enabled by default).

    98. Re:Not a bug by shutdown+-p+now · · Score: 1

      Except the filesystem API doesn't have any way to say "commit these 500 little files in a single transaction", unfortunately.

      Depends on the filesystem and the API (though I must admit that I'm not sure whether committing KTM transaction will fully flush the file buffers).

    99. Re:Not a bug by shutdown+-p+now · · Score: 1

      If some distro "upgrades" me to Ext4 and I start losing data because the power goes out (for example, I live in Latin America where the grid sucks), do you think I'm going to shrug and praise the lord because they cleverly decided to ship it with a default that better expresses the super goodness of POSIX, or just be pissed that "the computer" is failing where it used to succeed?

      The behavior of extFS driver did not change with respect to POSIX. What changed was the default interval before cache flush. It used to be that you lost the last 5 seconds in case of a power failure, now you can lose a minute or more. Either way, the potential data loss is there already, regardless of which extFS version you use, because KDE folks didn't properly code against POSIX API.

    100. Re:Not a bug by ozphx · · Score: 1

      It's very similar on NTFS. You have to call fflush(...) on your stream (or fflush(NULL)), or have opened the stream synchronously, or use setvbuf.

      That said the low level routines (_write) are synchronous, which makes more sense.

      On this laptop I just turn on the option for battery backed hard disks (Enable Advanced Perf) - which turns the write-thru cache into write-back. All the apps that think they are doing synchronous IO are hitting RAM, and compilation/SVN goes like a motherfucker even on this shitty Hitachi disk. Course if I really have a crash I'm going to be totally hosed :D

      --
      3laws: No freebies, no backsies, GTFO.
    101. Re:Not a bug by xlotlu · · Score: 1

      Ts'o is right: DEs are braindead. I'm a total KDE fanboi, but the way they're managing config files is just asking for trouble.

      And IMO the ext3 5 seconds flush default is plain dumb. As an example of what I'm prepared to put up with, here's an excerpt from my fstab:
      /dev/sda2 / xfs allocsize=128k,logbufs=4
      /dev/sda5 /var ext3 data=writeback,commit=60
      /dev/sda6 /home ext3 commit=30

      And from sysctl.conf:
      vm.dirty_background_ratio = 20
      vm.dirty_ratio = 50
      vm.dirty_writeback_centisecs = 6000
      vm.dirty_expire_centisecs = 12000

      What I don't agree with is his suggestion for a sqlite-like database. No thank you. I want my config files easily readable and editable, not windows registry (or gconf for that matter). All thousands of them.

      The problem lies with how all those files are littered all across the $HOME dir. I want a single ~/.config directory that I could mount ext2 sync, or ext3 data=journal,commit=1. Whoever had the bright idea that each app/suite should write its conf in a separate dot file/folder under $HOME should be taken out back and shot.

    102. Re:Not a bug by GXTi · · Score: 2, Informative

      and after saying "Ok, I got it", *guarantee*, that I can turn off the system in that very moment, without losing data or corrupting the file system in any way.

      Which is precisely what fsync does, and is precisely what these developers didn't use. The filesystem knows better than you do how to get all the data it has to write onto the platters as fast as possible so if you need something specific like "it's important that this data get written now, so I'll wait for you to finish", you have to ask. Otherwise your apps would run a great deal slower since every little write (even a single byte!) would have to wait for the OS to say "OK, it's on disk". And if you really want that, there are flags you can use, e.g. O_SYNC. But you don't.
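      As a rough sketch of the "ask, then wait" pattern described above: the helper name append_durably is made up for illustration, and only open(), write(), fsync() and close() are real POSIX calls. Appending a record and blocking until the kernel reports it flushed might look like this:

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to path, then block until fsync()
 * reports that the kernel has pushed them to the storage device.
 * Returns 0 on success, -1 on error. */
int append_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Without this call the data may sit in the page cache for a while. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

      Opening the file with O_SYNC instead would make every write() wait in the same way, which is the slower behaviour the comment warns about.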

    103. Re:Not a bug by swillden · · Score: 1

      If everyone has been relying on the FS to work a certain way, why should you be surprised when everyone objects when it no longer functions that way?

      Because "everyone" wasn't relying on the FS to work in that undocumented way. A few people were, and they were relying on incorrect assumptions.

      Further, why should you defend the introduction of that much more risk (3000%) into a system that has always trumpeted reliability over speed?

      If it convinces app developers to read and understand 'man 2 write', and to apply fsync() appropriately, then this change will *increase* reliability. Under ext3 there was also the possibility of losing data, it just lasted for a shorter period of time.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    104. Re:Not a bug by Jurily · · Score: 1

      Thus, in practice, when you have to do fsync() to guarantee the safety of the data, you effectively cannot use a large database consisting of a lots of very small files, as e.g. GConf does.

      Close, but no cigar. The data we need safe is the one already on the disk: if you don't flush, you get to keep the old version already on the disk. This is what we're used to, and this is what we want. The problem with ext4: if you don't write the new version, you lose BOTH. What you get instead is an empty file, and hope it wasn't important.

      Bogdan Gribincea wrote on 2009-01-22: (permalink)

      It happened again. Somehow when trying to logout KDM crashed. After rebooting I had some zeroed config files in a few KDE apps, log files (pidgin)..
      I converted / and /home back to EXT3. This is extremely annoying, reminds me of Windows 9x

      Think about it.

    105. Re:Not a bug by shutdown+-p+now · · Score: 3, Insightful

      Close, but no cigar. The data we need safe is the one already on the disk: if you don't flush, you get to keep the old version already on the disk.

      That's an interesting interpretation of fsync(), but, unfortunately, one that's not supported by the POSIX spec. Nowhere does it say that the system cannot flush the data that you've already written so far without an explicit fsync() call. If you're unlucky enough that this happened after you've truncated the file, but before you wrote anything into it - well, too bad. As I understand it, ext3 could also exhibit this behavior, it was simply harder to reproduce because the implicit flushes were much more frequent.

      Anyway, this post seems to explain what's actually going on there in the (very specific) case of KDE.

    106. Re:Not a bug by Kjella · · Score: 1

      ext4 did not crash. Ext4 also did not lose any data it claimed to have gotten to disk.

      Except the data that was in the file before. Sure, technically it's POSIX compliant because you ask it to truncate the file when you open it then write to the file and there's nothing in the standard that forbids you from executing the truncate immediately and delaying the write forever, unless fsync() is called. His "solutions" to this are major clusterfucks, both the database solution and the constant fsyncs to temp files.

      What's needed is atomic file replacements, because the major point here is not instant sync but rather keeping one good version. In practical programming it's often much easier to read the whole file into memory, manipulate it and write it back out than trying to rewrite it on disk. For that, there should be a "O_REPLACE" flag, which says to replace the original file contents when the data has been written and not before.

      1) fd = open("~/.kde/foo/bar/baz", O_REPLACE)
      2) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
      3) close(fd)
      *** TIME PASSES ***
      4) write delayed allocation to disk using "temp" inodes not yet mapped to a file
      5) when write is done, atomically point inode to new file instead of old file

      Alternatively, if the computer crashes hard before steps 4) and 5) the file is NOT truncated or modified in any way.

      Pros:
      - No fsyncs = Maximum performance
      - No truncated-but-not-written files = No total meltdowns
      Cons:
      - Must create new open() flag
      - Need free space for temp copy
      - Can still lose data up to 60 seconds or whatever after write

      Sure this isn't perfect since the calling code never checks if things worked out, it doesn't cover multiple file changes that should be all-or-nothing but we can gladly assume the calling code sucks and that there are ways to report errors to code that really wants to confirm it's on disk.

      --
      Live today, because you never know what tomorrow brings
    107. Re:Not a bug by Scott+Wood · · Score: 1

      Maybe if the elaborate sequence required to do a reliable read/write were put in a standard library, rather than expecting it to be correctly open-coded by every bit of application code, it would be used a bit more often.

    108. Re:Not a bug by caseih · · Score: 1

      You need to actually read the bug report and the FA before you comment. It's not that open files become zero bytes merely because they are open. It's that badly behaved apps open an existing file with the O_TRUNC option (that's where the zero bytes come from), then write to it, and then close the file (or not). If power is lost before the write is synced to disk, then it's lost and the file is zero bytes. This is clearly bad programming anyway, because any sane programmer should know that it's better to make a new file and then, after checking for OS errors, rename it back over the original file if no error occurred. I should think this would be obvious to any programmer. I mean even if I didn't lose power, if something went wrong during the write (some OS error), then I've lost my original file and replaced it with garbage. This doesn't seem like a good practice to me.

      On another note, I don't believe ZFS has this problem because if I opened a file and truncated it, I believe we get new blocks being allocated. After the write is committed, the old blocks are freed. If a power failure happens before the metadata is written, then the newly written blocks are still unallocated technically, and the old blocks still in place.

    109. Re:Not a bug by shutdown+-p+now · · Score: 1

      ... you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.

      That's the moment logic breaks down. Why is it unreasonable to expect the OS and the FS to handle this seemingly common case of writing lots of small files, by introducing a new "fsync all" API call if necessary?

    110. Re:Not a bug by Bronster · · Score: 1

      Let me know when you get that working in ext4 ;)

      (seriously, I should have said "the POSIX filesystem API" to cover myself from pedants. Oh well)

    111. Re:Not a bug by snemarch · · Score: 1

      "dotfiles - because every application having its own configuration syntax is such a joy" :-)

      --
      Coffee-driven development.
    112. Re:Not a bug by Anonymous Coward · · Score: 0

      Whether it's a bug or not. It's a poor design. The ext4 implementation pretty much says that all applications must call fsync themselves. Where does that get you?

      Now the applications are beating the snot out of the file system with fsync calls. That sounds like a step backward to me

    113. Re:Not a bug by snemarch · · Score: 1

      It's actually a pretty damn good thing this has happened, since it will allow the KDE developers to go over their code and *FIX THEIR BUGS*, in turn making it more stable even on EXT3.

      --
      Coffee-driven development.
    114. Re:Not a bug by snemarch · · Score: 1

      Afaik, _write() is simply synonymous with write() on *u*x - which means you have to do FlushFileBuffers() on the handle in order to do the sync.

      Btw, "enable advanced performance" is a really really really bad idea - it turns FlushFileBuffers() into a null call. You *do* get filesystem writeback cache and disk write-back cache without this option...

      --
      Coffee-driven development.
    115. Re:Not a bug by sjames · · Score: 1

      The solution presented is an ideal solution for performance reasons.

      That doesn't mean you can't take the easier but lower performing option of using temporary names and fsync.

      KDE wants the security of transactions on a non-transactional storage (a filesystem). The good news is that it is possible, but it has to take a few steps to make it happen.

      You do NOT want the entire filesystem to be made synchronous. The performance would be horrible.

    116. Re:Not a bug by ozphx · · Score: 1

      Hrmmm I was fairly certain that I read that (excluding buffers) the default is write-through. This means that a flush will actually hit metal, with all the latency you'd expect. You are right about AP on making flush useless.

      Regardless - it's a shitload faster when dealing with bunches of small files with it turned on. This disk tops out at like 20megs a second - to compile at any decent speed I need it on. As for reliability - well, it's battery backed and stable, been going fine for a year at least :)

      --
      3laws: No freebies, no backsies, GTFO.
    117. Re:Not a bug by snemarch · · Score: 1

      Well, the long story: _write doesn't use an interim buffer like fwrite() does, so it hits WriteFile instantaneously (for files in binary mode - code for text/utf files != nice). Doesn't call FlushFileBuffers().

      However, the OS still does caching for WriteFile(), unless FILE_FLAG_NO_BUFFERING was specified on CreateFile() - which isn't a standard thing to do (big performance hit, write buffers have to be sector-aligned (in memory too!), et cetera).

      fflush() does call FlushFileBuffers() :) - ms libc version of fsync() is called _commit().

      As for "advanced performance" helping, imho it just isn't worth it. Disabling NTFS last-access timestamp + %TEMP% on ramdrive == win.

      --
      Coffee-driven development.
    118. Re:Not a bug by Anonymous Coward · · Score: 1, Insightful

      You need to actually read the bug report and the FA before you comment. The problem isn't that the first operation truncates the file and then the later operations never make it to the disk. The problem is that the metadata operations make it to the disk but the data operations, even though they came first, don't. That's why writing to a new file and renaming it to replace the old file is not sufficient. You have to fsync() before you rename the file to ensure that the data is actually there. Otherwise a crash might occur and you end up without data because the new file (with zero length) replaced your old file.
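      The ordering described above (write the new file, fsync() it, and only then rename it over the old one) can be sketched as follows. This is a minimal illustration under stated assumptions, not what KDE or any library actually ships: the helper name replace_file and the fixed ".tmp" suffix are made up, and real code would more likely use mkstemp() for the temporary name.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path with buf so that after
 * a crash the file holds either the old or the new contents.
 * Returns 0 on success, -1 on error. */
int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);   /* assumed naming scheme */
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }

    /* The data must be on disk BEFORE the rename becomes visible. */
    if (fsync(fd) != 0)
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {    /* atomically swaps the directory entry */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

      To make the rename itself survive a crash, one would additionally fsync() the containing directory; that step is left out here to keep the sketch short.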

    119. Re:Not a bug by scotch · · Score: 1

      It's easy to just call the parent an idiot however unless you're going to go around and fix thousands of apps with this problem there's not much that can be done.

      Those apps are broken no matter what you do. If having a lower risk of that brokenness manifesting in filesystem X versus filesystem Y makes you happy, then you are a fool.

      --
      XML causes global warming.
    120. Re:Not a bug by scotch · · Score: 1

      The idiocy is really breaking existing applications that used to work...

      I think you meant "applications that were already broken but perhaps had a smaller window of failure". If having a small window of failure is ok with you, then sure, let's not fix those broken applications and continue to pine for unrealistic filesystem semantics that for some reason filesystem developers over the last 40 years have been too stupid to come up with.

      --
      XML causes global warming.
    121. Re:Not a bug by ozphx · · Score: 1

      *nods*, informative. I'm mostly up in .Net-land, but unlike most of my brethren I think it's necessary to understand what's going on under the hood :)

      As for "advanced performance" helping, imho it just isn't worth it. Disabling NTFS last-access timestamp + %TEMP% on ramdrive == win.

      I'll probably be agreeing with you after the first time everything goes to shit ;)

      --
      3laws: No freebies, no backsies, GTFO.
    122. Re:Not a bug by Anonymous Coward · · Score: 0

      Yeah, why have a single unified database when you can have multiple incompatible databases that all store the same thing. You can store them in different formats and even have different APIs for them. Ideally they would even have different schemas so there would be no way to automatically convert an app to use a different config DB.

      dom

    123. Re:Not a bug by Anonymous Coward · · Score: 0

      "fflush() does call FlushFileBuffers()"

      But it does not call _commit() (FlushFileBuffers) for the default case presented below. Only if "c" is added to the open mode which I presume is MS specific.

      None of the following functions call _commit() (FlushFileBuffers):

      FILE * fp = fopen("d:\\aaa.txt", "at");
      fprintf(fp, "aaa");
      fflush(fp);
      fclose(fp);
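
      The snippet above is Windows-specific, but stdio on POSIX systems has the same two layers: fflush() only moves data from the FILE buffer into the kernel, and a separate fsync() on the underlying descriptor is needed to push it to the device. A minimal sketch (flush_to_disk is a made-up name, not a library function):

```c
#define _POSIX_C_SOURCE 200809L  /* needed for fileno() under strict -std= modes */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: flush a stdio stream all the way to storage.
 * Returns 0 on success, -1 if either step fails. */
int flush_to_disk(FILE *fp)
{
    if (fflush(fp) != 0)          /* user-space stdio buffer -> kernel page cache */
        return -1;
    return fsync(fileno(fp));     /* kernel page cache -> storage device */
}
```

      Calling fclose() alone performs only the first of these two steps.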

    124. Re:Not a bug by Anonymous Coward · · Score: 0

      A file system should take my data buffer, and after saying "Ok, I got it", *guarantee*, that I can turn off the system in that very moment, without losing data or corrupting the file system in any way.
      If it does not, it failed its purpose, and I could also directly write to a /dev block file.

      So, you would give up the speed to gain trust. Ok

    125. Re:Not a bug by Anonymous Coward · · Score: 0

      Linux reinvents windows registry???

      Seems that you do not know anything about Linux.
      The Linux kernel is the operating system.
      The Gnome or KDE are desktop environments for *nix operating systems, like Linux.
      And if Gnome uses an XML-format registry, it does not mean at all that Linux would use such a registry.

      This is like saying that Linux uses MySQL to store file metadata just because you have an application that uses MySQL to store file metadata.

      Don't be silly, and you are even modded as "Informative +5". Hah.

    126. Re:Not a bug by cryptoluddite · · Score: 1

      But you never create and write to a file as a single operation, ... Having the file system spot what your application is trying to do and reversing the order of the operations would be... tricky

      Generally in these cases you do:

      1) create a new file
      2) write contents of new file
      3) close it
      4) mv file onto original file

      -or-

      1) open existing file
      2) trunc(0) it
      3) write new data
      4) close

      In the first (better) approach the FS only has to do something special at step 4... it only has to finish writing the data for the source file before saving the rename.

      In the second, the FS only has to associate the trunc with the data to some extent. For instance, the trunc shouldn't be written to the log if the data is going to stay around for 120 seconds... the trunc can be written to the log when the data actually starts being saved to disk. But certainly if the whole data + close is done before anything needs to be written the FS can do something intelligent to avoid data loss.

      So I don't really see why it's so hard as you are saying it is.

    127. Re:Not a bug by cryptoluddite · · Score: 1

      That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)

      That's because read/write to disks is considered a 'quick' operation, so it never returns with a partial write. And they are never interrupted because they are restartable and automatically restart.

      The code you gave is necessary for pipes and sockets, not for files.

    128. Re:Not a bug by Anonymous Coward · · Score: 0

      It could only be worse if they allocated space, linked the space into place and then placed the data into it.
      That would make data thieves' life all that happier.
      Maybe ext5 will have this feature.

    129. Re:Not a bug by cryptoluddite · · Score: 1

      Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.

      Name one scenario where fsync should be necessary other than:

      1) to give a third party assurance the data is on disk, for instance so the user can run 'sync' then hit the power strip without shutting down (assuming they also wait N seconds for the actual drive to save the data).
      2) rewriting data back to an existing file... like for instance something a database might do.

      Are there any other cases where fsync should be necessary? fsync is like a crypto program zero'ing the memory where your password was stored... it's a good thing, but shouldn't be required in order to have the FS work correctly.

    130. Re:Not a bug by Hal_Porter · · Score: 1

      One of the reasons people stopped calling fsync was because on ext3 it was not necessary and

      http://lkml.org/lkml/2007/4/27/300

      On a good filesystem, when you do "fsync()" on a file, nothing at all
      happens to any other files. On ext3, it seems to sync the global journal,
      which means that just about *everything* that writes even a single byte
      (well, at least anything journalled, which would be all the normal
      directory ops etc) to disk will just *stop* dead cold!

      It's horrid. And it really is ext3, not "fsync()".

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    131. Re:Not a bug by caerwyn · · Score: 1

      Every scenario where data integrity is of importance to the user.

      Note that this is not every scenario. Many, many programs use temporary files for storage, and in those cases, the corruption of those files during a crash is of no consequence. I've written and used applications like this in the past.

      At the same time, I've written and used programs for which I could not countenance data corruption. And, for at least the programs I choose to use, those programs take the necessary measures to prevent it.

      Consider an application like PostgreSQL. It doesn't much care what it's running on- because it uses the API appropriately, it can get verifiably correct behavior on *any* operating system (barring some hardware that actively lies- no app can be expected to be correct in the face of things that lie to it).

      I *am* a developer, including of software that requires data guarantees. I have no sympathy for application developers who don't take such considerations and then complain when some other software doesn't compensate for their own bugs.

      --
      The ringing of the division bell has begun... -PF
    132. Re:Not a bug by caerwyn · · Score: 1

      This really is an ext3 issue, not "fsync()".

      The above says it all.

      The fact that ext3 was broken in this regard is a performance bug in ext3.

      Also, it is very much necessary on ext3. The commit delay was short on ext3 (5 seconds), but a *lot* can happen in 5 seconds, and a lot of data can still be lost in that time period.

      There's exactly one thing to take away from this. If you want data to be guaranteed to be on disk, call fsync(). You'll note that Linus doesn't at all disagree with this- what he's basically saying is that "ext3 implements fsync() in a braindead way that makes it painful for the whole system if people do the right thing."

      That's not an API issue. That's an ext3 issue, and then it's an issue with application developers coding with the expectation of ext3.

      --
      The ringing of the division bell has begun... -PF
    133. Re:Not a bug by Hal_Porter · · Score: 1

      Well you could clearly get away with not calling fsync() with ext3. And if you did call it it made the machine grind to a halt.

      Is it really surprising that people didn't do it?

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    134. Re:Not a bug by QuoteMstr · · Score: 1

      get away with not calling fsync()

      It was never necessary in the first place.

    135. Re:Not a bug by Anonymous Coward · · Score: 0

      They will call it GConf?

    136. Re:Not a bug by EsbenMoseHansen · · Score: 2, Informative

      No. Writing software properly means calling fsync() if you need a data guarantee.

      But neither Gnome nor KDE needs this. What they need is that the file in question is either left in the old state or in the new state. The problem is that ext4 rushes in to complete the truncation, but lazily after 1-2 minutes (!) writes the actual data. That is quite broken, in my opinion. The obvious solution would be to bundle the truncation with the writing out the data.

      Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.

      And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.

      In KDE's case, it would be as simple as reverting a patch. The fsync() calls were removed because of the bugs associated with them, including killing laptop batteries. Dig through kde-core-devel for the gory details. The code in question is posted elsewhere.

      The bug is in ext4, like it was in XFS --- where it was finally fixed. And it looks like ext4 has introduced a hack to sort of fix this problem there, too.

      --
      Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
    137. Re:Not a bug by master_p · · Score: 1

      I recognize that loop...it's "standard" code when dealing with tcp/ip streams.

      I have a question: shouldn't 'write' itself do this, i.e. making sure all data are sent down the tube?
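
      The loop under discussion is not reproduced in this part of the thread. A typical version, written here as a hypothetical write_all helper, retries until every byte has been accepted or a real error occurs:

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have been
 * handed to the kernel. Returns len on success, -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: try again */
            return -1;
        }
        p += n;                    /* skip past the bytes already written */
        left -= (size_t)n;
    }
    return (ssize_t)len;
}
```

      write() is specified as being allowed to return fewer bytes than requested, so a wrapper like this is the usual answer to the question, whether or not short writes ever show up on regular files in practice.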

    138. Re:Not a bug by Anonymous Coward · · Score: 0

      Ah, I am waiting for the registry cleaners and defragmenting applications to appear, to solve the problems that a Windows-style registry brings. Gnome is using an XML-format registry and you already have those applications for it.

      How many times does it need to be said that Linux is a Unix-like operating system and system settings are kept in text documents? Linux is not trying to be an NT copy, which is like a tree in someone's ass when it comes to modularity.

      I would like to see you do all the innovative things on the command line with the monolithic binary blob and some kind of tool.

    139. Re:Not a bug by Peeteriz · · Score: 1

      Stuff doesn't disappear - the only thing that is changing is worse recovery in case of power failure and crashes. And if your application wants to secure data from power failure, then it should do these additional (slower) things to ensure that.

    140. Re:Not a bug by Anonymous Coward · · Score: 0

      No, that's the bug. It loses ALL data. You get 0 byte files on reboot.

      Stop being ignorant. The reason that you get a 0-byte file on reboot is because KDE opened the file using O_TRUNC. To put it in simple terms: KDE asked for the file to be truncated, or in other words to destroy the file's contents. Then it wrote new data into the file, but didn't bother to verify if the write made it to disk, and it didn't create a backup copy.

      That's why you end up with a 0-byte file. It is not a filesystem bug. It is a KDE bug.

    141. Re:Not a bug by m50d · · Score: 1

      "No mainstream Linux file system supports versioning". That's a travesty, right there. VMS had filesystem-level versioning before I was born, it's an obviously useful feature (heck, is there anyone who *hasn't* accidentally deleted an important file?), yet there is no linux that will support it out of the box.

      --
      I am trolling
    142. Re:Not a bug by m50d · · Score: 1
      Except the filesystem API doesn't have any way to say "commit these 500 little files in a single transaction", unfortunately.

      This, right here, is the real problem. I believe reiser4 was adding calls to let one do this, but it really should be something possible in all filesystems.

      --
      I am trolling
    143. Re:Not a bug by grahamm · · Score: 1

      Or have KDE not keep rewriting these configuration files. If the user has not changed the application's configuration, why do the files have to be re-written? Surely, configuration files should be opened read-only when the application starts, closed and only opened for write (or create temp file, write, rename) when the configuration is changed.
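
      One way to avoid needless rewrites, sketched here under the assumption that the application already has the serialized configuration in memory, is to compare it with what is on disk and skip the write when nothing changed. The helper name file_has_contents is made up for illustration:

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations under strict -std= modes */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the file at path holds exactly the len
 * bytes in buf, 0 otherwise (including when the file is missing or unreadable). */
int file_has_contents(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    const char *want = buf;
    char chunk[4096];
    size_t off = 0;                 /* bytes compared so far, never exceeds len */
    int same = 1;

    for (;;) {
        ssize_t n = read(fd, chunk, sizeof chunk);
        if (n < 0) {                /* read error: treat as "different" */
            same = 0;
            break;
        }
        if (n == 0) {               /* end of file: equal only if lengths match */
            same = (off == len);
            break;
        }
        if ((size_t)n > len - off || memcmp(chunk, want + off, (size_t)n) != 0) {
            same = 0;
            break;
        }
        off += (size_t)n;
    }

    close(fd);
    return same;
}
```

      A caller would then only go through the temp-file-and-rename sequence when file_has_contents() returns 0.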

    144. Re:Not a bug by dkf · · Score: 1

      Quoting T'so:

      "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."

      The issue with Ted's comment is... sqlite3 does just that; I grepped the source, and it's certainly using fdatasync(). If it's getting anything wrong, that's a bug report that should be filed.

      If you want a masterclass in just how nasty data integrity can get, read the file in sqlite3 that implements these things (for unix). I'm not sure if the parts dealing with synchronization are as scary as the portions on file locking...

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    145. Re:Not a bug by jcupitt65 · · Score: 1

      No, this is using an application "registry" already. The debate is about how the registry keeps its data on disc.

      It currently stores each set of name/value pairs as a chunk of XML in a hierarchy of files. Updating this large set of files and being safe in case of crashes is difficult.

    146. Re:Not a bug by rtz · · Score: 1

      I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.

      And the point is, it's a burden you can't avoid, and never could.

      The magical fairy godmother file system that handles everything for you doesn't exist, and never did.

      If you want database like data integrity, use a database, and take the performance hit.

    147. Re:Not a bug by Fred_A · · Score: 1

      I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.

      I'm right with you on this. TCL + the Athena widgets (for maximum portability and less overhead), with a decent database backend storing system data on a dedicated partition, are a sure win. The year of the Linux desktop is finally at hand !

      What we need now is a newsletter.

      --

      May contain traces of nut.
      Made from the freshest electrons.
    148. Re:Not a bug by drinkypoo · · Score: 1

      For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later,

      Just to pick nits, this has nothing to do with the filesystem, but how it is used. I don't know of any contemporary filesystems that can't be mounted in an immediate-commit mode, including NTFS on Windows NT. You can disable write caching on the drive itself, too, and then every byte written out is, well, written out.

      If you want to be sure your file has reached disk you need to fsync(). This sucks, but it's the Unix way, and has been so for like, forever.

      It's not the Unix way, it's everyone's way, although it might not be called fsync().

      As SSDs replace rotating storage, there is less need to buffer writes (certainly the need to minimize seek time goes away, and that's the biggest reason), so we might see this whole situation resolved within a few years.

      Many SSDs are slow so instead of waiting for seeks you're waiting for writes. Anyway I believe the solution is more to have some battery/capacitor-based RAM on the disk which will act as a buffer; then you're only limited by bus bandwidth. The current buffers on disks are large enough to bone your journal in a power failure.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    149. Re:Not a bug by Ed+Avis · · Score: 1

      Putting versioning in the filesystem is an interesting idea but it has all sorts of rough edges in practice.

      Dave Cutler (the designer of VMS and later Windows NT) was asked whether he regretted not putting file versioning into Windows NT the same as VMS, and I believe he replied that on second thoughts, it's probably better that way. Linux has several very good revision control systems such as git which can track the history of your files without much effort. Certainly, it would be handy to have some kind of filesystem plugin or hook to automatically commit changes to a version control repository when a file is saved.

      --
      -- Ed Avis ed@membled.com
    150. Re:Not a bug by tkinnun0 · · Score: 1

      Not necessary, you say? And what's to say the behavior doesn't change in the future to take advantage of the leeway in the API? I bet KDE developers didn't think it necessary to do things properly either.

    151. Re:Not a bug by Anonymous Coward · · Score: 0

      Most of what you wrote is correct. There's one bit I need to comment on, though:

      You can disable write caching on the drive itself, too, and then every byte written out is, well, written out.

      Some (not all) drives lie: You disable write caching on the drive. The drive reports write caching as disabled. The drive caches writes anyway. Yes, that sucks.

      Except for that nitpick, you're right.

    152. Re:Not a bug by Anonymous Coward · · Score: 0

      That's the difference between an average so-so programmer, and one worth bothering with :-)

      So-so programmers don't know how the underlying stuff works, since that would actually take a lot of work and study to pull off. When confronted with the fact that they're basically doing idiotic things, they find excuses that don't fly.

      Good programmers would know that if they need to take care of tons of small files that have to work as an atomic unit, they need database commit-like semantics, or a file structure that lends itself to atomic operations.

      In other words, it *REALLY* was just incompetence in userspace applications, nothing else.

      Contrast the 'mbox' format with the 'maildir' format. One was a one-hour effort by some long-forgotten programmer that didn't think far ahead, the other one was engineered to do what it should do to keep mail safe.

    153. Re:Not a bug by hesaigo999ca · · Score: 1

      Actually, not really. The left hand didn't know what the right hand was doing, and I live it every day where I work as a software developer. Ext3 was written way before Ext4, and the people who wrote Ext4 afterwards were not actually responsible for any other application running on Ext3 trying to migrate to Ext4. Still, common sense would dictate that anyone creating a new file system should not only try it out in a new environment, but also in a migrated one, as in "what would happen if we migrated from ext3 to ext4..."

      Clearly this is a case of "it's not my job, it's his... no, it's not my job, it was theirs..."
      I know how this ends up happening, it will be interesting to see how they will fix it, or if they will leave it to the other team to fix their software to run in the new environment, but clearly....this is not a good day for Linux.

    154. Re:Not a bug by Anonymous Coward · · Score: 0

      Unless you wanted your last five seconds (or even possibly more) of writes not on the platter...

    155. Re:Not a bug by davecb · · Score: 1

      The sequence the guis appear to be using is the traditional one

      1. open existing file
      2. read contents
      3. seek to beginning
      4. write new contents
      5. close

      Therefore a traditional filesystem would, in step 4, write the data blocks, change the length in the inode and release any unused blocks if the file was much smaller. The opportunity for an undefined result would be a crash between the first write and the length update. This appears to be what ext3 is doing, and what I (and v6) have done in the past.

      A nontraditional filesystem could improve on this by writing new blocks and then a new inode, then atomically linking the new inode in to the directory block. And yes, such filesystems exist (;-))
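
      A minimal sketch of the five-step in-place sequence above, in C with plain POSIX calls. The function name overwrite_in_place and the fixed 4096-byte read buffer are illustrative assumptions, not something the GUI toolkits are confirmed to use. The point is only to show where the window sits: a crash after the first write() but before the length is settled can leave a mix of old and new data.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Rewrite an existing file in place (steps 1-5 of the list above).
 * Hypothetical helper for illustration. Returns 0 on success, -1 on error. */
int overwrite_in_place(const char *path, const char *new_contents)
{
    char old[4096];
    size_t len = strlen(new_contents);
    size_t done = 0;

    int fd = open(path, O_RDWR);                 /* 1. open existing file */
    if (fd < 0)
        return -1;

    if (read(fd, old, sizeof old) < 0)           /* 2. read contents */
        goto fail;
    if (lseek(fd, 0, SEEK_SET) < 0)              /* 3. seek to beginning */
        goto fail;

    while (done < len) {                         /* 4. write new contents */
        ssize_t n = write(fd, new_contents + done, len - done);
        if (n < 0)
            goto fail;
        done += (size_t)n;
    }
    /* Drop any tail left over from a longer old version. */
    if (ftruncate(fd, (off_t)len) < 0)
        goto fail;

    return close(fd);                            /* 5. close */

fail:
    close(fd);
    return -1;
}
```

      Nothing here asks for the data to reach the disk, so how long the risky window stays open is entirely up to the filesystem.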

      --dave

      --
      davecb@spamcop.net
    156. Re:Not a bug by ukyoCE · · Score: 1

      The quote is recommending an app-specific database, not an OS-wide database for storing everything about both the OS and applications.

      The difference is not insignificant.

    157. Re:Not a bug by Anonymous Coward · · Score: 0
    158. Re:Not a bug by snemarch · · Score: 1

      Thanks for clearing that up, I missed the "if(str->_flag & _IOCOMMIT)" part last night - I blame late hour and lack of coffee... Sorry! :)

      You're probably better off doing explicit _commit(_fileno(fstream)) when necessary, rather than opening with the 'c' flag, for performance reasons.
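
      A minimal sketch of that explicit-commit approach. The wrapper name commit_stream is made up for illustration; on Windows it uses the _commit(_fileno(...)) pair mentioned above, and on POSIX systems it falls back to fsync(fileno(...)) as the nearest equivalent. Either way the stdio buffer has to be flushed first, otherwise the data may never have left the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

#ifdef _WIN32
#include <io.h>      /* _commit */
#else
#include <unistd.h>  /* fsync */
#endif

/* Push a stdio stream's data out to the disk on demand.
 * Hypothetical helper for illustration. Returns 0 on success, -1 on error. */
int commit_stream(FILE *stream)
{
    /* First move data from the stdio buffer into the OS. */
    if (fflush(stream) != 0)
        return -1;
#ifdef _WIN32
    return _commit(_fileno(stream));
#else
    return fsync(fileno(stream));
#endif
}
```

      Calling this only at the points where durability matters keeps ordinary writes buffered, which is the performance argument for preferring it over the 'c' open flag.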

      --
      Coffee-driven development.
    159. Re:Not a bug by snemarch · · Score: 1

      *grin*

      I can understand that some stuff runs a lot faster with it (http://technet.microsoft.com/en-us/magazine/2007.04.windowsconfidential.aspx), but it sounds weird that it speeds up compiling... guess I should take a look at procmon and see what operations MSVS does. I know that it creates temporary files, but iirc those are done using FILE_ATTRIBUTE_TEMPORARY (ie, "try to keep the file in memory and avoid disk") and don't call FlushFileBuffers :)

      --
      Coffee-driven development.
    160. Re:Not a bug by Flammon · · Score: 1

      This "guy" is Theodore Ts'o and he's one of the most brilliant and respected Linux kernel hackers. http://en.wikipedia.org/wiki/Theodore_Ts'o http://thunk.org/tytso/

    161. Re:Not a bug by caseih · · Score: 1

      Right. That makes sense. I read all of Ts'o's comments, but I'm not as up on how it all works as you. GP poster, however, clearly did not read the FA or the bug report. That much is clear. Simply having an open file doesn't result in files being truncated on reboot.

    162. Re:Not a bug by mzs · · Score: 2, Insightful

      Unfortunately that is case #2 as described here:

      https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54

      rename(2) is not guaranteed to be atomic. There are now some patches that get ext4 to perform what most people expect #2 to do. I got bitten by #2 not working correctly on MacOS X some time back, I just googled and found this:

      http://www.weirdnet.nl/apple/rename.html

      Ever since that time I have been using fsync in my code when I needed it. You just get into a world of hurt when you expect #2 to work right under every OS and fs and set of mount options because it doesn't.
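
      For reference, a minimal sketch of the "fsync before rename" pattern, assuming a POSIX system. The function name replace_file_durably and the caller-supplied temporary path are illustrative assumptions. It does not fsync the containing directory, so after a crash the rename itself may not have happened; in that case the old file is what you get back.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` with `data` via a temporary file in the same directory.
 * `tmp_path` is a caller-chosen scratch name (hypothetical convention).
 * Returns 0 on success, -1 on error. */
int replace_file_durably(const char *path, const char *tmp_path,
                         const void *data, size_t len)
{
    const char *p = data;
    size_t done = 0;

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0)
            goto fail;
        done += (size_t)n;
    }

    /* Force the new contents to disk BEFORE the rename, so the name
     * never points at a file whose data has not been written yet. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

      The cost is one synchronous flush per update, which is exactly the overhead other posters in this thread object to.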

    163. Re:Not a bug by Hatta · · Score: 1

      Nowhere it says that the system cannot flush the data that you've already written so far without an explicit fsync() call.

      Then the spec is wrong. Losing data that hasn't been flushed to disk yet is totally reasonable. Losing data already on disk is not. Delay your writes all day long if you want, just don't delete the old file until you actually write it. What's so hard about that?

      --
      Give me Classic Slashdot or give me death!
    164. Re:Not a bug by Hatta · · Score: 1

      The way you do it is to rename the old file to backup and write to a new file.

      Why doesn't the file system do this automatically? Write the new file, but don't delete the old one until the new one is finished. Seems like an easy solution.

      --
      Give me Classic Slashdot or give me death!
    165. Re:Not a bug by shutdown+-p+now · · Score: 1

      Then the spec is wrong. Losing data that hasn't been flushed to disk yet is totally reasonable. Losing data already on disk is not.

      It's just the way modern filesystems work. They aren't databases, and fsync() is not a transaction border marker. It is only an explicit request to persist all data cached so far, nothing less, nothing more. What you seem to want is true atomicity, and that is just not there, and has never been, spec or not. So far the only mainstream OS/FS combo that I know that has true ACID transactions is Vista/NTFS.

      On a side note, I agree that transacted filesystems would be immensely useful, as there are many, many ways where you really want atomicity at least. The usual response is "well just use a database", but sometimes you don't want relational storage or even a key-value mapping, which most DBMS imply - just ACID on files.

    166. Re:Not a bug by Anonymous Coward · · Score: 0

      You're writing the same data over and over again instead of incrementing the data pointer to (data+written) and subtracting written from the length.
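
      A minimal sketch of the corrected loop, assuming POSIX write(). The name write_all is illustrative. Each pass advances the pointer by what write() accepted and shrinks the remaining length by the same amount, so a short write never causes the start of the buffer to be sent again.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Write all `len` bytes of `data` to `fd`, coping with short writes.
 * Returns 0 on success, -1 on error (errno set by write()). */
int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;

    while (len > 0) {
        ssize_t written = write(fd, p, len);
        if (written < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before writing anything: retry */
            return -1;
        }
        p   += written;          /* advance past what was accepted */
        len -= (size_t)written;  /* and shrink what is left */
    }
    return 0;
}
```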

    167. Re:Not a bug by csartanis · · Score: 1

      This is completely wrong. Every modern OS uses write caching to increase perceived write speed. The write will eventually make it to disk but is not guaranteed until after an fsync completes (and sometimes not even then.)

    168. Re:Not a bug by Anonymous Coward · · Score: 0

      Not true. KDE at least does the write-to-temporary-and-rename (except in the case where the file is writeable, but not owned by the user)

    169. Re:Not a bug by davecb · · Score: 1

      Yes, thanks!

      The relevant sequence seems to be

      1. open new file
      2. write
      3. close, updating the inode's length field
      4. rename

      which is the normal V6-era algorithm for atomic update, exactly as per Dwyer's CACM article.

      The opportunity for an undefined result would then be a crash between the first write and either the update or the rename.

      That there is such a large time during which the consistency of the filesystem is at risk seems even less understandable with this algorithm.

      Surely the metadata updates can be ordered first and then the whole sequence deferred, so that the risk is between the data and metadata writes of a consolidated set of I/Os.

      --dave

      --
      davecb@spamcop.net
    170. Re:Not a bug by Brandybuck · · Score: 1

      In other words, Tso wants to move the problem from the file system to a database, because then he doesn't have to fix the database.

      --
      Don't blame me, I didn't vote for either of them!
    171. Re:Not a bug by Brandybuck · · Score: 1

      The original problem was about all of those tiny configuration files that GNOME and KDE use. At the worst of times (when I get into a reconfiguration frenzy) I will not be modifying more than one or two configuration files per sixty second period. If ext4 can't handle that rate, then ext4 is broken.

      Except for installation time, I will NEVER have the need for "a way to write those files in a batch".

      --
      Don't blame me, I didn't vote for either of them!
    172. Re:Not a bug by Brandybuck · · Score: 1

      All those config files are in just two (easily linked to just one). They are $HOME/.kde, and $HOME/.config. All that litter you see in your home directory does not come from KDE apps. If you find exceptions, log a bug.

      p.s. Yeah, I know GNOME does a lot more littering than this, but that's why you're a KDE fanboi.

      --
      Don't blame me, I didn't vote for either of them!
    173. Re:Not a bug by bluefoxlucid · · Score: 1

      That sounds like HAL ... opens and closes the same XML files 4000 times during initialization, instead of reading everything into memory and inotifying the directory.

    174. Re:Not a bug by spitzak · · Score: 1

      The commit delay was short on ext3 (5 seconds), but a *lot* can happen in 5 seconds, and a lot of data can still be lost in that time period.

      I need to keep correcting this misconception:

      The bug did NOT happen on EXT3, even if the machine crashed during those 5 seconds. Making EXT4 commit in 5 seconds will not fix it, the time delay has nothing to do with it.

      The other mistake you are making is the typical one of thinking what is needed is fsync(). Yes that "fixes" the problem, but only by slowing the whole system down considerably. This is because it enforces a much stronger requirement than is actually necessary to fix the bug.

      The fix, as mentioned about a dozen times above, is to have rename(A,B) enforce ordering so that all the blocks of A are committed to disk before the rename is committed to disk. Note that the rename can happen a year later if you want, and the bug will be fixed.

    175. Re:Not a bug by spitzak · · Score: 1

      Make a new file, fill it up, rename it over the other file.

      This is EXACTLY what KDE is doing, and this is EXACTLY what EXT4 is breaking!

    176. Re:Not a bug by spitzak · · Score: 1

      No you do not understand.

      It is not "important that this data get onto disk now"

      What is important is that when I rename(A,B), "B either has its old contents or has been replaced by the contents that were in A". This can happen a YEAR in the future, and the important criterion has been met.

      And on modern filesystems your criterion is slow to implement, while the second one (because it can happen far in the future) can be batched with others and is fast.

    177. Re:Not a bug by spitzak · · Score: 1

      You are wrong.

      In EXT3, if it crashed during those 5 seconds, this bug did NOT happen!!!!

      In EXT4, the fact that the time is 150 seconds or whatever is irrelevant. Even if shortened to 5 seconds, if the system crashed, the bug DOES happen!

      Adding fsync to "fix" this will actually make EXT4 *slower* than EXT3. Fsync is NOT what is needed. What is needed is for rename(A,B) to ensure that A's blocks are on disk before the rename happens. Note that the rename can happen 150 seconds later, or a year later. It is the order that is important, and the order is what EXT4 breaks.

    178. Re:Not a bug by spitzak · · Score: 1

      Bullshit. NTFS is not the "only" system. Unix has had atomic rename for 30 years. This is the function that is wanted, and is the function that EXT4 breaks.

      Windows of course has never implemented atomic rename, and manages to come up with 50 different options and file flags to try to make up for it. And now the same idiots are invading Linux, saying "call fsync!" as though fixing the symptoms by making the system slow is how to fix a bug.

      This is getting really sad as it is obvious that the knowledgeable parties are in the tiny minority here.

      Every single person who mentions "fsync" as a solution is, for lack of a better word, stupid.

    179. Re:Not a bug by harlows_monkeys · · Score: 1

      It's worse than a KDE problem. It goes to the heart of Linux/Unix which have always been dependent on a multitude of small text files

      But not a multitude of small text files that are frequently changed.

      Anytime you suggest users re-write their entire code base to get around a bug you've created your professional pride should well up, grab you by the wattles and slap you till you spit

      This so-called "bug" has been normal Unix behavior for the 29 years I've been using Unix.

    180. Re:Not a bug by spitzak · · Score: 1

      Wrong. KDE opened another file. This file was truncated to zero at some time. They then wrote the data and closed the file. They then renamed it to replace the file in question.

      At no point did they truncate the file in question. So this result is certainly unexpected!

    181. Re:Not a bug by shutdown+-p+now · · Score: 1

      Bullshit. NTFS is not the "only" system.

      I said that NTFS is the only system that provides full-featured atomic FS transactions that can consist of any number of I/O and FS operations. I didn't say that it's the only one that provides atomic rename().

      Unix has had atomic rename for 30 years. This is the function that is wanted, and is the function that EXT4 breaks.

      Is the atomicity of rename a part of the spec, though? (I do understand that it's a common sense requirement, but still...)

      Windows of course has never implemented atomic rename

      This makes me wonder. In what way is rename in Windows not atomic? I.e., what possible effects can you get that expose that fact?

      Every single person who mentions "fsync" as a solution is, for lack of a better word, stupid.

      No, merely not familiar with details of this case (this is /. after all - very few RTFA?), or simply not sufficiently educated on the issue. Well, my excuse at least is that I'm a Windows developer, not a Linux one, even if I happen to know bits and pieces about the latter.

      Thankfully, there have been a few posts in the discussion explaining about rename() and why ext4 is in the wrong here, and they've been modded up. So the masses got educated, hopefully. I know I was.

    182. Re:Not a bug by spitzak · · Score: 1

      Wrong. The longer delay is not the bug. If EXT3 crashed during those 5 seconds, the bug did not happen. If EXT4 was changed to flush after 5 seconds, and it crashes during those 5 seconds, the bug still will happen.

      The problem is that rename(A,B) does not force the contents of A to be up to date before the rename happens. Virtually every Unix program in existence that tries to safely update files assumes this is true.

      Sticking fsync() in there, as about a thousand idiots have suggested, is NOT the solution. It will "fix" it but only at the cost of slowing everything greatly. The thing they are missing is that it is ok if the *old* file is still there after a crash. What is unacceptable is that neither the old nor the new file is there in EXT4.

    183. Re:Not a bug by spitzak · · Score: 1

      You are misunderstanding the bug.

      Any description of the real bug must contain a rename() call.

      The question is what happens if you close the file and then rename() it. Now I know that Windows lacks atomic rename, so let's first delete the destination file. The possible results after a crash and recovery should be:

            1. The old file is still there (OK)
            2. The old file is missing and the new file exists with the new data in it (this is a bug with Windows and does not happen on Unix with either EXT3 or 4. It is not good but at least a good copy of the data is in the new file)
            3. The old file has the new contents and the new file is missing. (OK)

      Something like EXT4's bug results in the following results on Windows:
            4. The old file contains data different than either its previous contents or what was written to the new file (typically it is empty) and the new file is missing.
            5. The old file is missing and the new file contains data different than what was written to it.

      I suspect Windows does NOT have this bug, as it really is an incredible annoyance that a crash can destroy all copies of the data on your disk and they would have noticed and fixed this pretty quickly.

    184. Re:Not a bug by makomk · · Score: 1

      Yeah, fdatasync is grand - but it means abandoning plain-text configuration and settings files, at least for anything that's going to be modified by the app. If you're using plain-text files and rewriting the entire file, fdatasync is equivalent to fsync, and you get stuck with awful performance on ext3.

    185. Re:Not a bug by spitzak · · Score: 1

      This extra flag is an EXCELLENT idea!

      One error you made is that it can crash between or during step 4 and the file is not truncated. Ie the crash point is before 5, not 4 as you implied.

      I think your cons can be addressed:

      Must create new open() flag:
          I suspect O_CREAT can be reused, and that the vast majority of programs using this flag actually expect the behavior you define. Whether O_CREAT acts this way or not could be controlled with some global switch the program can do at startup.

      Need free space for temp copy:
          This may be tricky but the only safe way to update a file now needs the same space. It is plausible that a file system could be designed to need a lot less free space, it is only needed *during* step 4 (otherwise the file is just in memory). Thus only the space for one file at a time may be needed.

      Can still lose data:
          Make calling fsync() on this file do steps 4+5 above, and after that it is as though the file acts normally.

    186. Re:Not a bug by spitzak · · Score: 1

      Actually one important detail:

      If the program crashes, the new file should disappear as though it had no effect. It should not act like close().

      Also it would be useful, though not a requirement, that there is something that can be done to the file so that it just goes away without changing anything. Programs can use this if they determine something is wrong while they are writing it. I'm not sure what this action should be.

    187. Re:Not a bug by GigsVT · · Score: 1

      Not any more.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    188. Re:Not a bug by Anonymous Coward · · Score: 0

      It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.

      The trick is that the old file is deleted and a new file created or the old file truncated. All of those metadata events go into the journal. Any data that is written to the new file is buffered for 60 seconds, just in case the file is deleted or the data rewritten again, so that the write access can be completely skipped.

      With old fashioned filesystems like ext3, everything you ever write to disk, even if you delete the file immediately hits the disk, maybe even 5 seconds after the file was deleted. The ext4-style semantics have been in XFS for over a decade, and they have bitten me a few times due to hardware failures, power outages or my own stupidity, but that's a tradeoff that I'm willing to take.

    189. Re:Not a bug by spitzak · · Score: 1

      In what way is rename in Windows not atomic?

      If the file already exists it returns an error and does not do the rename. You have to delete the old file and then do the rename as two separate calls and thus it is not atomic.

    190. Re:Not a bug by speedtux · · Score: 1

      For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later,

      Yes, with a delay of 100ms. Not with a delay of 100s. Holding back writes for that long greatly increases the risk of data loss and it is the wrong thing to do.

      For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later, giving better disk performance at the expense of losing data if there is a crash.

      Holding back writes that long isn't going to improve performance much, but it is a huge risk. Trading a few percent in benchmark gains for a much higher risk of data loss is a bad idea.

      This sucks, but it's the Unix way

      You can't just take some semantics and arbitrarily stretch it beyond reason. There are lots of guarantees that UNIX doesn't make that you would be annoyed if they changed. For example, there is no guarantee that UNIX outputs characters to your screen without a delay (in fact, there is a little delay). It would be much more efficient if it just buffered all output for 60s before actually sending it to your screen, wouldn't it?

    191. Re:Not a bug by speedtux · · Score: 1

      Linux reinvents windows registry? Who knows what they will come up with next.

      I think it's the influx of Windows refugees that brings over all these lousy ideas.

    192. Re:Not a bug by jc42 · · Score: 1

      And there are better options than syncing for reliability. For example, rename the file to backup and then write a new file. The backup will still be there and can be used for automated recovery. Come to think of it, any decent text editor does it that way.

      Huh? I've rejected several editors due to this behavior, and I know a number of project leaders that do the same thing. If you have any multiply-linked files, it breaks the linking, turning files in different directories into a flock of files, each with a few small changes. Discovering that this has happened and undoing the damage can be a real nightmare. The sensible approach is to only use editors that write when you say to write, to the file you say to write. (The main editors on unix systems have explicit options to control such backups.)

      Of course, if you don't (know how to) use multiply-linked files, this isn't an issue for you. Unless you're working on a project that uses multiply-linked files, in which case you're a danger to the project. And if you don't understand when you should (and shouldn't) use multiply-linked files, I don't want you working on any projects of mine.

      The right behavior is that when the editor writes the changes to a file and calls fsync(), the system writes the buffers as fast as it can. If it doesn't, fsync() is broken. If it does, but the disk controller holds some of the data in volatile buffers, you bought a bad disk controller.

      And, in any case, you should have a UPS that can provide enough power to flush all the file buffers, even if all are marked modified. If not, you didn't buy a good enough UPS. And you should try to use OSs and disk controllers that write the data when the software tells them to. If you have all of these, you shouldn't have problems when the power goes out.

      (It used to be common for unix systems to run a little 1-line C program that called sync() once a minute or so. Is this still at all common?)

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    193. Re:Not a bug by jc42 · · Score: 1

      Why is it unreasonable to expect the OS and the FS to handle this seemingly common case of writing lots of small files, by introducing a new "fsync all" API call if necessary?

      Huh? We've had that API call for decades. It's called "sync". ;-)

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    194. Re:Not a bug by shutdown+-p+now · · Score: 1

      If the file already exists it returns an error and does not do the rename. You have to delete the old file and then do the rename as two separate calls and thus it is not atomic.

      Doesn't MoveFileEx with MOVEFILE_REPLACE_EXISTING flag allow renaming on top of an existing file?

    195. Re:Not a bug by shutdown+-p+now · · Score: 1

      That syncs absolutely everything. What I was talking about is rather something like void fsyncall(int n, int fd[n]); - where you explicitly specify what you want to sync.

    196. Re:Not a bug by spitzak · · Score: 1

      That sounds right but I never learned about it. They need to fix their rename() function to call this.

      I guess I can kludge in a macro to fix this.

      Found the man page and it does not work for directories so this does not do the job. But it gets the majority of uses which is single file updates. It also is not clear what happens if another process has the old file open though I'm guessing it results in an error?
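
      A rough sketch of what such a wrapper could look like. The name rename_replace is made up; on Windows it calls MoveFileExA with MOVEFILE_REPLACE_EXISTING, and elsewhere it is just rename(). Whether the Windows replacement is atomic or crash-safe is not something this sketch claims, and as noted it does not cover directories.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Rename `from` over `to`, replacing `to` if it already exists.
 * Hypothetical helper for illustration. Returns 0 on success, -1 on error. */
int rename_replace(const char *from, const char *to)
{
#ifdef _WIN32
    /* MOVEFILE_REPLACE_EXISTING lets the move overwrite an existing file
     * (not a directory). */
    return MoveFileExA(from, to, MOVEFILE_REPLACE_EXISTING) ? 0 : -1;
#else
    return rename(from, to) == 0 ? 0 : -1;
#endif
}
```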

    197. Re:Not a bug by jc42 · · Score: 1

      Name one scenario where fsync should be necessary other than: ...
      2) rewriting data back to an existing file... like for instance something a database might do.

      Hmmm ... The followups imply that some people have a misunderstanding of how the POSIX write() call works, and interpret the above as implying that when process A writes data to a file and process B reads the same spot in the file, B may get a jumbled mixture of the old and new data. But the POSIX spec for write() explicitly states that a write of up to PIPE_BUF bytes is atomic. This was generally true of unix kernels long before the POSIX standard was created. If your kernel requires fsync() in such situations, your kernel isn't POSIX compliant, and you should file a bug report.

      OTOH, if you're writing your data in chunks bigger than PIPE_BUF bytes, you probably should consider that an overlapping read may see an inconsistent state. But fsync() won't fix this; it's irrelevant to the issue of write/read overlap integrity.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    198. Re:Not a bug by shutdown+-p+now · · Score: 1

      It also is not clear what happens if another process has the old file open though I'm guessing it results in an error?

      I think it depends on what type of locking the other process requested when it opened the file with CreateFile. If they specified FILE_SHARE_DELETE, then concurrent delete is allowed (with same semantics as in Unix - file metadata record is deleted, but the data blocks remain until last opened handle to the file is closed). Of course, the lazy devs often don't bother (in fact, all too often I've seen people pass 0 as the corresponding CreateFile argument, which means "exclusive lock").

    199. Re:Not a bug by spitzak · · Score: 1

      No I meant "what happens if another process has the file open for read?".

      On Unix this is irrelevant, it keeps reading the old copy of the file, which is deleted when the file is closed.

      On Windows I know that you can't delete() these files, so I am guessing that rename() even with this flag does not get around this annoyance.
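
      The Unix half of this is easy to demonstrate. A small sketch, assuming a POSIX system; the function names and the fixed "old"/"new" contents are illustrative. It opens a file for reading, renames a new version over it, and then reads from the descriptor that was opened first.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Helper: create `path` holding `text`. Returns 0 on success, -1 on error. */
static int put_file(const char *path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, len);
    if (close(fd) < 0 || n != (ssize_t)len)
        return -1;
    return 0;
}

/* Open `path` for reading, rename a new version over it, then read from
 * the descriptor opened earlier. Copies what the reader saw into `seen`
 * (NUL-terminated). Returns 0 on success, -1 on error. */
int read_across_replace(const char *path, const char *tmp_path,
                        char *seen, size_t seen_size)
{
    if (seen_size == 0 || put_file(path, "old") < 0)
        return -1;

    int reader = open(path, O_RDONLY);          /* stands in for the other process */
    if (reader < 0)
        return -1;

    if (put_file(tmp_path, "new") < 0 || rename(tmp_path, path) < 0) {
        close(reader);
        return -1;
    }

    ssize_t n = read(reader, seen, seen_size - 1);
    close(reader);
    if (n < 0)
        return -1;
    seen[n] = '\0';
    return 0;
}
```

      The reader ends up with "old" in its buffer while anyone opening the path afterwards gets "new"; the old inode is released once the last descriptor on it is closed.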

    200. Re:Not a bug by shutdown+-p+now · · Score: 1

      No I meant "what happens if another process has the file open for read?".

      That was precisely what I meant. If the other process has opened the file for reading (or even writing) with FILE_SHARE_DELETE, then you can delete it with the same semantics as on Unix, and I would expect MoveFile to respect that, too. If the other process did not specify FILE_SHARE_DELETE when opening the file, then you can't delete it.

      That said, I remember now that on Windows I could often rename files locked by some process even if I couldn't delete them, so maybe MoveFile is actually even more relaxed than that.

    201. Re:Not a bug by jc42 · · Score: 1

      ... can't imagine many reasons for having hundreds of tiny config files for a single app ...

      Ooh, ooh, I know one!

      Some years back, I stumbled into being the admin of one of what are by now probably several thousand specialized "search engines" for a certain kind of rather technical data whose nature isn't too important here. The "app" includes a search bot and a single-request download program, and both of them have to deal with what is now nearly 400 sites that have the data. Each site has its own web server, and I sometimes suspect that no two of them are running the same server. So there's a "cfg" directory that contains a file for each hostname, and the file contains config info for dealing with that host's web server.

      The most important thing is getting the HTTP level right, because a lot of servers will reject requests if you ask for the wrong level. The default is "HTTP/1.1", but a significant number of the sites' servers require a different level. And this can't be set up at the first access to a host, because sometimes people upgrade or reconfigure their servers. So what the download routine does is look for the HTTP/* entry in a host's cfg file, and uses "HTTP/1.1" if it isn't found. Occasionally a download attempt fails, and the routine goes into a "determine HTTP level" mode, in which it tries a list of level numbers looking for one that works. If it succeeds, it updates the host's cfg file to tell later downloads what level to use.

      Actually, there aren't "hundreds" of config files (yet). The number is only slightly over 100, because most of the servers work fine with the default settings. Most of the cfg files are tiny, because few servers have more than one or two settings that need to be specified. The biggest is 468 bytes; the rest are under 100. I expect that the number of files could reach 200 in a few more years.

      I expect that others will be able to describe some more situations where they use a large flock of tiny config files for some app or set of closely-related apps.

      The app updates a config file by building the new contents in memory, and overwriting the file with a single write() call. It doesn't bother with a flush() call, because such updates are very rare, and if the write fails, it just means that the "determine HTTP level" code will be done again the next time that host is accessed. In a decade or so of life, this has never happened. I can say this with some certainty, because such things get written to a global "special events" log file, so I've seen all the cases where an app updates a config file.
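
      For anyone curious, the lookup side (find the HTTP/* entry, fall back to "HTTP/1.1") might be sketched like this. The file format is an assumption made purely for illustration: one setting per line, with the level on a line starting with "HTTP/". The function name http_level_for_host is likewise hypothetical and this is not the actual code of the app described above.

```c
#include <stdio.h>
#include <string.h>

#define DEFAULT_HTTP_LEVEL "HTTP/1.1"

/* Copy the HTTP level for a host into `out`.
 * ASSUMPTION: the cfg file has one setting per line and the level is
 * a line starting with "HTTP/"; the real file format is not described. */
void http_level_for_host(const char *cfg_path, char *out, size_t out_size)
{
    char line[256];
    FILE *f;

    if (out_size == 0)
        return;

    /* Start from the default; a missing file or entry leaves it in place. */
    snprintf(out, out_size, "%s", DEFAULT_HTTP_LEVEL);

    f = fopen(cfg_path, "r");
    if (f == NULL)
        return;

    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "HTTP/", 5) == 0) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
            snprintf(out, out_size, "%s", line);
            break;
        }
    }
    fclose(f);
}
```

      Because a missing or damaged file simply yields the default, losing an update in a crash costs nothing more than another round of level probing.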

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    202. Re:Not a bug by spitzak · · Score: 1

      Gah. Rather unfortunate that this is an option on the file reader, since I have little control over that. I would like to see an option on the rename function instead.

      Renaming files has worked for remote mounts for quite awhile and we rely on that to work around the file locking bug. I think it is broken if SMB/NFS is not being used, however.

    203. Re:Not a bug by shutdown+-p+now · · Score: 1

      Gah. Rather unfortunate that this is an option on the file reader, since I have little control over that. I would like to see an option on the rename function instead.

      I think it makes sense that whoever opened the file first can lock it as needed - for example, I can imagine some code wanting to ensure that the file keeps existing under the name it was created with, at least until it's closed. I just think that FILE_SHARE_NONE is an unfortunate default for lock level. I can understand it from an API design perspective if you think of a "dumb client" - a programmer who doesn't think about possible race conditions, and doesn't realize that the file he is working with can be opened by other processes. In that sense, exclusive lock by default is the safest choice, and someone who knows what he's doing can always specify the precise level anyway; but it sure is annoying in practice when everyone starts using the default blindly, and you have to deal with those undeletable files.

    204. Re:Not a bug by spitzak · · Score: 1

      I think the "dumb client" would be fine with Unix style behavior, where it keeps getting the previous version of the file. Most of them are just opening it to load it into memory and will close it soon afterwards.

      The reason for the default is for back-compatibility with older versions of Windows. But it would be nice if they risked it and changed this. All we want is for delete (and the atomic rename) to work and let the program reading the file continue reading the old version. It is ok (in fact probably better than Unix) if other programs are prevented from writing the file. I just want to allow rename and delete.

    205. Re:Not a bug by jc42 · · Score: 1

      Yeah; I suspected that that's what you meant. But you can get more humor out of a situation by answering an extreme interpretation of what someone just said ...

      Anyway, I've always sorta thought it was good if a few programs were calling sync() at a moderate rate. On a few systems, I've added a little background process that just sleeps and calls sync() every N minutes. This was usually in response to someone discovering that the system was holding unwritten data in memory for hours because the system load was too light to force flushing. It's always seemed to me that an OS should have a system call that sets the max time that a write buffer can exist before it's flushed. With all the never-used features that are bloating up all our OSs, you'd think they'd have included that one.
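
      The little background process described above might look roughly like this. A sketch only: periodic_sync and its parameters are made-up names, and os.sync() is Unix-only (Python 3.3 and later).

```python
import os
import time


def periodic_sync(interval_seconds, iterations=None, sleep=time.sleep):
    """Sleep, call sync(), repeat; loops forever when iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        sleep(interval_seconds)
        os.sync()  # ask the kernel to flush all dirty buffers to disk
        count += 1
    return count
```

      Running periodic_sync(300) from an init script would bound the age of unwritten data at roughly five minutes, at the cost of a system-wide flush each time.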

      Or maybe they have, and they just haven't told us mere users.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    206. Re:Not a bug by cryptoluddite · · Score: 1

      Every scenario where data integrity is of importance to the user.

      That's scenario 1 I posted, a 3rd party (the user).

      Consider an application like PostgreSQL.

      That's scenario 2 I posted, a program that rewrites part of its file. I even specifically mentioned databases.

      So your answer then is no, you can't think of any other situations. Which means that the vast majority of programs do not need to call fsync in order to be properly coded.

      A program such as KDE or gnome storing settings files is not a case where the user expects data integrity. In fact, if for example you change screen resolution and it crashes the computer it's better for the filesystem to lose the change. Calling fsync is not appropriate for these programs, because the write-new-file + rename-to-original scheme is all that need be done. Except on dysfunctional filesystems such as pre-patch ext4. But that is not a fault of those applications.

    207. Re:Not a bug by Anonymous Coward · · Score: 0

      "It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it."

      That's kind of the point of the kernel & modules - to be the magic computer fairy.

      Quotes:
      "An o/s should never have been something that people (in general) really care about: it should be completely invisible and nobody should give a flying f*** about it except the technical people."
      "The kernel should be pretty much the invisible magician in the background"

      --Linus Torvalds, 2008

    208. Re:Not a bug by Ed+Avis · · Score: 1

      Classical UNIX ran sync(1) from the crontab every 30 seconds to flush data to disk, so I guess if there is an informal precedent for the maximum time writes can be held back, that would be it.

      You make a good point. Another example would be that when sending packets via UDP, there is no guarantee they will reach the other end, but obviously any network which dropped all UDP packets would be considered broken. And yet... if an application were written so that it relied on the UDP delivery succeeding for correct operation, and left corrupted data otherwise, it would be the application that's considered buggy. I would be annoyed if my terminal output were delayed by 60 seconds, but if you were writing a safety-critical real-time application you could not use Linux terminal output for just this reason, that there is no upper limit to the delay before the output reaches the user.

      I think both sides, OS and application, need to follow Postel's maxim: be conservative in what you send, liberal in what you accept. In this case the filesystem should try harder to meet traditional expectations of safety, even if the letter of the law doesn't require it. And the applications should be written more carefully, such that they won't lose data even under the worst circumstances and do not rely on filesystem guarantees that aren't really guarantees.

      --
      -- Ed Avis ed@membled.com
    209. Re:Not a bug by davecb · · Score: 1

      Actually I understood what was happening, but I found it exceedingly odd that it escaped from the lab (;-))

      --dave

      --
      davecb@spamcop.net
    210. Re:Not a bug by Just+Some+Guy · · Score: 1

      Since you're not incrementing "data" in the loop, if write() only managed a partial write, wouldn't that just keep writing the first few bytes of "data" over and over again?
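
      The loop being discussed isn't reproduced here, but the usual fix is to advance past whatever write() reported as written. A minimal sketch (write_all is an illustrative name):

```python
import os


def write_all(fd, data):
    """Write every byte of data to fd, advancing past partial writes."""
    view = memoryview(data)
    offset = 0
    while offset < len(view):
        # write() may accept fewer bytes than it was offered.
        written = os.write(fd, view[offset:])
        offset += written  # without this, the same leading bytes get resent
    return offset
```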

      --
      Dewey, what part of this looks like authorities should be involved?
    211. Re:Not a bug by BobPaul · · Score: 1

      gconf is pretty terrible, but... it's not a database. It's made of flat text files in a folder hierarchy. You can use echo and cat to read and write values to gconf. You can search it using "find ~/.gconf -name term", grep, fgrep, etc.

      One of the major problems with the Windows registry is that it's a proprietary binary format that easily corrupts. Were it flat files, like gconf is, then only those files that were being changed could be lost.

    212. Re:Not a bug by Logic+and+Reason · · Score: 1

      /proc/sys/vm/dirty_expire_centisecs
      /proc/sys/vm/dirty_writeback_centisecs
      (from https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/7)
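
      Both files hold a plain integer in hundredths of a second: dirty_expire_centisecs is how old dirty data must be before it becomes eligible for writeback, and dirty_writeback_centisecs is how often the kernel's flusher wakes up. A small sketch for reading them on Linux (the function names are made up, and proc_dir is a parameter only so the code can be pointed at a test directory):

```python
def read_centisecs(path):
    """Return the value of a centisecond tunable, converted to seconds."""
    with open(path, "r", encoding="ascii") as f:
        return int(f.read().strip()) / 100.0


def writeback_settings(proc_dir="/proc/sys/vm"):
    """Read both writeback tunables and report them in seconds."""
    names = ("dirty_expire_centisecs", "dirty_writeback_centisecs")
    return {name: read_centisecs(f"{proc_dir}/{name}") for name in names}
```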

    213. Re:Not a bug by renoX · · Score: 1

      [[ But lets back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it". ]]

      Except that not too long ago there was a row about Firefox's performance because it was using too many fsync..

      fsync is a bitch on Linux: don't use it? You may lose data. Use it? As it's syncing not just the file but the whole FS, the performance may suck..

    214. Re:Not a bug by jc42 · · Score: 1

      Yeah. Now if we could just make code that uses such things portable to systems like *BSD, Solaris, OS X, etc. It's a real pain trying to track down how you do it everywhere (except MS systems, of course ;-). You have to write code that explores the system to discover which gimmicks they've included and whether they actually work. And you need copies of each kind of system for testing.

      Just calling sync() or fsync() is a lot easier, though as others have observed, some systems (and some disk controllers) have found ways of subtly shooting down that approach.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    215. Re:Not a bug by H0p313ss · · Score: 1

      It's no longer abandonware lurking at the heart of gnome but it's still a nightmare.

      Surely that's a gnightmare?

      --
      XML is known as a key material required to create SMD: Software of Mass Destruction
    216. Re:Not a bug by DavidRawling · · Score: 1

      Dammit, I've posted already and I have mod points. Otherwise I'd have modded you Funny :)

  2. bah? by negRo_slim · · Score: 1

    He advises that "this is really more of an application design problem more than anything else."

    --
    On the Oregon Coast born and raised, On the beach is where I spent most of my days
  3. Don't worry by sakdoctor · · Score: 5, Funny

    Don't worry guys, I read the summary this time, and it only affects the German version of ext4.

    1. Re:Don't worry by Daimanta · · Score: 3, Funny

      Makes perfect sense: Germans are ridiculously punctual, if the allocation is delayed you just KNOW something is terribly wrong.

      --
      Knowledge is power. Knowledge shared is power lost.
    2. Re:Don't worry by microbee · · Score: 1

      And only on a specific distro. (haha Ubuntu users)

    3. Re:Don't worry by migla · · Score: 1

      "And only on a specific distro."

      Except, the first commenter on launchpad, who is not the bug reporter, is running Gentoo.

      --
      Some of my favourite people are from the US; Vonnegut, Chomsky, Bill Hicks.
    4. Re:Don't worry by microbee · · Score: 2, Funny

      OMG, you expect me to RTFA??!! In a BUGzilla?

    5. Re:Don't worry by Anonymous Coward · · Score: 0

      Which once again proves my theory -- Germans love David Hasselhoff

  4. If in other "modern" filesystems.... by ducomputergeek · · Score: 0

    Newer !== better

    --
    "The problem with socialism is eventually you run out of other people's money" - Thatcher.
    1. Re:If in other "modern" filesystems.... by Daimanta · · Score: 1

      Yes, that's why I'll wait for ext4 SP2.

      --
      Knowledge is power. Knowledge shared is power lost.
    2. Re:If in other "modern" filesystems.... by internerdj · · Score: 3, Insightful

      It is a trade-off between reliability and performance. In this case, Older !== better either. A lot of OS design decisions are trade-offs.

    3. Re:If in other "modern" filesystems.... by CannonballHead · · Score: 3, Insightful

      I'll take "I didn't lose my data" over "ext4 runs 1.5x faster than ext3," thank you. What use is performance to me if I have to be absolutely certain that it won't crash, or I lose my (in my very high performance filesystem) data?

      Also, ext4 is touted as having additional reliability checks to keep up with scalability, etc... not less reliable at expense of performance.

      Reliability

      As file systems scale to the massive sizes possible with ext4, greater reliability concerns will certainly follow. Ext4 includes numerous self-protection and self-healing mechanisms to address this.

      (from Anatomy of ext4)

      I can only imagine the response if tests were done on Windows 7 beta that showed a crash after this or that resulted in loss of data. :)

    4. Re:If in other "modern" filesystems.... by internerdj · · Score: 2, Insightful

      Thing is that ext3 is using the same strategy on a smaller scale. The same argument could be made to say that 3 seconds is far too long to be out of date. How many instructions are you going to run in 3 seconds? Defects run at 5-8 per kloc on average. Certainly not all are fatal, but how long of a delay is too long to avoid a potentially fatal defect? Obviously the delay they have chosen is too long, but is the performance hit that ext3 takes for having a 3 second delay rather than a 5 or 10 or 15 second delay worth it?

    5. Re:If in other "modern" filesystems.... by CannonballHead · · Score: 1

      That probably depends on the application(s) running (presuming some sort of server application). Of course, a competent admin in that situation would hopefully choose a suitable filesystem, which may not be ext4 if the delay remains too high.

      I don't know what the performance difference is from a "home user" perspective between ext3 and ext4, but if it isn't really noticeable, then why not stay with a more reliable delay time? If most users wouldn't be able to notice the performance increase, it might be better to cater towards reliability, not performance, in that situation.

      Admittedly, this is all out of my hat, since I haven't done any performance tests (have you? I'd be interested in hearing first-hand experience of performance increases... will have to look online, too...)

    6. Re:If in other "modern" filesystems.... by Quantumstate · · Score: 1

      No, you are misunderstanding the main issue in the bug report. ext4 truncates a file opened with O_TRUNC right away, and then there can be a long delay before the new contents are written. If the system crashes in between, the old file has been wiped clean and the new file hasn't been written. Other filesystems do both operations at the same time, which would seem to be the logical way to do this to avoid data loss.

      Technically the ext4 system is working perfectly correctly since you asked for the file to be truncated when it was opened. However this interpretation of the spec is perhaps not the most sensible since I would consider it reasonable that if I give the two operations to the file system 1ms apart that it would not decide to run the truncate more than 30 seconds before the new write.

      Also, if you look carefully, even saving to a new file and then renaming it over the old one, so that at all times you have at least 1 file on the disk in a complete state, can lose you data, because the filesystem may optimise by doing the rename before the new file is written. So to make it work safely you need to call fsync() to make sure the new file is written before you rename. So basically as the application developer you are being asked to jump through hoops because the filesystem has decided that it does not wish to group operations on a file and can do things like renaming a file before it has been written. It follows the spec but it appears the spec is not the epitome of perfection.
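
      For reference, the write, fsync, rename sequence can be sketched like this. It is an illustration under assumptions, not anyone's actual code: atomic_replace and the ".tmp" suffix are made-up names, and the final directory fsync is an optional extra that works on Linux but not on every platform.

```python
import os


def atomic_replace(path, data):
    """Write data to a temp file, fsync it, then rename it over path."""
    tmp_path = path + ".tmp"  # hypothetical temp name, same directory
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])  # handle short writes
        os.fsync(fd)  # new contents must reach the disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp_path, path)  # atomic on POSIX within one filesystem
    # Optional: sync the directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

      After a crash, a reader then sees either the complete old file or the complete new one, never a truncated mix.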

  5. pr0n by Quintilian · · Score: 5, Funny

    Real reason for the bug report: Someone's angry and wants his porn back.

    1. Re:pr0n by PIBM · · Score: 1

      Well he didn't have much to start with if he would have managed to copy all of it in less than 2 minutes..

  6. Bull by Jane+Q.+Public · · Score: 4, Insightful

    Blaming it on the applications is a cop-out. The filesystem is flawed, plain and simple. The journal should not be written so far in advance of the records actually being stored. That is a recipe for disaster, no matter how much you try to explain it away.

    1. Re:Bull by Lord+Ender · · Score: 5, Funny

      In fact, there is no such thing as an OS bug! All good programmers should re-implement essential and basic operating system features in their user applications whenever they run into so-called "OS bugs." If you question this, you must be a bad programmer, obviously.

      --
      A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
    2. Re:Bull by wild_berry · · Score: 5, Insightful

      The journal isn't being written before the data. Nothing is written for periods between 45-120 seconds so as to batch up the writing to efficient lumps. The journal is there to make sure that the data on disk makes sense if a crash occurs.

      If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.

    3. Re:Bull by Anonymous Coward · · Score: 5, Informative

      This is NOT a bug. Read the POSIX documents.

      Filesystem metadata and file contents are NOT required to be synchronous and a sync is needed to ensure they are synchronised.

      It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).

      RTFPS (Read The Fine POSIX Spec).

    4. Re:Bull by Jane+Q.+Public · · Score: 1, Insightful

      Delayed allocation is like leading a moving target when shooting. The more distant the target, the more you have to lead, and the greater chance there is of something happening between the time you pull the trigger and the time the bullet reaches its target zone: the wind may shift, the target may change speed, or direction, etc.

      The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything really -- going wrong between the time the journal is written and the files being written. And here is just such a case of something changing state (whether it should or not) between those times. You may call it an anomaly but a competent engineer would have to expect this to occur.

    5. Re:Bull by Eugenia+Loli · · Score: 4, Insightful

      Rewriting the same file over and over is known for being risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync. In other words, app developers must be more careful about what they do, not put all the blame on the filesystems. There's only so much that an fs can do to avoid such brouhahas. Many other filesystems have similar behavior to ext4, btw.

    6. Re:Bull by Jane+Q.+Public · · Score: 1

      That contradicts TFA, which clearly states that there is a delay of up to 150 seconds between the time the journal is written and the time the data is actually written to disk.

    7. Re:Bull by gweihir · · Score: 1

      Replacing critical files without syncing and without keeping backups is non-robust behaviour, plain and simple. Seems to me some KDE implementers were getting lazy. Most text editors do it better.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    8. Re:Bull by Jane+Q.+Public · · Score: 2, Interesting

      That does not make it any less of a filesystem limitation. While it is true that a well-written app should be aware of potential timing issues, all the application itself should ever suffer is delays in the I/O. Anything else is a flaw. Other FSs may share the flaw, but it is still a flaw.

    9. Re:Bull by pc486 · · Score: 5, Informative

      Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.

      All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).

      POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.

    10. Re:Bull by gweihir · · Score: 2, Insightful

      The problem is KDE not doing syncs and not keeping backups on updates of critical files. Any competent implementor will try to keep these to a minimum with critical files and if they have to be done, do them carefully. Seems to me the KDE folks have to learn a basic lesson in robustness now.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    11. Re:Bull by Profane+MuthaFucka · · Score: 1

      Except fsync on a Mac is a null operation. The fsync(), it does nothing!

      --
      Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    12. Re:Bull by erroneus · · Score: 1

      An application should not care what file system it is running on. If applications written "normally" have less chance of catastrophic failure, then you can most certainly blame the apps for attempting to take advantage of a particular feature of a particular file system. And if an application does behave this way, it should be written to first determine if it is, in fact, running on that particular file system and if not should disable any features that utilize that filesystem's advantages.

      This is a problem of applications making faulty assumptions.

    13. Re:Bull by knewter · · Score: 1

      You're wrong about this. The second comment covers the appropriate way to write the code, and via POSIX can guarantee that you don't lose data.

      Hoping you've done something right isn't enough.

      --
      -knewter
    14. Re:Bull by rastos1 · · Score: 1

      The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync.

      And now imagine that the original file had hard-links. Your suggestion breaks them.
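
      The effect is easy to reproduce. In this sketch (file names and the function name are made up for the demonstration), the second name keeps pointing at the old inode after the rename:

```python
import os


def rename_breaks_hard_link(directory):
    """Replace a file via rename and report what its hard link sees."""
    original = os.path.join(directory, "settings")
    alias = os.path.join(directory, "settings.link")
    with open(original, "w") as f:
        f.write("old\n")
    os.link(original, alias)  # a second name for the same inode

    tmp = original + ".tmp"
    with open(tmp, "w") as f:
        f.write("new\n")
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, original)  # "settings" now names a different inode

    with open(alias) as f:
        alias_contents = f.read()
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
    return alias_contents, same_inode
```

      The alias still reads "old" and no longer shares an inode with the original name, which is the breakage described above.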

    15. Re:Bull by dedazo · · Score: 1, Interesting

      If I want asynchronous or lazy writes to the disk, I'll code that myself. The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is. I cannot believe this is actually being spun as a "feature" that application developers should code against. It's just mind boggling.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    16. Re:Bull by Anonymous Coward · · Score: 1, Informative

      You are quite simply wrong. The GP states the correct POSIX behaviour. If anything is a flaw it is a flaw in POSIX, *not* the filesystem.

      This kind of crap coupled with the recent Active Directory question where the Slashdot community proved that it does not know what the hell group policies do is the reason that GNU/Linux/GNOME/KDE will not get a (significant) share of the enterprise desktop - Linux fucking weenies who don't know jack.

    17. Re:Bull by kithrup · · Score: 1

      Untrue. What fsync() doesn't do is tell the hard drive to write it to the platter, so the data can be lost in the event of a failure between the fsync() and when the drive actually flushes it. This is spelled out in the man page for fsync on Mac OS X.

      You can verify this by using the fs_usage command to see what is going on -- when the fsync is called, data are indeed written to disk.

    18. Re:Bull by Anonymous Coward · · Score: 5, Insightful

      Bullshit. It is not a filesystem limitation. POSIX tells you what you can expect from file system calls. Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect. (And this is true of every OS I've used in the last 20 years.)

      A great many crap programmers think APIs ought to do what they'd like them to. But APIs don't. At best they do what they are specified to do.

    19. Re:Bull by Jurily · · Score: 0, Offtopic

      This is NOT a bug. Read the POSIX documents.

      Ok, POSIX does not tell us it's bad. Yet it hurts end users. You know, the ones who you don't want fleeing back to Windows?

    20. Re:Bull by Anonymous Coward · · Score: 0

      Jane, you ignorant slut! Systems crash. All you are doing is moving the amount of data loss from 5 seconds for ext3 to 40-150 seconds for ext4. The system maintains its guarantees about the filesystem's integrity. It also maintains its guarantees about filesystem data (in that there is no guarantee data is written until fsync() is called). It's not a flaw in that the standards are particularly clear about what is guaranteed and what is not.

    21. Re:Bull by Waffle+Iron · · Score: 4, Insightful

      The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is.

      If that's the behavior you expect, then you need to be running your apps under an OS like DOS, not POSIX or Windows (which both clearly specify that this is *not* how they function).

    22. Re:Bull by vadim_t · · Score: 1

      You can't optimally code it yourself, because your application doesn't know about other activity in the system.

      And this has been standard behavior on all OSes for a long time. Even DOS had this, with write caching in smartdrv. By default, smartdrv would force a sync before showing the command prompt after exiting an application; if you hit the button before that, you risked data loss.

      The only new thing here is that the delay is bigger than it used to be.

    23. Re:Bull by Anonymous Coward · · Score: 0

      No. Seriously, read through Theo's comments and it all starts to make more sense. It seems obvious that the file system is to blame, but if you take a step back and open your mind and consider all the facts, then applications *have* to change to work together with "modern/recent" file systems. Please, go and RTFBC [read the fscking (haha, no pun intended) bug comments].

    24. Re:Bull by billcopc · · Score: 1

      I have to disagree. I find it absolutely ridiculous that the average X11 app litters my home directory with a gazillion tiny little files. Say what you will about the Windows Registry, at least it got rid of (most of) the old .INI files of yore.

      I'd much rather have a centralized database with all my user preferences, using a standardized API. Maybe then we could be rid of those perverse text parsers that love to break whenever they encounter an unamerican character or some dangly white space. It would put less pressure on the entire filesystem, and improve performance across the board.

      IMO, the filesystem works fine. Machines usually don't lock up at random, and if yours does, you need to fix it! If you're doing business-critical stuff and you're too damned lazy to do the most basic error checking on commits, you deserve to lose data!

      --
      -Billco, Fnarg.com
    25. Re:Bull by Anonymous Coward · · Score: 5, Insightful

      Does anyone else think that 150 seconds is a bit over the top in terms of writing to disk?

      I could understand one or two seconds as you speculate more data might come that needs to be written.

      5 seconds is a bit iffy, as with ext3.

      150 seconds? That's surely a bug.

    26. Re:Bull by billcopc · · Score: 2, Insightful

      Why should synchronous writes be the default ? Programmers are already too lazy and/or stupid to add a simple fsync() where needed, why should we all drop what we're doing, make the slowest option the default, and then have to jump through hoops to make things workable again ?

      If asynchronous writes are the biggest of your problems, you need to find yourself a new career. One that hopefully doesn't require meticulous attention to detail.

      --
      -Billco, Fnarg.com
    27. Re:Bull by Hurricane78 · · Score: 1, Insightful

      You mean unlike those Windows "Admins" who tell me how great the Windows "Event Manager" (log files + viewer) is, and tell me "Ha, I bet Linux does not have such a great tool!". I had to explain to one of them that Unix had those features before Windows even existed. He told me I was talking shit.

      Then he tried to enter "some obscure command" into the "black command window", that someone told him would create a VPN. What he meant were a few routing commands at the shell.

      And this is not rare. It rather is the normal case with "Windows Admins".

      Of course we got our POSIX ACLs and security labels. And PaX, RSbac, SElinux, GRsecurity, LDAP, PAM. And whatever the fuck you want.

      And of course, "Active Directory" is -- again -- just a fancy name for a bad copy of those technologies, that existed in Linux/Unix for years before they were "invented" by Microsoft.

      So I ask you: Who does not know jack?

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    28. Re:Bull by Bronster · · Score: 1

      [citation needed]

      (as in: why the fuck are you rewriting a file with hardlinks in that manner? Use symlinks if you want "follow the changes" and hardlinks if you want "copy on write", easy)

    29. Re:Bull by i.of.the.storm · · Score: 1

      I'm pretty sure any sane OS does the same thing. And I wasn't aware this was some sort of war to gain users; I use both Windows and Linux comfortably for their respective strengths.

      --
      All your base are belong to Wii.
    30. Re:Bull by bluefoxlucid · · Score: 1

      Yeah, and you'd have the slowest system ever. Disk cache exists for a reason and direct writing is slow.

    31. Re:Bull by NotPenny'sBoat · · Score: 2, Funny

      The more distant the target, the more you have to lead, and the greater chance there is of something happening between the time you pull the trigger and the time the bullet reaches its target zone: the wind may shift, the target may change speed, or direction...

      Or your mother-in-law may step between the barrel and the target. Darn.

      --
      What's #FFFFFF and #000000 and #FF0000 all over?
    32. Re:Bull by Jane+Q.+Public · · Score: 1, Insightful

      Please explain how that makes it not a filesystem limitation. Regardless of whether POSIX specifies that behavior or not, it is still a limitation, and it has to do with the filesystem. That means it is a filesystem limitation.

      Further, it appears that in the name of "efficiency", for a given execution thread Ext4 does not queue disk I/O calls chronologically, as it should. (I.e., it does not delay calls for data that has not yet been flushed to disk.) That is a design decision and most definitely has to do with the filesystem.

      Could I write a better one? Likely not, but that is irrelevant. I do not manufacture automobiles either but I know when mine is not working the way it should.

    33. Re:Bull by Mr.+Slippery · · Score: 1

      An application should not care what file system it is running on.

      Ex-fscking-actly. The whole point of a file system is to abstract this stuff away.

      --
      Tom Swiss | the infamous tms | my blog
      You cannot wash away blood with blood
    34. Re:Bull by frieko · · Score: 2, Insightful

      Except that NTFS does exactly the same thing. Perhaps GP meant it's not a filesystem bug.

    35. Re:Bull by dedazo · · Score: 1

      Back when 10MB HDDs with 100ms access times were prevalent and floppies were all the rage, buffered I/O was a good idea. If I find that an application is somehow overwhelming my 3.0Gb/s SATA bus and 10,000 RPM hard drive, I'll be sure to turn this "feature" on.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    36. Re:Bull by DigiShaman · · Score: 4, Insightful

      Wish I had mod points for you AC as I agree with you. 150 seconds is 2.5 minutes! I don't know of any file system, let alone a RAID controller, that waits that long to commit the data.

      If this is a feature and not a bug, better be sure your computer is connected to a UPS. Damn!

      --
      Life is not for the lazy.
    37. Re:Bull by Dahamma · · Score: 2, Insightful

      Oh great... basing ext4 performance gains on caching writes in the OS for 2 minutes just means they will focus their optimizations in ways that will suck even worse than ext3 does for applications that can't afford the risk of enabling write caching...

    38. Re:Bull by LWATCDR · · Score: 4, Informative

      It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific command called fsync to flush the buffers to prevent the problem.
      In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html

      Yes, if we had a perfect world we would have instant IO, but we do not. The flaw is in the application, plain and simple.
      They didn't use the API properly and it really is just that simple.
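
      For anyone who wants to see what "use fsync" looks like in practice, here is a minimal Python sketch. The helper name write_durably and the file name settings.conf are made up for illustration; os.fsync is Python's wrapper around the fsync() call linked above. Note that the language's own buffer has to be flushed first, otherwise fsync has nothing to push.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to push it to stable storage.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to flush this file to the device


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, "settings.conf")
    write_durably(demo_path, b"theme=dark\n")
    with open(demo_path, "rb") as f:
        print(f.read())
```

      Whether the drive's own write cache honors the flush is a separate question that this sketch does not address.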

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    39. Re:Bull by Anonymous Coward · · Score: 2, Informative

      Right... that way a single error can brick the whole system at once.

    40. Re:Bull by hdparm · · Score: 1

      It's all over now. You clearly lost the argument when you used an automobile analogy.

    41. Re:Bull by Jurily · · Score: 1

      Except that NTFS does exactly the same thing. Perhaps GP meant it's not a filesystem bug.

      "When you say "I wrote a program that crashed Windows", people just stare at you blankly and say 'Hey, I got those with the system, *for free*'."

      -- Linus Torvalds

    42. Re:Bull by icebike · · Score: 4, Insightful

      It's not a KDE issue. It's not a Gnome issue.

      It's a file system risk issue, and it affects everything running on the box.

      The EXT4 developers have decided it's OK to increase the risk window by 3000% and
      risk a crash for two and a half minutes in an attempt to gain a little
      performance. (Damn little performance).

      With EXT3 the risk window was 5 seconds. Now it's 150 seconds.

      It's ridiculous to move what should be a low-level data integrity function
      out of the File System and inflict it on user-land code.

      --
      Sig Battery depleted. Reverting to safe mode.
    43. Re:Bull by kasperd · · Score: 1

      The filesystem should be hitting the metal about 0.001 microseconds after I call write()

      It most certainly should not. At that time the drive head will be located over some data that you do not want to be overwritten. Background writes exist for a reason. If applications have specific needs, they will have to sync the data themselves, otherwise it is up to the user. I don't think there is any good excuse for waiting multiple seconds before starting the writes if the disk is otherwise idle. But if the disk is busy, some delay is to be expected.

      --

      Do you care about the security of your wireless mouse?
    44. Re:Bull by Jane+Q.+Public · · Score: 0

      My point was and still is: if the data is not flushed to disk yet, it should either be accessible from the buffer, or not at all. Would that be efficient? It would certainly cause delays in programs that tried to do I/O on files that hadn't been flushed yet. It would probably not be easy to implement that in code, and perhaps it would be an overall performance sapper. I don't know; I do not make filesystems.

      Nevertheless, the fact that this situation is not addressed by the filesystem itself is a limitation. Perhaps that missing "feature" is a tradeoff against something that would be worse; again I do not know. But even if it is 100% by design, and even if it is a conscious, beneficial tradeoff, it is still a filesystem limitation.

    45. Re:Bull by niw · · Score: 1

      Why should synchronous writes be the default? Programmers are already too lazy and/or stupid to add a simple fsync() where needed, why should we all drop what we're doing, make the slowest option the default, and then have to jump through hoops to make things workable again?

      Not only that, we would end up in the same position that the IE8 team was complaining about with HTML and the doctype with new developers copy-pasting.

    46. Re:Bull by BikeHelmet · · Score: 5, Insightful

      Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide.

      And then an antivirus program updates while Firefox is starting and a video is transcoding, and your program either slows to a crawl or crashes after 30 seconds of not receiving or being able to write any data.

      Recently I was playing Left4Dead when one of my HDDs in my RAID array died in a very audible way. All the drives spun down, then 3 of them came back online. IOPS went to zero for over 60 seconds. No data in or out to those devices!

      Interestingly, Ventrilo kept running fine. Left4Dead completely froze, but a minute or so after the 3 drives came back online, it unfroze. (CPU catching up?) All the while I was freaking out on Ventrilo, much to my friends' amusement.

      Pretty much everything else crashed, except for Portable Firefox... uTorrent crashed, but first it left corrupted files all over - appearing as undeletable folders, which require a format to remove.

      Time for a disk wipe. Thank you, shitty developers! Next time, use the API properly, and if you must have it written to disk, sync it immediately after you write!

    47. Re:Bull by Anonymous Coward · · Score: 0

      again with the security aspect. this is what i'm talking about. it's not about that - it's about pushing configuration to users. we *could* have both but cockchokers like you always conflate the issue.

      sure users can run some app that doesn't use GPs and good fucking luck to them, let them work out their own configuration nightmare. then proper security as listed is good, but you've just proved the point that weenies don't know what ADGPs are (should) be used for.

    48. Re:Bull by gweihir · · Score: 1

      Oh great... basing ext4 performance gains on caching writes in the OS for 2 minutes just means they will focus their optimizations in ways that will suck even worse than ext3 does for applications that can't afford the risk of enabling write caching...

      Then use ext2? I understand it is still being maintained? In fact I use it in several places right now without problems. What is your complaint here, exactly?

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    49. Re:Bull by Anonymous Coward · · Score: 0

      Don't worry KDE 5 will fix this and it's going to be awesome.

    50. Re:Bull by Anonymous Coward · · Score: 0

      Let me get this straight - a story about a bug somewhere in the linux stack (argue amongst yourselves whether the fault is the fs or kde, neither I nor any other normal user would give a shit) leading to massive data loss - and here we have a +4 insightful post explaining why windows suxxx because it's only bad copies of unix. Only on slashdot.

    51. Re:Bull by dedazo · · Score: 1

      Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide.

      No, not really. I do assume that writing 40 small files to disk immediately instead of waiting 10 seconds to do it is not going to task a modern system in any meaningful way, so the file system shouldn't be trying to "help" me by aggressively buffering that sort of trivial operation.

      As for your data point, I probably wouldn't code for the event of a disk physically dying, and I don't think developers are "shitty" because they don't plan properly for that. That's a very different scenario than the power going out and the computer shutting down immediately, just in case for some reason you stupidly managed to confuse the two. You know, because the applications are also gone, so no one is trying to write or read from the disks to begin with.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    52. Re:Bull by gweihir · · Score: 2, Insightful

      Back when 10MB HDDs with 100ms access times were prevalent and floppies were all the rage, buffered I/O was a good idea. If I find that an application is somehow overwhelming my 3.0GB/s SATA bus and 10,000 RPM hard drive, I'll be sure to turn this "feature" on.

      Use the "sync" option on a mount some day and be surprised. Synchronous I/O is dog-slow.
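
      If remounting with "sync" is too much trouble, a rough feel for the cost can be had from userland. The sketch below is an approximation, not the same thing as a sync mount: it forces every record to the device with fsync before writing the next one and compares that with ordinary buffered writes. The function name write_records and the record sizes are invented for the demo, and the timings will vary a lot by hardware and filesystem.

```python
import os
import tempfile
import time


def write_records(path: str, records: list, sync_each: bool) -> float:
    """Write records to path and return the elapsed time in seconds.

    With sync_each=True each record is flushed to the device before the
    next one is written (a loose stand-in for synchronous I/O).
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for record in records:
            f.write(record)
            if sync_each:
                f.flush()
                os.fsync(f.fileno())
    return time.perf_counter() - start


if __name__ == "__main__":
    records = [b"x" * 512 for _ in range(200)]
    with tempfile.TemporaryDirectory() as d:
        buffered = write_records(os.path.join(d, "buffered.dat"), records, False)
        synced = write_records(os.path.join(d, "synced.dat"), records, True)
    print(f"buffered: {buffered:.4f}s  fsync per record: {synced:.4f}s")
```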

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    53. Re:Bull by Anonymous Coward · · Score: 1, Insightful

      If you want to be really picky it's a limitation of the POSIX API. I don't just mean that POSIX allows this behaviour, but that this behaviour is necessary to get reasonable performance while conforming to the API.

      Filesystem transactions (as seen in Windows 2008 and Vista) provide a much better balance of performance and data integrity than the POSIX API.

      Could I write a better one? Likely not, but that is irrelevant.

      Nobody could. The API isn't part of the filesystem, but it limits what can be done in the filesystem. It isn't possible to create a filesystem that offers the guarantees you want, uses the existing API and performs even half as well as existing filesystems.

      You might think I'm being pedantic but this is absolutely not a problem with ext4. It's fundamental to the way Linux interfaces with filesystems. If you want to program on Linux (or any UNIX) you have to deal with it. Or work to get it changed :)

    54. Re:Bull by BikeHelmet · · Score: 1

      I didn't confuse the two. It's just that I have a UPS, so the power going out and everything shutting off doesn't apply to me. :P

      But I did find it interesting that different apps had different responses. Left4Dead just... waited. Ventrilo was fine. Firefox was fine, even though it failed to write. Maybe this is why it works *okay* from CD-ROMs.

      uTorrent though... I was seeding quite a few files (TV shows, OpenOffice versions, linux distros), and it turned them all into corrupted folders. >_>

      Methinks the API wasn't quite followed to the letter! Perhaps it doesn't make the developer shitty, but it certainly makes that piece of code shitty. I can't imagine what it's doing for it to have such a vastly different result from everything else that was running.

    55. Re:Bull by vadim_t · · Score: 5, Insightful

      It's not going to happen immediately in any case. Some optimizations can only be done if you introduce a delay, and once introduced you have to deal with the fact that there's a delay. Just because it's one second instead of a minute doesn't mean your computer can't crash at precisely the wrong moment.

      While I'm not an expert in filesystems, I'd expect writing a single file to be at least 4 writes: inode, data, update the directory the file is in, and a bitmap to show space allocation. If there's a journal add a write for the journal. Each of those will require a seek due to all of these things being in different places on the disk in most filesystems.

      So your 40 small files just turned into 160-200 seeks, which at 8ms each will take roughly 1.3 to 1.6 seconds to complete.

      Now let's suppose we can batch things up. We need to write the inode and data for each file, and can do just one seek for the directory (the same for all), and the bitmap and journal can be updated in one operation. Now we're down to 2 writes per file, giving 80 seeks, plus 3 for metadata, giving 83 seeks, which can be done in about 0.66 seconds.

      But what if we do delayed allocation and create all the inodes and write all the data as one large contiguous area? We're now down to 5 writes total, with a seek time of 40ms. The time needed to write the data can probably be disregarded, since modern disks easily write at 50MB/s, and those 40 files with metadata probably amount to less than 32K.

      And with some optimization, we just reduced the time it takes to write your 40 files to roughly 3% of the unoptimized time.
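
      The back-of-the-envelope numbers can be worked through in a few lines of Python. This only restates the assumptions used in this comment (40 files, 8ms per seek, 4 to 5 writes per file unoptimized, 2 per file plus 3 shared when batched, 5 in total with delayed allocation); it is not a measurement of any real filesystem.

```python
SEEK_MS = 8    # average seek time assumed in the comment
FILES = 40     # number of small files in the example


def total_ms(seeks: int, seek_ms: int = SEEK_MS) -> int:
    """Total seek time in milliseconds for a given number of seeks."""
    return seeks * seek_ms


# Unoptimized: inode, data, directory, bitmap (4 writes), plus 1 with a journal.
unoptimized_low = total_ms(FILES * 4)     # 160 seeks -> 1280 ms
unoptimized_high = total_ms(FILES * 5)    # 200 seeks -> 1600 ms

# Batched: inode + data per file, plus 3 shared metadata writes.
batched = total_ms(FILES * 2 + 3)         # 83 seeks -> 664 ms

# Delayed allocation: 5 large writes in total.
delayed = total_ms(5)                     # 5 seeks -> 40 ms

if __name__ == "__main__":
    print(unoptimized_low, unoptimized_high, batched, delayed)
    print(f"delayed / unoptimized: {delayed / unoptimized_high:.1%}")
```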

      You're not going to get this sort of improvement without some sort of delay. If you insist on a per-file write you'll get really, really awful performance on the sort of workload you're using as an example. And you can even see it in practice, just boot a DOS box, and do benchmarks with and without smartdrv. Running something like a virus scanner should show a huge difference in the presence of a cache.

    56. Re:Bull by Eskarel · · Score: 2, Insightful

      That application developers don't always get to choose what filesystem their application is being run on would be my guess.

      Disk caching is a good thing (well, at the moment; if/when SSDs become large enough and cheap enough to replace regular old spinning disks for speed-dependent applications, then it probably won't be all that useful), it makes everything faster and more efficient. That said, 2.5 minutes is an absolutely huge amount of time in computer terms; even on a really slow PC these days that's billions of operations being executed before any attempt is even made to write the data to disk. It's a huge and unnecessary risk. Average latency on normal hard drives now is easily below 5 ms; queueing up for 30,000 times that to try and make things more efficient is just stupid.

    57. Re:Bull by gweihir · · Score: 3, Insightful

      It is a KDE issue. Only userland knows which data is critical. Only userland knows whether data can be backed up or not. The OS cannot ensure full data integrity without massive negative performance impact, however much you may wish for it. So what the OS does is give you a way to tell it which data needs to be on disk and which data should be on disk in a while if nothing goes wrong.

      There really is no other way of doing it. Unless you think fundamentally defective code is acceptable if the risk of getting hit is a bit smaller?

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    58. Re:Bull by LWATCDR · · Score: 2, Informative

      Just use fsync()
      Problem solved. Read the POSIX docs, or the clib docs and you will never run into this problem.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    59. Re:Bull by Eskarel · · Score: 1

      Just because you can do something, doesn't mean it's not stupid.

    60. Re:Bull by Anonymous Coward · · Score: 0

      two words: straw & man

    61. Re:Bull by mr_walrus · · Score: 2, Insightful

      only userland knows WHICH data is critical.
      dude, ALL data is critical.

      no, this is a serious implementation stoopidity in ext4, et.al.

      blame the victim. eeesh. data rape is still rape.

      and saying programs should be calling fsync is absurd.
      i'm old enough to remember when programmers were admonished
      to NOT call fsync, or it would "slow down the system."

      sync/flushing data already written by userland standard i/o calls
      should never be a userland responsibility.

      [shaking head...]

    62. Re:Bull by LWATCDR · · Score: 4, Insightful

      No. That is why we have fsync().
      No file system will promise you data integrity with a power failure. That is why you should run with a UPS.
      You can not depend on the write delay time. What happens if you get a really fast processor and say a really slow drive? Unless you are building software that only runs on ONE set of hardware you just can not do that.
      This is a bug that was always in KDE and they got lucky up till now.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    63. Re:Bull by LWATCDR · · Score: 1

      That is a bad assumption. It isn't the way POSIX works. It is documented and easy to work around.
      After any write that has to survive a crash, call fsync() on the file before you close it, as documented.
      Heck, even under DOS you couldn't be sure that the file was written until you called close.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    64. Re:Bull by gweihir · · Score: 4, Insightful

      dude, ALL data is critical.

      If you really think that, then you should leave the era of modern disk access and mount all your partitions with the "sync" option. Then none of your software will have to think about syncing. Of course all file access will be so slow that nobody will want to work with that system either.

      Hmm. I wonder why "sync" is not a default mount option?

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    65. Re:Bull by Anonymous Coward · · Score: 0

      You may call it an anomaly but a competent engineer would have to expect this to occur.

      any competent engineer would have read the fricking documentation!

    66. Re:Bull by gweihir · · Score: 1

      I agree. This is no way to treat startup-critical configuration files.

      With adept use of fsync, you can make sure that one valid copy of the file is still on disk, even with a power failure. May be the old one, but in most cases that is far better than nothing.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    67. Re:Bull by Profane+MuthaFucka · · Score: 1

      Didn't I just fucking say that? It's allowed behavior under POSIX to have a null fsync implementation, and that's what the Mac does.

      Everyone else writes the goddamn data to the platter, but you also got to fsync on the directory too.
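
      The directory point is easy to miss, so here is a small Python sketch of it. It assumes a POSIX-like system where a directory can be opened read-only and passed to fsync (this works on Linux; it will not work on Windows). The function name create_file_durably is made up for the example.

```python
import os
import tempfile


def create_file_durably(directory: str, name: str, data: bytes) -> str:
    """Create a new file and flush both its contents and its directory entry.

    Hypothetical helper; assumes a POSIX-like system.
    """
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())       # flush the file's contents

    # The new name is recorded in the directory, so flush the directory too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(create_file_durably(d, "example.conf", b"key=value\n"))
```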

      --
      Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    68. Re:Bull by shutdown+-p+now · · Score: 1

      Back when 10MB HDDs with 100ms access times were prevalent and floppies were all the rage, buffered I/O was a good idea.

      It still is. On Windows, at least, it's possible to turn buffering off, and it's off by default for removable drives (UMS, for example). This allows you to unplug them without unmounting, but the speed penalty is pretty serious, which can be easily demonstrated by enabling caching and doing a few simple tests.

    69. Re:Bull by Lord+Ender · · Score: 1

      You followed my mocking argument with the exact same argument, and you threw in a bad analogy to boot; but you got modded interesting while I got modded Funny? I think yours is far more humorous!

      --
      A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
    70. Re:Bull by amirulbahr · · Score: 3, Informative

      Who modded this up? Jane Q. Public is completely clueless on this topic, but she manages to sound like she has an idea to fellow clueless moderators. She should be called out for the karma whoring ignoramus she is.

      Some choice quotes from her on this thread.

      Delayed allocation is like leading a moving target when shooting.

      BadAnalogyGuy would be proud. Probably also worth mentioning that without delayed allocation, the system would be unbearably slow.

      The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything really -- going wrong

      A kernel crash or power outage is certainly something that could go wrong. Modern journalling file-systems handle this gracefully by making sure the file-system is in a consistent state when it comes back up.

      The filesystem is flawed, plain and simple.

      You'll realize why that one is a gem when you read her next quote. As the discussion continues, she begins to realize how far off the mark she is and begins to correct...

      It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault.

      Still off the mark, but perhaps she is beginning to figure out what a file system should offer and what the issue being discussed is.

      If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples.

      Go ahead and do that. But if you want to make sure your data is written, in case of a kernel crash or power outage, then you had better understand what is going on at the FS level.

      As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter.

      No, but you should understand the API of the language you are dealing with. Since when does a compiler handle disk I/O anyway? As for your interpreter, it is free to call fsync whenever it wants, but what has that got to do with the FS again?

      Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.

      The purpose of this quote is to demonstrate that she both has no regard for TFA and also has no idea what the issue being discussed is. I encourage anyone looking to give her mod points to actually RTFA and also do a bit of background reading on file systems and in particular delayed writes.

      My point was and still is: if the data is not flushed to disk yet, it should either be accessible from the buffer, or not at all.

      This sentence alone deserves a -1 Huh? If you do a write, and it is successful, then you can do a read on the same file and it will return what you wrote, whether or not it had been flushed to disk. This is the way it is supposed to work. Think about it for like 10 seconds and you'll begin to get it.
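
      A quick way to convince yourself of this is the Python sketch below. It writes through one descriptor without ever calling fsync, then reads the file back through a second, independent open. The helper name read_back_without_sync is invented for the demo, and it assumes a small write to a regular file, where os.write is not expected to return short.

```python
import os
import tempfile


def read_back_without_sync(path: str, data: bytes) -> bytes:
    """Write data with no fsync, then read it back via a separate open."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)                 # goes to the kernel's cache; no fsync
        with open(path, "rb") as reader:   # independent open of the same file
            return reader.read()
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(read_back_without_sync(os.path.join(d, "demo.txt"), b"hello"))
```

      The read sees the data because it is served by the kernel, whether or not the blocks have reached the disk. What is not guaranteed is that the data survives a crash before it is flushed.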

      not supposed to have to worry about OS-specific details

      WE ARE TALKING ABOUT UNEXPECTED KERNEL CRASHES AND POWER OUTAGES. If you care about that situation then you should get a clue before you start coding. If not, then what is the problem, or was it fault... er, sorry limitation?

      One should not have to know about syncing to do something like a few simple file writes

      And one doesn't need to if she is not concerned with the rare possibility that the system CRASHES OR LOSES POWER in the next few minutes.

      Anyway, I've never called out another poster like this before and now I feel dirty.

    71. Re:Bull by Dr.+Smoove · · Score: 1

      And what fs would this be? If the standard filesystem repair utility didn't fix that, then I'll remember to never use it.

      --
      "If you plant ice, you're gonna harvest wind."
    72. Re:Bull by theshowmecanuck · · Score: 0

      Holy crap... this is like, "Whoooosh!!!" AND "good analogy," at the same time.

      --
      -- I ignore anonymous replies to my comments and postings.
    73. Re:Bull by hairyfeet · · Score: 1

      Did you try booting into a Linux Live CD to delete the folders? Every time I have run into the "undeletable file/folder" issue and couldn't get around it through CMD I was able to destroy them with a Live CD. While it would depend on your RAID and how well it is supported, I always look at format/reinstall as a last resort. I have had some customers do some seriously stupid things, like game during a thunderstorm and get half the box fried, but so far a little time and the right tools (I have found a Live CD on USB and Computer Repair Utility Toolkit V2) will fix pretty much anything.

      Oh and sorry about having to use Megaupload, but the makers of Computer Repair Utility toolkit got told they weren't allowed to make the toolkit anymore or host the files. But as we all know once something is out in the wild it is out there for good. Yay Internet! You would think someone would be happy to have their programs advertised especially since the toolkit includes easy to use links to all the home pages and all the tools were freeware. I guess it shows even geeks can be as stupid as PHBs.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    74. Re:Bull by Richard_J_N · · Score: 1

      That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.

      There is still noflushd, which allows the disk to spin down for up to 5 minutes - if you really want the power-saving at the expense of risking your data. Also, with a modern SATA disk supporting Native Command Queuing, the OS should immediately write the data to the disk's buffer, and the disk's firmware gets to decide about re-ordering.

      As for the argument about using sqlite - why have yet another abstraction? After all, the filesystem is already a sort of database!

    75. Re:Bull by Zero__Kelvin · · Score: 1

      "Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide."

      I was very surprised to find out that dedazo wrote something that stated the above claim, until I looked back and saw that he wrote nothing of the sort:

      "If I find that an application is somehow overwhelming my 3.0GB/s SATA bus and 10,000 RPM hard drive, I'll be sure to turn this "feature" on." [Emphasis Added]

      If you missed it, the word "an" implies a selection from a group. Indeed, dedazo was clearly writing from a sysadmin perspective, and not as a developer. There is no such thing as a sysadmin that only thinks of one application.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    76. Re:Bull by EvanED · · Score: 1

      Of course we got our POSIX ACLs and security labels.

      By my understanding, doing something as simple as saving a file in Emacs will kill any ACLs associated with it. It definitely will kill other extended attributes associated with the file. Am I wrong on this?

    77. Re:Bull by theshowmecanuck · · Score: 1

      Me, I'm not going to worry. I'll stay on ext3 for a good while yet... if this really is a problem, it is a self correcting one from my standpoint. You see, if a large number of people move their data onto ext4 formatted drives and then lose their data because of system crashes that normally would not have caused data loss on ext3... then they will eventually move back to ext3 and ext4 will dry up and blow away (the old computer adage about people not complaining about really bad software is true here too... you won't hear anything about the bad software because people will just stop using it entirely and use something that does work). Or the developers will eat crow and fix the issues with ext4. Or nothing will happen, the sky won't fall, and everyone will love their new ext4. I'm wearing a hard hat while I wait.

      --
      -- I ignore anonymous replies to my comments and postings.
    78. Re:Bull by phantomlord · · Score: 2, Interesting

      I just bought a new laptop that, unfortunately, came pre-installed with Vista. I spent the better part of the day creating settings by hand, tweaking this and that, to get things set up how I wanted them to be. I don't know of any handy way to copy my XP registry over from my old laptop to Vista on the new laptop (I could be wrong, I don't use windows for anything of importance so I haven't taken the time to learn all the power user tricks). That's to say nothing of all my application settings that were lost since they were written to the registry in my old laptop.

      I installed Linux on it as well. You know what it took to copy over all of my settings and data?
      cd /home
      cp -a /mnt/nfs/home/user .

      <sarcasm>That registry sure does make everything so much easier...</sarcasm> and that cp works even across different architectures, Linux distributions, etc.

      --
      Don't leave your mind so open that your brain falls out. Don't close it so much that you cut off the blood.
    79. Re:Bull by icebike · · Score: 1

      All True.

      And I'm running ReiserFS 3 for a while longer. I live in an area with crappy power and
      I've had many power failures on machines with and without UPSs, and I've never lost any
      data.

      --
      Sig Battery depleted. Reverting to safe mode.
    80. Re:Bull by snemarch · · Score: 1

      As others have mentioned, try mounting your filesystems in synchronous mode and check performance.

      There's another thing though, which people seem to miss: this isn't just about caching, it's also necessary in order to reduce file fragmentation... which is a thing that kills performance even on 10k-rpm drives.

      --
      Coffee-driven development.
    81. Re:Bull by setagllib · · Score: 1

      A UPS that lasts more than 1 minute on a high-watt machine is already going to cost several hundred. So you'd need to pair this with a "forced write" (is sync(1) enough?) as soon as the power cut is detected.

      --
      Sam ty sig.
    82. Re:Bull by Waffle+Iron · · Score: 1

      Back when 10MB HDDs with 100ms access times were prevalent and floppies were all the rage, buffered I/O was a good idea. If I find that an application is somehow overwhelming my 3.0GB/s SATA bus and 10,000 RPM hard drive, I'll be sure to turn this "feature" on.

      Your bus bandwidth and drive RPM have nothing to do with track-to-track seek time, which is the bottleneck for synchronous writes, and which hasn't improved much more than 10X since those days. Meanwhile, your CPU is almost 10000X faster. So you're exactly wrong: you need more buffering than ever in order to actually use your additional CPU performance as more than a space heater.

    83. Re:Bull by vadim_t · · Score: 4, Interesting

      That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.

      Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.

      There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.

      Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.

      Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
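
      The write-to-temporary-then-rename sequence can be sketched in Python as follows. This is one common way to do it, not the only one: the name atomic_write and the ".tmp-" prefix are made up, it assumes a POSIX-like system (rename within one directory is atomic there, and a directory can be fsynced), and note that mkstemp creates the file with owner-only permissions, which a real application might want to adjust.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves the old or the new version.

    Hypothetical helper; assumes a POSIX-like system.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())     # new contents reach disk first
        os.replace(tmp_path, path)     # then the name is switched atomically
    except BaseException:
        try:
            os.unlink(tmp_path)        # clean up the temp file on failure
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "kderc")
        atomic_write(target, b"old settings\n")
        atomic_write(target, b"new settings\n")
        with open(target, "rb") as f:
            print(f.read())
```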

      Even if the FS does like you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in fs-sized blocks. So your 10K sized file can be interrupted in the middle and get cut off at 4K in size after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.
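
      Since the FS can't know your file's structure, the application has to be able to recognize a cut-off file on its own. One possible approach, sketched in Python below, is to prefix the payload with a checksum and verify it on load. The frame/unframe names and the layout (SHA-256 digest followed by payload) are invented for this illustration and are not something KDE or any filesystem does.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unframe(blob: bytes):
    """Return the payload if the blob is intact, otherwise None."""
    if len(blob) < DIGEST_LEN:
        return None
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        return None                    # truncated or corrupted
    return payload


if __name__ == "__main__":
    blob = frame(b"a" * 10240)         # a roughly 10K file
    print(unframe(blob) is not None)   # intact file is accepted
    print(unframe(blob[:4096]))        # file cut off at 4K is rejected
```

      An application that gets None back can fall back to defaults or to an older copy instead of choking on half a file.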

      Also, with a modern SATA disk supporting Native Command Queuing, the OS should immediately write the data to the disk's buffer, and the disk's firmware gets to decide about re-ordering.

      NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:

      1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update on-disk data first, journal second, leading to disk corruption. The hard disk can't magically figure out the right way to write the data so that it remains consistent; only the OS and the application can ensure that.

      2. NCQ is limited to 32 commands anyway, so the OS has to do its own queue handling regardless.

      As for the argument about using sqlite - why have yet another abstraction? After all, the filesystem is already a sort of database!

      Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and checking the return code of every system call, you can use something like sqlite that does it internally and saves you the effort, and returns a single value that tells you whether the whole update worked or not.
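
      As a rough illustration of that "one value for the whole update" idea, here is a sketch using Python's standard sqlite3 module. The table layout and the function name save_settings are assumptions made up for the example, not anything KDE or GNOME actually uses.

```python
import sqlite3


def save_settings(db_path, settings):
    """Store all key/value pairs in one transaction; return True on success.

    The schema and the name save_settings are hypothetical.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS settings "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                settings.items(),
            )
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

      Either every pair in settings is stored or none of them is, and the caller only has to look at one boolean instead of checking each write, fsync and rename itself.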

    84. Re:Bull by snemarch · · Score: 1

      Export HKEY_CURRENT_USER registry hive from old machine, import on new. Copy %APPDATA%. That will get you partially there... but of course a lot of applications insist on doing stuff in wacky non-spec ways, write config to HKEY_LOCAL_MACHINE (equivalent to wanting per-user settings in /etc on *u*x), et cetera.

      The registry is a really neat thing, both performance and security wise (it's even separately journalled), but a lot of developers unfortunately misuse it.

      --
      Coffee-driven development.
    85. Re:Bull by JumboMessiah · · Score: 1

      man mount

      Look for data=journal

      There's more than one mode of journaling in ext[3-4]...

      Or if you're too lazy.

    86. Re:Bull by moonbender · · Score: 1

      Unless you're running a laptop, in which case it might save non-trivial amounts of power. ... And of course, non-flushed buffers aren't even the issue with the Launchpad bug reports, it's really more the fact that the cache wasn't committed to disk while at the same time the truncate operations WERE committed, but whatever.

      --
      Switch back to Slashdot's D1 system.
    87. Re:Bull by slamb · · Score: 2, Interesting

      RTFPS (Read The Fine POSIX Spec).

      I've RTFPS (well, not quite - the Single Unix Specification; where do I find the Fine POSIX Spec free online?).

      I am...dissatisfied with this answer because POSIX appears to provide so few guarantees that applications basically have to assume more than it promises to get anything done. The Linux documentation doesn't appear to promise anything more. For instance,

      • If I create a new file and fsync it, am I guaranteed that it hit disk? (Hint: on Linux this isn't true according to the #ifdef linux block of this file. It says I must fsync the directory, and nothing in POSIX even says it's possible to open() or fsync() a directory; you have to use opendir().)
      • If I overwrite or append just a few bytes of an existing file and lose power before calling fdatasync(), what is guaranteed about the contents of the file? If you say "nothing", the only safe approach to updating anything is to write a complete replacement for the file, fsync() it (but pay attention to the special Linux case described above), and rename() it into place. Of course, that's a pretty significant performance hit and basically screws over any reasonable way of implementing shadow paging or write-ahead logging.

      So...where is the specification that describes the filesystem's behavior in a useful way?

    88. Re:Bull by Anonymous Coward · · Score: 0

      Sorry, what? What is high wattage?

      My core2quad machine with 3 SATA disk RAID runs for about 20 minutes on a tiny APC UPS I bought from newegg for less than $100.

      There is something seriously wrong with your system or UPS if it doesn't monitor the AC and battery condition and automatically do a clean OS shutdown when the battery gets to a reasonable charge threshold.

    89. Re:Bull by slamb · · Score: 2, Insightful

      Rewriting the same file over and over is known for being risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync.

      Except on Linux you must sync the parent directory as well. None of this behavior is usefully documented anywhere, so it's upsetting when kernel developers tell application developers they're doing it wrong.
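
      For what it's worth, the directory sync mentioned here can be sketched like this. It assumes Linux behaviour, where a directory can be opened read-only and passed to fsync; the helper name fsync_directory is invented for the example.

```python
import os


def fsync_directory(path):
    """fsync the directory containing path so the directory entry is on disk.

    Hypothetical helper; relies on Linux allowing open() + fsync() on a directory.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Other platforms may refuse this (Windows cannot open a directory this way).
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return directory
```

      It would be called once after the rename, with the path of the file that was just renamed into place.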

    90. Re:Bull by slamb · · Score: 2, Interesting
      To clarify my own question:

      If I overwrite or append just a few bytes of an existing file and lose power before calling fdatasync(), what is guaranteed about the contents of the file?

      I'd like to know which of the unmodified bytes are guaranteed to be preserved. None of them? All of them? Ones not in the same block as new bytes? (And what's a block? Is it st_blksize, or is it possible that block size varies within the file or changes over time?)
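
      For reference, st_blksize can be read as shown below. As far as the stat() documentation goes, it is only the filesystem's preferred I/O size for that object, and it may differ from file to file, so this sketch shows how to query it and nothing more. It does not tell you which bytes survive a crash.

```python
import os


def preferred_block_size(path):
    """Return st_blksize: the I/O size the filesystem prefers for this file.

    This is a hint for efficient I/O, not a promise about write atomicity.
    """
    return os.stat(path).st_blksize
```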

    91. Re:Bull by gweihir · · Score: 1

      While I agree that this is mostly a userland issue, 150 seconds seems absurd without some HUGE performance gains. Even then, they should make it configurable so people can adjust the performance/reliability aspect to their liking. I'm curious if the exceptionally long wait time is a requirement from their overall ext4 design though...

      Agreed. The 150 seconds seem excessive to me too. While I still think that the KDE folks are responsible for the reported issue, 150 seconds seems like a span in which a lot can happen. And while in ext3's 5 seconds it is unlikely that you start and finish an entire new task before the flush, with 150 seconds that seems possible. Maybe all this is (besides unimpressive code by KDE) is a poorly chosen default.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    92. Re:Bull by Anonymous Coward · · Score: 0

      So it is 2.5 minutes. How is that a problem? You know you have to shutdown your computer instead of yanking the power cord or there's a high chance you will lose data, don't you?

      The problem is bad assumptions, as described in the mailing list posts. The application assumes that the order of the operations is strictly preserved: "If I, the application, write this new file and then rename it over the old, and a crash happens sometime after that sequence, then upon reboot the filesystem will either have the old file, the old file and the new file or the old file replaced by the new file." In most cases this assumption is WRONG. That's because normally data is not journaled, but metadata is. Thus the rename operation can take effect without the data being there, leaving you with the old file replaced by an empty new file. Not journaling data is a performance tradeoff with a known risk.

      The application can ensure that the data is on the disk before it issues the dangerous rename operation: It should fsync(). That's the filesystem way of doing it. If you frequently need to update small bits of information, fsync() is slow, so you want a database with transaction support. Then you can handle all the dependencies you want: For example, the filesystem can never guarantee that two files have either both been updated or are both in the old state in case of a crash. That kind of integrity requirement is a job for a database.

    93. Re:Bull by Malc · · Score: 1

      What's to stop application developers saying that all of their data is critical and putting in that call everywhere? Surely that will bugger up the whole idea of delaying the writes?

    94. Re:Bull by icebike · · Score: 2, Insightful

      One way to test if your argument makes sense is to extend it to absurdity.

      What if the FS NEVER wrote anything until a fsync was called?
      All applications would then have to add these calls.

      The net effect would be uncontrolled write management at the application level with no hope of IO management or optimization at the FS/OS level.

      Is this what you propose? Is this technically correct? Be careful what you wish for.

      If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.

      So you see, I believe you are recommending something that is not in the best interests of the OS or the users in the long run. (However technically correct it might be at the moment). This functionality really does belong at the OS/FS level. I could go further and say it would be nice if it could be done at the hardware level. If disk drives could manage this by themselves it would be great. A write would get immediately sent to the disk, and it would cache as needed but never more than it could write with stored power after feed power fails.

      --
      Sig Battery depleted. Reverting to safe mode.
    95. Re:Bull by fatp · · Score: 1

      I suppose KDE / Linux should add a "Last Known Good Configuration" option when startup?

    96. Re:Bull by poliopteragriseoapte · · Score: 2, Insightful

      I think it is a brilliant idea to write less frequently to disk, and even 10 mins would not be bad. Much easier on power consumption, the drive can spin down, less wear on flash memory, etc. Coders who forget to positively flush critical data are just asking for problems. After all, what is the difference between 5 secs and 120 secs? Just 24 times. And if disaster can strike with high probability p, then p/24 is not much better.

    97. Re:Bull by fl1ckmasterflex · · Score: 0

      You should go read the bug again. If applications keep on re-writing the same file again and again, they will lose data. Here it is for your benefit...

      "So the difference between 5 seconds and 60 seconds (the normal time if you're writing huge data sets) isn't *that* big, but for certain crappy applications that apparently write huge numbers of small files in users' home directories. This appears to be the case for both GNOME and KDE. Since these applications are rewriting existing files, and are apparently doing so *frequently*, the chances that files will be lost is high."

      https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45

      And calm down !!

    98. Re:Bull by russotto · · Score: 3, Insightful

      It is a KDE issue. Only userland knows which data is critical.

      Data that a userland application WRITES TO DISK is critical. If the filesystem takes its sweet time about actually doing the write, it's not the application's fault. And no, calling fsync() or fdatasync() constantly is no good, because that really does make your performance poor.

    99. Re:Bull by amirulbahr · · Score: 2, Informative

      They are referring to the case when the system isn't shut down cleanly. This means a kernel crash or a power outage. What is your point exactly? Seriously, and I really am doing my best to hold back on the personal insults (even when you say something as annoying as "And calm down !!"), what is so difficult that you fail to comprehend what the real issue being discussed here is?

    100. Re:Bull by DigiShaman · · Score: 4, Insightful

      Apparently, Microsoft and Intel don't think so. You can enable write-caching in both the device manager (volume) and Intel's Matrix Storage Manager (RAID), but they will both provide respective warnings about doing so when not connected to a UPS.

      Granted, write-back caching is independent of the file system in use. However, both are based on the idea of "writing" the data but not committing it until later. It's a trade-off that can be put on a sliding scale. The more often you commit the data, the lower the chance of data loss, at the expense of performance. The less often you commit, the greater the chance of data loss, but the better your performance. The key is finding the balance that suits your needs.

      --
      Life is not for the lazy.
    101. Re:Bull by netcrusher88 · · Score: 1

      Actually, Active Directory is a REALLY nice configuration frontend for LDAP and Kerberos, among others. Of course, it uses a nonstandard schema and is a pain in the ass to integrate with because of it, but that doesn't change the fact that AD is nice to use, and is in fact a rather good implementation.

      --
      There's an old saying that says pretty much whatever you want it to.
    102. Re:Bull by Skapare · · Score: 1

      This algorithm is effectively to queue up data and let a trigger from a clock decide when to unleash it to the hardware drive. A better way would involve the same queuing, but let the drive going into an idle state trigger unleashing it. Also, don't unleash it all at once; just unleash a bunch of writes that are close to each other. If the drive is idle when a process does write() then the drive starts immediately for that write. In the meantime the process continues unblocked. When the drive is done, if there are any more writes now queued from that process or any others, select something for writing, using a good selection algorithm (elevator, for example), and proceed to write it and keep the drive busy. By the time the clock would have triggered 45 to 120 seconds later, many if not most of the writes that processes issued via write() calls are already done.
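
      To make the "elevator" selection concrete, here is a toy sketch. It assumes a single upward sweep from the current head position followed by a sweep back down, and it treats block numbers as if they mapped directly to seek distance. Both are simplifications, and the function names are invented for the example.

```python
def elevator_order(pending_blocks, head):
    """Order pending block numbers: sweep up from head, then back down."""
    upper = sorted(b for b in pending_blocks if b >= head)
    lower = sorted((b for b in pending_blocks if b < head), reverse=True)
    return upper + lower


def total_seek_distance(order, head):
    """Sum the head movement needed to service blocks in the given order."""
    distance = 0
    for block in order:
        distance += abs(block - head)
        head = block
    return distance


# Hypothetical queue of pending writes, with the head parked at block 53.
queue = [98, 183, 37, 122, 14, 124, 65, 67]
fifo_cost = total_seek_distance(queue, 53)
elevator_cost = total_seek_distance(elevator_order(queue, 53), 53)
```

      With this particular queue the elevator order travels less than half as far as servicing the writes in arrival order, which is the whole point of letting a few writes pile up before choosing what to send next.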

      Additionally, there need to be limits on the number of write pages queued per device and queued overall. If processes are writing a lot, beyond a certain point, there is little if any gain from massive queuing. Let the writes be blocked for lack of buffers and unblock as soon as a buffer is available (in the order blocked). If the write queue is too large, these small queuing gains will be exceeded by I/O demands forced elsewhere due to memory being taken for use in write queuing. If you force out too much read cache or virtual memory mapping, you drive those I/O rates up in excess of the small gains of massive write queuing.

      --
      now we need to go OSS in diesel cars
    103. Re:Bull by Anonymous Coward · · Score: 0

      The fact that it is frequent means that it is opening up more 60 second windows of opportunity for data loss in event of a crash. Merely rewriting without a crash is not a problem. With the other system, the window was 5 sec so a crash was less likely to occur in that interval. Get it?

    104. Re:Bull by Anonymous Coward · · Score: 0

      Congratulations - this must be the most uninformed post about a subject I have ever seen on /. - completely brilliant. And no, I am not new here and I have been writing software such as filesystems for 17 years.

    105. Re:Bull by mysidia · · Score: 1

      If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.

      Ext3 was set to write 5 seconds later by default. I and many other Linux users tune that to a higher value for performance reasons. 5 seconds is just way too short.

      If your system loses power while or shortly after data is being written, there will almost always be data loss, anyways. The real bug is the system wasn't on a UPS.

      Journalling at least keeps the filesystem crash-consistent, so there is not corruption severe enough to break the OS install (unless you were updating the system during the crash, as Journalling doesn't really guarantee application transactional integrity).

      As for a bunch of small files... well, if you were silly enough to truncate a bunch of small files and then write them, you still haven't lost very much data.

      The bug is in the Application failing to anticipate that truncating and rewriting a bunch of files at once is extremely dangerous.

      This is true, even if they were large files. You could truncate a file, the filesystem commits the transaction, and power goes out just while you were about to write back the new file contents....
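
      The window can be shown with a small simulation. It fakes the "crash" by simply returning between the truncate and the write, which is only a stand-in for a real power failure; the function name and the file name kderc are made up for the example.

```python
import os
import tempfile


def rewrite_in_place(path, data, crash_before_write=False):
    """Rewrite path the risky way: truncate first, then write the new bytes."""
    f = open(path, "wb")  # opens with truncation: the old contents are gone here
    try:
        if crash_before_write:
            return        # simulated crash: the new data never gets written
        f.write(data)
    finally:
        f.close()


# Hypothetical config file that already holds something the user cares about.
path = os.path.join(tempfile.mkdtemp(), "kderc")
with open(path, "wb") as f:
    f.write(b"important settings\n")

rewrite_in_place(path, b"new settings\n", crash_before_write=True)
size_after_crash = os.path.getsize(path)  # 0: neither old nor new data survives
```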

    106. Re:Bull by mysidia · · Score: 1

      Its ridiculous to move what should be a low-level data integrity function out of the File System and inflict it on user-land code.

      It already IS a burden on user land code; it's purely an act of imagination that the filesystem is somehow supposed to commit things to disk in a short timeframe. It isn't: the "duration to write" IS a function of the amount of cache available for improving write performance.

      Including Disk cache. By now applications should know better than making a bunch of truncates, then small writes with no backups, and no sync()'ing. It's a recipe for disaster, no matter what filesystem you use, be it UFS, FFS, XFS, EXT, NTFS, etc...

      If you do a bunch of critical writes, you better sync them when you're done, because the filesystem is not obligated to commit those writes in a reasonable time frame.

      This makes sense given that 90% of small writes are not time critical or that important. When you're just appending a file, for example a log file (typical case of a small write), or creating a temporary scratch file for text, a few lost log entries is probably not a huge issue (if it was, you would configure syslog to sync).

      This enables vastly better system performance, and as always, applications SYNC, when the write needs to be done to ensure data integrity.

      The delay is worth the performance gain. Applications like KDE need to fix their mishandling of important data, properly back up critical files before truncating them, AND sync() changes to critical files when done.

    107. Re:Bull by mysidia · · Score: 1

      A lot of these settings are tunable, if you look up the sysctls and tune2fs options.

      You can in general change commit delay and periodic sync options, or at least you could in ext3.

    108. Re:Bull by QuoteMstr · · Score: 1

      If I, the application, write this new file and then rename it over the old, and a crash happens sometime after that sequence, then upon reboot the filesystem will either have the old file, the old file and the new file or the old file replaced by the new file.

      Exactly! That's a conceptually elegant guarantee, and I think it's worth tweaking the filesystem to ensure it works. Look: open-write-fsync-close-rename is not merely the "correct" way to spell open-write-close-rename. The sequences mean entirely different things:

      1. open-write-close-rename is asking that either the old or the new version should be in place at any one time, but also saying it's not important right now that the change happen right away. In database terminology, this sequence asks for atomicity without durability.
      2. open-write-fsync-close-rename is asking for durability as well as atomicity, which is a much stronger guarantee. The filesystem has no way of knowing that the rename is coming after it sees fsync-close, so of course it needs to write the new data right away.

      You can't simply say "add an fsync!". That converts a weaker request into a stronger request, and results in an absofuckinglutely stupid performance hit for no good reason. Sequence 1 is perfectly reasonable, and it's the filesystem's job to ensure it works as intended.

    109. Re:Bull by dirtyhippie · · Score: 1

      That's straight up wrong. If nothing is written to disk, including the journal, how would the file change to size zero? Read the article again -- the journal *is* written before the data. Usually that's not a problem, but with this particular idiom (which by the way predates linux by a long shot), it's a big'un.

    110. Re:Bull by QuoteMstr · · Score: 1

      Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect

      First of all, those are C buffered IO functions and have NOTHING TO DO WITH POSIX. They have buffer issues of their own. You meant write(2) and close(2), so let's talk about those. It's true that POSIX doesn't require IO to hit the disk until the system sees an fsync(2) or sync(2) call. However, it is perfectly reasonable to ask for atomic filesystem updates, and this feature can be accommodated in the existing POSIX API. There's no good reason a filesystem can't guarantee that the data blocks of a file renamed on top of another file are written before the rename is. Filesystems can implement this guarantee far less expensively than fsync(2). That's how the atomic rename facility was meant to be used, and it's a very elegant conceptual mechanism for asking the filesystem for atomicity without durability.

      In short, requiring an fsync before an atomic rename removes an important word from the vocabulary of application developers, and ultimately does more harm than good. Performance will be horrid, and for no good reason.

    111. Re:Bull by mysidia · · Score: 2, Informative

      5 seconds might reduce the probability of problems, but it doesn't make the assumption a non-bug.

      That's like saying if my code has a buffer overflow in it, but if it's only by 5 bytes, everything's ok, whereas if it's by 150 bytes, I should panic...

      One way to test if your argument makes sense is to extend it to absurdity.

      And the result has absolutely no bearing on the issue. Extending 5 seconds to infinity is nothing like extending 5 seconds to 150 seconds.

      If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.

      On some systems you may actually find this to be the case. On certain kernels, certain hard drives had a write cache, and sync() would not force the drive itself to flush its own cache; data could sit in there for minutes, to be lost in the event of an untimely power failure.

      Most applications handle this reasonably; maintain transactional integrity, and sync() when it is critical that a write finish on a timely basis, and in event of a crash, revert to the last 'good' state.

      Transactional database software like PostgreSQL is exceptional at this, and it does use sync.

      If you have a lot of critical data, the right place to put it is in a DBM, that will handle and manage syncing correctly and optimally for the OS.

      If you have small amounts of critical data, then you write them to flatfiles, and sync. The small size of the files, and the small number of writes you do to them will make performance a non-issue.

      Maintaining integrity of critical data requires a lot more than a good filesystem, and the ability to ensure data is sync'ed to disk.

      Because even 5 seconds is non-zero, which is all the time in the world, if you leave the files on disk such that they would be corrupt or inconsistent (should the system crash at that moment)

      Filesystems don't and never did totally relieve application developers of having to worry about what might (or might not) be written to disk by the OS.

      Certainly it's unreasonable they make particular assumptions about the exact nature of the duration it takes, since there are so many filesystems available, including some unusual ones like NFS.

      (void)sleep(5); after a write is not, and never was a substitute for fsync(); for assuring data is written before writing more.

    112. Re:Bull by pchan- · · Score: 1

      FINALLY! Thank you for cataloging Jane's endless stupidity in this thread. How someone modded her up is beyond me.

    113. Re:Bull by mysidia · · Score: 1

      You may have a 3Gb/s SATA bus, but that doesn't mean your hard drive will accept 3Gb/s of writes.

      In fact, every disk command burns CPU time on the host itself, as well as on the disk controller.

      If you are writing synchronously, the OS has to make a decision for every write, for each individual sector, regarding what sector on the disk to write it to (instead of making a bulk decision for this series of writes, regarding which sector ranges to use).

      And very likely there will be a seek.

      When you are running multiple apps, there will be either a very large number of seeks, or terrible disk fragmentation.

      The resulting throughput will be horrible, even though your drive is 10,000 RPMs, it's really not all that fast.

      Even with delayed writes, I don't see SATA2 drives exceeding 30 megabytes/sec, once you turn off write caching on the drive (with write caching, maybe 50 mb/s, but that's not synchronous).

      Switching to synchronous writes at both hardware and OS level will cut your write speed to about 10 megabytes/sec for a well-balanced load that includes database work and a fair number of non-linear writes to random sectors.

      That might be fast enough for all your needs, but some people want faster speeds out of their 10k RPM SATA/SAS drives, or they have more demanding workloads than Vista and Solitaire....

    114. Re:Bull by suckmysav · · Score: 1

      Well, firstly, I would think that the users of those applications would think "goddamn this app is slow, maybe I'll go look for an alternative"

      --
      "You can't fight in here, this is the war room!"
    115. Re:Bull by Anonymous Coward · · Score: 0

      I can do nothing but agree.

      It is weird: we are talking about computer technology, yet in almost every computer-related discussion an analogy gets used (a car, or anything else), as if we had no technical terms to explain things, or as if computer science were some kind of "black magic" that has to be discovered all over again every time we start talking about it.

      It is one thing that you need to explain computer technology to normal users (an advanced example of this is that the Linux kernel is the OS and nothing else is included in it), but it is another that when you are talking with so-called experts, they use analogies too.

      I do not know any other science area where analogies are used as much, especially between experts or advanced users.

    116. Re:Bull by kripkenstein · · Score: 1

      With EXT3 the risk window was 5 seconds. Now its 150 seconds.

      Its ridiculous to move what should be a low-level data integrity function out of the File System and inflict it on user-land code.

      But even 5 seconds of risk is unacceptable! Ext4 makes the problem much more apparent, but it was there all along. Userspace apps should be written to not have even a 5 second risk window.

    117. Re:Bull by QuoteMstr · · Score: 1

      Userspace apps should be written to not have even a 5 second risk window.

      No, filesystems shouldn't force the application developer to choose between unsafe and unusable. open-write-close-rename is a perfectly reasonable way for an application developer to ask for atomicity without durability, and it's the filesystem's job to do what application developers actually want. fsync here asks for something entirely different and imposes a stupefying performance penalty. No, it's the filesystem's job to give this sequence of operations atomic properties because that's what the application developer obviously requested.

    118. Re:Bull by kripkenstein · · Score: 1

      So you're asking for all open-write-close operations to do an fsync at the end? Or only if a rename is done?

      If the former, then that's not very efficient. If the latter, then I'd rather write an fsync than a rename, not just because it works but because it's less hackish (the rename method assumes that a rename will 'trick' the OS into writing to disk so there is something to rename).

    119. Re:Bull by QuoteMstr · · Score: 1

      So you're asking for all open-write-close operations to do an fsync at the end?

      Of course not. That's silly.

      Or only if a rename is done?

      You're getting closer. Please try to follow. The difference between what you're thinking and what I'm saying is subtle.

      What I'm demanding is that when renaming file B over file A, file B's data blocks be written to disk before the record of the rename itself. That way, if the system is shut down at any point, the restarted system sees under the name A either the original contents of A or the complete contents of B.

      That is not the same as asking for a sync to disk on every rename. It's asking for the operating system to introduce a dependency on the relative ordering of data block writes and renames.

      Say you perform an open-write-close-rename operation, then sleep. The operating system might write out the changes to disk after five seconds, after 120 seconds, or after six months, but no matter when the changes are written, the data blocks for the opened file come first, and only after these blocks are written should the record of the rename itself be written. You could write the data blocks after three seconds and the rename operation after six months, and it wouldn't matter: so long as the relative order is preserved, a restarted system sees either the original contents of the file or the complete new contents.
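
      The ordering argument can be checked with a toy model. This is purely an illustration with invented names: it treats "B's data blocks reach the disk" and "the rename record reaches the disk" as two indivisible steps and asks what the name A holds if the machine dies after each prefix of a given writeback order. It is not a description of how ext3, ext4 or XFS work internally.

```python
def crash_states(writeback_order):
    """Return what name 'A' holds after a crash at each point in the order.

    writeback_order lists the steps "data" and "rename" in the order they
    reach the disk. Index k of the result is the state seen after k steps.
    """
    outcomes = []
    for applied in range(len(writeback_order) + 1):
        # On-disk state before the update: name A points at the old inode,
        # and the new inode exists but has no data blocks yet.
        names = {"A": "old_inode"}
        data = {"old_inode": b"old contents", "new_inode": b""}
        for step in writeback_order[:applied]:
            if step == "data":      # B's data blocks reach the disk
                data["new_inode"] = b"new contents"
            elif step == "rename":  # the rename record reaches the disk
                names["A"] = "new_inode"
        outcomes.append(data[names["A"]])
    return outcomes


data_first = crash_states(["data", "rename"])
rename_first = crash_states(["rename", "data"])
```

      With the data written first, every crash point leaves A holding either the old or the complete new contents. With the rename written first, there is a crash point at which A is an empty file, which is the failure people are reporting.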

      What you're imagining, which is simply forcing rename to sync to disk, would satisfy the constraints above, true. But it's certainly not the best way to satisfy them. In fact, sync-on-rename would perform as badly as the fsync approach I'm railing against. The filesystem can do a better job than an application armed with fsync can, and dammit, they should. "Applications should use fsync" is not an excuse for breaking atomic rename.

    120. Re:Bull by Hal_Porter · · Score: 1

      The filesystem is FINE but could do with a small tweak, even though this is not in any way a bug

      Fixed that for you. These filesystems types are notoriously thin skinned.

      Look at his response.

      Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data --- even though POSIX never really made any such guaranteed. (We could be snide and point out that they should have been shocked and angry about crappy proprietary, binary-only drivers that no one but the manufacturer can debug, or angry at themselves for not installing a UPS, but that's not helpful; expectations are expectations, and it's hard to get people to change those expectations, even when they aren't good for themselves or the environment --- such as Americans living in exburgs driving SUV's getting shocked and angry when gasoline hit $4/gallon, and their 90 minute daily commute started getting expensive. :-)

      Not only do Americans have unrealistic expectations that data passed to a write() be on disk, they also have unrealistic economic expectations caused by a severe moral turpitude.

      Let's all be careful when talking to Ted. The Linux community can ill afford another Reiser case.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    121. Re:Bull by cryptoluddite · · Score: 1

      The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync.

      You forgot to sync the folder containing the file...

      It's kind of retarded that without gratuitous syncs you can write a file, rename it over the original, crash, and end up with a zero-byte file and deleted original. That's what happens when metadata and content data are not 'in sync', so to speak.

      If the program finishes writing then renames the file overtop of the old file version there should never be a problem, even if the application does not ever sync anything (should, not won't). The filesystem should guarantee that a crash at any time during "cat >new; mv new old" results in either just the original old, old and new, or new renamed to old. In any case, 'old' should be data-complete regardless of any syncs the program does or doesn't do.

      The application doing a sync all the time is just throwing a monkey wrench into the works. On ZFS you might get lots of tiny records causing overhead. On some other FS you might cause it to need to write lots of other data first. At least in this case, and for trunc(0), the application shouldn't be messing with the FS.. the FS should do the right thing.
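      For reference, the "create a new file, sync, rename, sync the folder" sequence quoted at the top of this comment might look roughly like the following in Python. This is only a sketch: the function name replace_file_durably and the ".tmp-" prefix are made up for illustration, and fsync on a directory descriptor is assumed to work the way it does on Linux.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes): write new, sync, rename, sync directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new file next to the old one so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push the data to disk
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the containing directory so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

      Note that mkstemp creates the new file with owner-only permissions, so a real implementation would probably want to copy the old file's mode as well.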

    122. Re:Bull by binford2k · · Score: 1

      Because Jane Q. Public knows so much more about filesystem design than Ted T'so.

      http://en.wikipedia.org/wiki/Theodore_T'so

    123. Re:Bull by cryptoluddite · · Score: 1

      POSIX tells you what you can expect from file system calls. Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect.

      But you should expect that a rename that deletes another file doesn't get saved onto the drive before the data is saved. Why? Because the most important job of any filesystem is not losing data. And that kind of data loss is completely preventable with little (if any) performance cost.

      POSIX formalizes a 30+ year-old view of the filesystem. Just because it says you have to use fsync several times in order to safely get the data saved doesn't mean we can't or shouldn't improve on that.

    124. Re:Bull by Anonymous Coward · · Score: 0
    125. Re:Bull by QuoteMstr · · Score: 1

      What if the FS NEVER wrote anything until a fsync was called? All applications would then have to add these calls.

      Right. Those who demand we use fsync for atomic rename are telling us to ask for a guarantee we don't need.

      Here's a rule of thumb: if you're calling fsync because you need to ensure data has been preserved, you're using fsync correctly. If you're calling fsync to preserve integrity even when it's not critical to save your data, you're abusing fsync and the filesystem you're using is broken.

      open-write-close-rename already asks for atomic but asynchronous rename under all sane systems. XFS and ext4 break that perfectly sane sequence of operations. These filesystems are doing the equivalent of replacing your bicycle's tires with cats, and then telling you to buy a Mack truck when you complain.

    126. Re:Bull by QuoteMstr · · Score: 1, Insightful

      Or to be more precise, POSIX lays out the bare fucking minimum for a half-sane system. It's a set of requirements, not a golden holy tome!

      This is Slashdot, so here's a car analogy. POSIX is the law that says what's street-legal. A car needs two headlights, two tail-lights, emissions below a certain point, and so on. Both a base-model Chevy Aveo and a Ferrari are street legal, but I'd rather drive the Ferrari. Why? Because it makes guarantees that go beyond street legality.

      Now say you drove a Ferrari and the air conditioning malfunctioned. Imagine how angry you'd be if the Ferrari dealership said, "nope, sorry. We're not going to fix this. Your car is still street legal, so you should have just gotten used to driving without the air conditioning."

      You know what I'd say? Fuck you.

      You can guess what I'm saying about filesystems that break the perfectly reasonable open-write-close-rename sequence.

    127. Re:Bull by gzipped_tar · · Score: 1

      You are right about the Emacs gotcha. This is because the default behavior of C-x C-s is to save the modified file to another file (read: another inode number) and rename the old filename to something like "foo~" as a backup. The new file has a new inode number, thus the old ACL imposed on the old file doesn't apply (but still applies to the backup file). Not only ACLs but also a great number of other file management operations have to work around it.

      However, usually admins seldom impose ACLs on a single file. Well-designed directory ACLs are what is expected to be used for this kind of situation.

      Of course, one can change Emacs' behavior by using C-u 0 C-x C-s so that no backup is created and the modifications are written in-place (really in-place).
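      The inode difference is easy to see outside Emacs too. A small Python sketch, assuming a POSIX filesystem; save_by_rename and save_in_place are hypothetical helper names, and this only models the two saving strategies, not what Emacs does internally:

```python
import os
import tempfile


def save_by_rename(path, text):
    """Write a new file and rename it over `path`; the name ends up on a new inode."""
    tmp_path = path + ".new"  # hypothetical temporary name
    with open(tmp_path, "w") as f:
        f.write(text)
    os.replace(tmp_path, path)


def save_in_place(path, text):
    """Truncate and rewrite the existing file; the inode is kept."""
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo")
    save_in_place(path, "v1")
    inode_before = os.stat(path).st_ino

    save_in_place(path, "v2")
    kept_inode = os.stat(path).st_ino == inode_before

    save_by_rename(path, "v3")
    changed_inode = os.stat(path).st_ino != inode_before
```

      Anything attached to the inode rather than the name (per-file ACLs, hard links) stays with the old inode after save_by_rename.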

      --
      Colorless green Cthulhu waits dreaming furiously.
    128. Re:Bull by kripkenstein · · Score: 1

      I think I see your point. But isn't it a bit odd to do a rename operation for this purpose? It isn't what renames are for.

    129. Re:Bull by QuoteMstr · · Score: 1

      But isn't it a bit odd to do a rename operation for this purpose? It isn't what renames are for.

      Yes and no. Atomic rename has been part of Unixish systems for a very, very long time. Everyone has always used it to implement changes that need to be atomic, and for the most part, atomic rename has been all that's been needed for that purpose.

      I'd also like to see a fbarrier system call that would preserve relative ordering of writes within a single file, but it's not a critical piece of missing functionality. In most of the situations I imagine you'd want to use fbarrier, you'd really rather be using a database engine anyway, and database engines implement the moral equivalent to fbarrier internally.

      Vista's Transactional NTFS is also interesting, and I wish we had something like it in unixland. But on the other hand, it's not absolutely essential in the way atomic rename is.

    130. Re:Bull by Hal_Porter · · Score: 1

      Actually that's not true

      http://www.ddj.com/database/184416281

      To maintain the integrity of the registry hives, the operating system performs transaction logging. For all hives except the system hive, when a change is made to that registry hive, the change is written to the corresponding log file. When that log file is flushed to the disk, the first sector of the log file is marked to indicate that a registry change is in progress. After the flush, the changes are written to the actual hive also. If the transaction succeeds, the hive and log file are marked to indicate the transaction successfully completed. If the machine should crash during the transaction, on the next boot, the operating system would detect an incomplete transaction (log file still marked) and perform a recovery by restoring the previous values stored in the log file. Transaction logging is also performed on the system file, but the operating system uses a slightly different process. The .alt files are complete backup copies of the corresponding hive file. The operating system only keeps a complete backup copy of the crucial system hive, so the only .alt file you'll find is the system.alt.

      I.e. the registry has its own transaction system. It has to be this way because at least as far back as XP it was possible to boot off a FAT32 drive, so it was impossible to only use NTFS transactions.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    131. Re:Bull by rastos1 · · Score: 1

      The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync.

      And now imagine that the original file had hard-links. Your suggestion breaks them.

      (as in: why the fuck are you rewriting a file with hardlinks in that manner? Use symlinks if you want "follow the changes" and hardlinks if you want "copy on write", easy)

      1) I don't check the number of hardlinks on a file I'm going to write to. Perhaps I should. Do you?
      2) if the file you are going to update is a symlink, then your suggested action "rename the new file on top of the old one" means removing the old file (the symlink) and changing the name of the file to the name that the symlink had - the result: symlink is gone.
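      Both cases can be reproduced in a few lines. A Python sketch, assuming a POSIX system where os.link and os.symlink are available; all file names here are made up:

```python
import os
import tempfile


def write_text(path, text):
    with open(path, "w") as f:
        f.write(text)


def read_text(path):
    with open(path) as f:
        return f.read()


with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "settings")
    hard = os.path.join(d, "settings.hardlink")
    soft = os.path.join(d, "settings.symlink")
    write_text(original, "old")
    os.link(original, hard)
    os.symlink(original, soft)
    links_before = os.stat(original).st_nlink

    # Case 1: rename a new file over a name that has a second hard link.
    write_text(original + ".new", "new")
    os.replace(original + ".new", original)
    hardlink_content = read_text(hard)  # the other link still sees the old inode
    links_after = os.stat(original).st_nlink

    # Case 2: rename a new file over a symlink; the link itself is replaced.
    write_text(soft + ".new", "newer")
    os.replace(soft + ".new", soft)
    symlink_survived = os.path.islink(soft)
    original_content = read_text(original)  # the former target is untouched
```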

    132. Re:Bull by kripkenstein · · Score: 1

      Ok, if atomic rename is a *NIX tradition, then I guess that might be justification for using it, and for resolving this issue in the way you suggest.

      Personally, it seems 'cleaner' to me to do fsync on files that I absolutely want to get written to disk, especially since it's portable over OSes to some extent - e.g., Python's os.fsync() calls whatever Windows function is necessary to ensure writing to disk. So that's how I write my own code.

    133. Re:Bull by BZ · · Score: 1

      > Nothing is written for periods between 45-120 seconds

      Actually, the article (if you read carefully) says that the major "OMG all my files are truncated" issues arise when file renames are committed to disk before the data is written to said files, so that this coding pattern:

      1) Open a new file x.new
      2) Write your data to x.new
      3) mv x.new x

      leads to a zero-length file x because the mv is committed before the write and then the crash leaves a 0-length file.

      No comment on whether the rename commit has to do anything with the journal; I don't know enough about ext4 to say. ;)

    134. Re:Bull by QuoteMstr · · Score: 1

      Personally, it seems 'cleaner' to me to do fsync on files that I absolutely want to get written to disk, especially since it's portable over OSes to some extent

      There's a big difference between ensuring relative ordering and asking for a disk sync. Relative ordering corresponds to atomicity, a database property. Pure disk-syncing is durability, a different database property.

      Under traditional Unix filesystem semantics, you have three options for a filesystem operation:

      • atomic and durable: open-write-fsync-close-rename
      • atomic, but not durable transactions: open-write-close-rename
      • durable, but not atomic: open-write-fsync-close
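      As a rough Python sketch of those three options (function names and the ".new" suffix are hypothetical; the durable variants sync explicitly, and the atomic-only variant relies on the filesystem ordering the data ahead of the rename, which is the very thing being argued about here):

```python
import os


def write_atomic_and_durable(path, data):
    """open-write-fsync-close-rename: old or new contents, and on disk on return."""
    tmp_path = path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def write_atomic_only(path, data):
    """open-write-close-rename: old or new contents, written whenever the FS gets to it."""
    tmp_path = path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)


def write_durable_only(path, data):
    """Overwrite in place and sync: on disk on return, but a crash mid-write can leave a partial file."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
```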

      As it turns out, Windows without Transactional NTFS only gives you one option: durable, but not atomic. (Where you open the file in synchronous mode and just write to it.)

      Windows does not support an atomic rename. If you have files A and B, in order to rename A to B, you need to delete B first. If the system crashes between the deletion of B and the renaming of A to B, you end up without a B at all!

      Only you can tell whether you need durability, atomicity, or neither in your application.

      For example, if you're writing a mail server, you need both: you need to tell the mail program on the other end that you received a message and stored it securely, so you need to know that the message hit the disk. Thus, you need a durable transaction. It won't do to have partial messages lying around, so you also need an atomic transaction*.

      If you're writing something like KDE's configuration stuff, it really doesn't matter whether you end up with the old configuration file or the new one. In the worst case, the user loses a setting he just made. So you don't need durability. However, you can't have corrupt configuration files lying around: the desktop environment might not come up at all then. So you do need atomicity.

      If you're writing a compiler, the user can just re-run the compilation if the machine crashes. Since the output can be regenerated at any time, you don't need durability. You'll never have multiple compile jobs writing to the same output at the same time anyway, so you don't need atomicity either. Therefore, a compiler can just open its output directly without fancy rename tricks.

      If you're writing a log recorder, you want to make sure log messages hit the disk. But since your program controls all access to the log file, you don't need atomicity from the filesystem. You can just open the logfile and write, syncing to it after each entry. That way, you get logs up to the instant before the machine crashes. (This is precisely what syslog does by default.)
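      The log recorder case could be sketched like this in Python. The name append_log_entry is hypothetical, and this is just the "write, then sync after each entry" idea, not any real logger's code:

```python
import os


def append_log_entry(log_file, message):
    """Append one line to an already-open log file and force it to disk."""
    log_file.write(message.rstrip("\n") + "\n")
    log_file.flush()             # Python buffer -> kernel
    os.fsync(log_file.fileno())  # kernel -> disk
```

      The file would be opened once in append mode, e.g. open("app.log", "a"), and every entry pays for a sync, which is the price of keeping logs up to the instant of the crash.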

      What atomic rename under Unixish systems gives you is the ability to have atomicity. Without atomic rename, we can only achieve atomicity by using a locking system that's not part of the filesystem itself, which is a whole other level of pain and complexity.

      I don't know what you're writing in Python. But under Windows, all that fsync is giving you is durability. It can't give you atomicity. Are you sure you really need it?

      * (Which you get for free with maildir, incidentally. But if you're delivering to an mbox file that contains multiple messages, you need to lock the mbox file to make the message delivery atomic. One popular method of locking is to use atomic renaming...)

    135. Re:Bull by kripkenstein · · Score: 1

      Thanks for the detailed explanation! In particular I wasn't fully aware of how Windows does this stuff.

      Durability is my prime concern in the Python example I mentioned. For atomicity I generally would use something like SQLite anyhow (assuming the performance penalty is acceptable, at least).

    136. Re:Bull by gnasher719 · · Score: 1

      They are referring to the case when the system isn't shut down cleanly. This means a kernel crash or a power outage. What is your point exactly? Seriously, and I really am doing my best to hold back on the personal insults (even when you say something as annoying as "And calm down !!"), what is so difficult that you fail to comprehend what the real issue being discussed here is? The real issue is that the same code and the same external event lead to data loss when using ext4, and don't lead to data loss when using ext3. Reported as a bug often enough that it is likely to affect any user eventually, and developers trying to explain why it's not their fault instead of trying to fix the problem.

      I'd say the ext4 developers are 100 percent correct, and I would recommend to anyone not to use their product.

    137. Re:Bull by Bronster · · Score: 1

      No, but then I don't tend to write applications that write files that I don't "own".

      So - if I own the space where the file is going, then I figure that people who want a pointer to the most recent copy can use symlinks.

      2) what?? No way. You resolve the symlink, and rename the target over itself. Assuming that you have permission to do that, otherwise you replace the symlink with the new file, and it's no longer a symlink, assuming your goal is to have the new file appear in the place.
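      In Python terms, "resolve the symlink and rename over the target" might be sketched as below. The helper name replace_symlink_target is made up, and it assumes you have write permission in the target's directory:

```python
import os


def replace_symlink_target(path, data):
    """Replace the file that `path` ultimately points at, leaving any symlink in place."""
    real_path = os.path.realpath(path)  # follow symlinks to the actual file
    tmp_path = real_path + ".new"       # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, real_path)     # rename over the target, not over the link
    return real_path
```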

      Or - you fdatasync.

      Or - you append the new data to the file rather than truncate, and finally update a record at the start of the file saying "valid content starts here and ends here" - eventually you have enough stale blocks at the start that you can overwrite them instead. Gosh, looks like we've invented a database format.

      POSIX has no "atomic truncate and write new content to the same inode" operation. Denying reality and wishing it weren't so is a recipe for failure, the only difference is how often you see that failure.

    138. Re:Bull by gnasher719 · · Score: 1

      That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.

      The write delay is not the problem. It is only a problem because the file system is fucked up and a crash at the wrong moment messes things up, and the write delay increases the interval that is critical.

      You have to assume that there can be a crash. You have to assume that this crash can happen at any point in time (for example, just before a call to fsync for the fsync fanatics). Apparently KDE opened, truncated, wrote, and closed a few hundred files. The problem isn't that this happened 120 seconds delayed instead of happening immediately, the problem was that the stupid file system managed to do the truncates immediately and the writes 120 seconds later. If the _whole_ operation had been delayed, nothing bad would have happened.

    139. Re:Bull by theapeman · · Score: 1
      But doing the modifications in place is not safe if you might get a system crash (or power failure). What you could do is to create a backup file and then do the write in place. You would need some kind of procedure to restore the backup if the original write did not complete properly.

      I suppose you really want to have a file system primitive which says 'open a temporary file for writing as a new version of this other file'. When you close the temporary file the filesystem would atomically replace the contents of the original file with the temporary file, preserving all attributes such as ACLs (unless you modified them on the temporary version). You could go further and arrange that the temporary file started off with the same contents as the original file - so the entire update sequence became a single transaction.
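      Lacking such a primitive, a userspace approximation is possible. A Python sketch, where the context manager new_version_of is a hypothetical name, only the permission bits are carried over (not ACLs), and the final rename is only as atomic as the underlying filesystem makes it:

```python
import contextlib
import os
import shutil


@contextlib.contextmanager
def new_version_of(path):
    """Yield a writable copy of `path`; swap it in on success, discard it on error."""
    tmp_path = path + ".new"         # hypothetical temporary name
    shutil.copyfile(path, tmp_path)  # start from the original contents
    shutil.copymode(path, tmp_path)  # carry over permission bits (not ACLs)
    f = open(tmp_path, "r+b")
    try:
        yield f
        f.flush()
        os.fsync(f.fileno())
    except BaseException:
        # The update failed: throw the new version away, keep the original.
        f.close()
        os.unlink(tmp_path)
        raise
    f.close()
    os.replace(tmp_path, path)       # the whole update becomes visible at once
```

      Usage would be along the lines of "with new_version_of(p) as f: f.seek(0, os.SEEK_END); f.write(b'...')".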

    140. Re:Bull by Anonymous Coward · · Score: 0

      On[e] way to test if your argument makes sense is to extend it to absurdity.

      Way to go. Arguments never make sense when extended to absurdity.

      What if the FS NEVER wrote anything until a fsync was called?

      What if the FS wrote every single byte as an atomic operation?

      All applications would then have to add these calls

      Not then. Always. When you have program-critical files, that you (1) truncate (2) rewrite (3) not sync , you are asking for disaster to happen. Please read the second link provided by the AP, where Ted Ts'o clearly explains what KDE is doing wrong, and what they should be doing.

    141. Re:Bull by ais523 · · Score: 1

      And always remember to check the return value of the close, under any OS. It's entirely possible for an attempt to close a file to fail if the disk is full. (And yes, there are ways to recover from this; save on a different filesystem, don't delete the backup/original, keep the file in memory so the user can free up space so you can save it.) Although IIRC this wasn't guaranteed, in my experience DOS tended to do nothing at all to a file opened for write until the file was closed.

      --
      (1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
    142. Re:Bull by DrSkwid · · Score: 1

      The truth will out

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    143. Re:Bull by Anonymous Coward · · Score: 1, Informative

      open-write-close-rename already asks for atomic but asynchronous rename under all sane systems

      I'm not sure what you're saying here. Are you arguing that such a sequence should be treated specially by the OS? Why?

      XFS and ext4 break that perfectly sane sequence of operations

      It isn't sane. It's like replacing your tires with the engine running and your kid sitting behind the wheel. Sure it might work 9 out of 10 times, until your kid switches the car into gear.

      KDE (and Gnome) are truncating critical system files without a backup available. How is that sane? Sure they will immediately rewrite the file, but who will guarantee that the system will not crash between the truncate and the write?

      And finally, they aren't doing open-write-close-rename. They're doing truncate-write-close. What they should be doing is create-write-close-sync-rename, i.e. do not overwrite the old config file before the new content is safely stored on disk. And I think the reason that they did not go the correct way (assuming they were aware of the issue) is because the "safe" way sucked performance-wise. Well duh, if you write hundreds of 50-byte files, performance will suck, unless you skip safety protocol.

    144. Re:Bull by Anonymous Coward · · Score: 0

      I'd just wrap writing to a file in a class that when you call close (or when it goes out of scope) closes the file and calls fsync.

      Although to be honest I would have expected a fclose to fsync that file, since you have said "I am finished with this file" to the file system/OS. (assuming no other handles are open to that file)
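      Something like this, as a Python sketch (SyncedFile is a made-up name, and this only covers plain writes):

```python
import os


class SyncedFile:
    """File wrapper whose close() (or with-block exit) flushes and fsyncs first."""

    def __init__(self, path, mode="wb"):
        self._file = open(path, mode)

    def write(self, data):
        return self._file.write(data)

    def close(self):
        if not self._file.closed:
            self._file.flush()             # Python buffer -> kernel
            os.fsync(self._file.fileno())  # kernel -> disk
            self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions from the with-block
```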

      *shrugs*

    145. Re:Bull by Anonymous Coward · · Score: 0

      While Jane Q. isn't fully correct as you pointed, you are dead wrong.

      ext4 is flawed, for two reasons:

      1) The window of opportunity for data loss at crash time has been extended. Hence, the filesystem is less reliable for that very reason.

      2) It looks like the metadata journaling has the side effect of guaranteeing the reallocation, so a crash means 100% certain data loss for every file rewritten

      Asking application writers to change the way they use APIs to take ext4 into account is just denial.

      open() + write() + close() guarantee that the data is accessible to other applications. It does not guarantee that the data survives a crash.

      HOWEVER, unix application writers are NOT expected to do anything more, in particular, they are not expected to use fsync() on each file.

      FURTHERMORE, fsync has performance issues, in particular, on ext3, fsync() will flush ALL the data for the specified file system, resulting in extraordinarily bad performance.

      > WE ARE TALKING ABOUT UNEXPECTED KERNEL CRASHED AND POWER OUTAGES. If you care about that situation then you should get a clue before you start coding.

      No. What does that mean? Every coder of every unix application everywhere should start caring about the fact that ext4 is fragile? You are mixing applications that NEED integrity (databases, where coders WILL have to care about the situation) with applications that don't (Gnome, KDE where coders MUST not care, because that is the job of the OS)

      ext4 is clearly flawed, and even Theodore Ts'o calls the behaviour a "problem" and is working on a "solution". See https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45

    146. Re:Bull by 7+digits · · Score: 1

      Yes, yes, yes, yes, and yes.

      Wish I had mod points today...

    147. Re:Bull by Roman+Mamedov · · Score: 1

      > IOPS went to zero for over 60 seconds. No data in or out to those devices!
      Heh, I regularly have a situation when a couple of desktops are running with Root-over-NFS, and I have to shut down the NFS server.
      ...for like half an hour, or so.
      Nothing crashes from this. Bring up the server, all the desktops instantly unfreeze and continue where they left off. :)

    148. Re:Bull by 7+digits · · Score: 2, Insightful

      > I agree. This is not way to treat startup-critical configuration files.

      This is bull. Most files are critical to someone. This would mean that most processes that write data must use fsync.

      Are you arguing that cp should use fsync for every file it copies ? In that case, you'd better tell the maintainers of coreutils-7.1, because copy_internal (used by cp.c) does not. (And you'll be laughed at)

      So, right, now, on ext4, the sequence:

      > cp /disk1/file1.data /disk2/file1.data

      wait a few seconds

      > rm /disk1/file1.data

      crash

      will probably cause the file to be lost. That you choose to blame it on cp is funny, but most of the rest of the world will blame it on ext4.

    149. Re:Bull by rtz · · Score: 1

      [...] do what application developers actually want. fsync here asks for something entirely different and imposes a stupefying performance penalty.

      Do you realise that what you want is equivalent to having the FS do the fsync for you? If it is "stupefyingly slow" when you do the fsync, why should the fsync that the FS does be any faster?

    150. Re:Bull by cheater512 · · Score: 1

      All my critical stuff where I wouldn't want this bug to occur has a UPS.

      Whether its a laptop with a battery or a proper UPS.

    151. Re:Bull by complete+loony · · Score: 1

      God damn it. It's about time we had application level control of file system transactions, so a commit to the FS must either completely happen or be completely rolled back. With the option of a soft checkpoint, ie the application may or may not care if the changes are actually on disk, provided they are rolled back in the event of a power failure.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    152. Re:Bull by Anonymous Coward · · Score: 0

      There's an O_SYNC flag for open(), Linux also has the O_DIRECT flag, which ensures write only returns after data hits the platter (O_DIRECT has some buffer and offset alignment requirements though).
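      A sketch of the O_SYNC route in Python, for Unix-like systems where os.O_SYNC exists. The helper name write_synchronously is hypothetical, and O_DIRECT is left out here because of the alignment requirements mentioned above:

```python
import os


def write_synchronously(path, data):
    """Write `data` (bytes) through a descriptor opened with O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than asked; loop until all are out.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```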

    153. Re:Bull by vadim_t · · Score: 1

      ReiserFS 4 was going to do that.

      But it's again the same issue. Your program has to correctly use the functionality available so that things work the way you want them to. Even ReiserFS 4 won't save you from the scenario being described here, if the application simply writes a bunch of files and doesn't try to ensure the changes are committed by setting up a transaction.

      It's the same thing in this article's case. The functionality to do what is needed is available, right now. The applications simply aren't using it correctly, because it's one of those things that mostly work even if you do it wrong. To do it right, you have two choices: Implement it following POSIX semantics, which means correctly using fflush and rename, as well as coding application checks to ensure there was enough disk space, the flush actually worked, and so on, or use a simpler abstraction on top like a database.

      But even in the case of the database, the application still has to understand how the DB works and do the right things in the right order, and issue a COMMIT at the right time. If the programmer does it half assed, no filesystem, database or any other system will work right, because they can't read the programmer's mind to figure out what s/he wanted.

    154. Re:Bull by marcosdumay · · Score: 1

      I'd agree that they are 100% correct, but I wouldn't recommend people to not use it, I'd just ignore it. There are lots of applications where it could be reliably used and would give a good performance.

      Too bad the FS name is ext4, which implies that it is fully compatible with the ext family. Maybe a name change would be good, or we need an ext5 already...

    155. Re:Bull by Anonymous Coward · · Score: 0

      open-write-close-rename already asks for atomic but asynchronous rename under all sane systems. XFS and ext4 break that perfectly sane sequence of operations.

      That is incorrect. The only atomic thing in that sequence is the rename. The problem is that the file itself may not be synced at the time of the rename.

      Now, remember that a filesystem has two parts - data and metadata. The rename only cares about metadata, whereas the write is about data. Or, the rename cares about the directory, the write cares about the file content. These are different things, and an operation on the directory does not do anything to the content of the files in that directory.

      To make things worse, we have journalling. The journal records every change to metadata, but (by default) does not record data. Neither does it on ext3. And that is the way it should be, because the journal records transactions, and there are no system calls for data transactions.

      So, what will happen in the case of a power failure? The atomic rename is recorded in the journal, and committed (the commit makes it atomic). Now the system crashes before writing the actual data. What happens? Journal replay. The system walks through the journal, and makes sure that every committed transaction is written to the filesystem. Yes, the rename happens on journal replay. The write does not because it was not journalled.

    156. Re:Bull by marcosdumay · · Score: 1

      Well, if it is that important that your data survives a power outage, you need durability. There is nothing stupid about asking for durability when you need it.

      Now, of course, losing some desktop configurations isn't as huge a loss, but who am I to imply that the article is sensationalist?

    157. Re:Bull by vadim_t · · Score: 1

      No, the problem is that the KDE code was written with the assumption that file truncation + write is an atomic operation, when it actually isn't and never was.

      Even if the FS did write immediately after truncating, you could still have a power failure in the exact moment after the disk finished processing the truncation but hadn't started writing the data yet.

    158. Re:Bull by ciderVisor · · Score: 1

      And I'm running ReiserFS 3 for a while longer. I live in an area with crappy power and
      I've had many power failures on machines with and without UPSs, and I've never lost any
      data.

      You're lucky, then. In my area (Oakland, CA), Reiser just seems to kill things stone dead.

      --
      Squirrel!
    159. Re:Bull by Anonymous Coward · · Score: 0

      rename and fsync are two different things.

      fsync does not give you atomic anything. It only guarantees that once it returns, the data is flushed to disk. Between open and fsync, there is no guarantee.

      rename guarantees atomically replacing one file with another. That is, the file either has the old content, or the new content. However, it only operates on metadata, it does not care about writes at all. If the write to the new file has not been written to disc yet, you are replacing the old file with the new, zero length file.

      In short, you need both. First fsync to guarantee that the data is on disc, and then rename to atomically replace the file.

    160. Re:Bull by marcosdumay · · Score: 1

      Oh, now that is clearer. I've replied to another one of your posts, but I thought you wanted durability...

      Now, I have a question for you. Do NFS and AFS support atomic renames? They don't look like supporting any ordered operation.

    161. Re:Bull by Anonymous Coward · · Score: 0

      Durability is not needed, just that either the old set of configurations or the new set of configurations survive the crash. As it is, the configurations are wiped by the crash.

      What the application really wants is a sort of transaction, with a crash triggering a rollback if the transaction has not been completed before the crash.

      The critics of the filesystem behavior argue that the filesystem should provide that kind of integrity, because application authors expect it. The others argue that the filesystem has never provided and does not provide that kind of guarantee, so the application should handle the problem. The problem with that is that there is no API for specifying a dependency on other operations. The application would need a "make sure A is done before doing B" instruction. That's not the same as "make sure A is done" followed by a "do B", because the former does not require immediate action, but the latter leaves the filesystem no choice: It has to go to the disk.

    162. Re:Bull by davecb · · Score: 1

      An engineer would minimize the time between the two writes, even if the two were delayed for an arbitrary period: it's the time between the first data write and the metadata write which is the period in which a crash would destroy data, not the time before the first write.

      --dave

      --
      davecb@spamcop.net
    163. Re:Bull by everett · · Score: 1

      Mr Cheney? So this is what you've chosen to do in all your free time now? Troll Slashdot?

      --
      Sig withheld to protect the innocent.
    164. Re:Bull by Hurricane78 · · Score: 1

      It's what we do for fun. Who are you, AC nonetheless, to judge us?

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    165. Re:Bull by Hurricane78 · · Score: 1

      [citation needed] ^^

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    166. Re:Bull by QuoteMstr · · Score: 1

      Well duh, if you write hundreds of 50-byte files, performance will suck, unless you skip safety protocol.

      Blame, blame, blame the victim. This is 2009. There's no fucking reason for a modern filesystem to not cope well with hundreds of small files.

      KDE (and Gnome) are truncating critical system files without a backup available. How is that sane? Sure they will immediately rewrite the file, but who will guarantee that the system will not crash between the truncate and the write?

      Under ext4 and XFS, you lose when doing an atomic rename.

      Are you arguing that such a sequence should be treated specially by the OS? Why?

      Yes. It's not that the sequence as a whole should be treated specially: it's that rename should be sane enough to introduce ordering constraints on the source file's data blocks. Why? Because it's useful, application developers have fucking relied on it for years, and the alternative, fsyncing all the time, is a different request that's slow as hell.

    167. Re:Bull by QuoteMstr · · Score: 1

      Do you realise that what you want is equivalent to having the FS do the fsync for you?

      That's not the same at all. How many times do I have to explain the difference between an ordering constraint and an explicit sync?

      rename(A,B) must flush A's data blocks before writing the record of the rename itself. The filesystem does not need to do this when you call rename! It could do it six minutes or six months in the future, as long as the data blocks for B are written before the rename record itself. This property preserves atomicity. Sometimes you want atomicity without durability, and this is how you get it.

      Performance is vastly improved over the fsync-everything model because the filesystem can still batch updates. "Do A and B when you have a chance, but make sure to do A before B" is saying something fundamentally different from "do A now. Do B."
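
      To make the distinction concrete, here is a toy model of a write-back cache that accepts ordering constraints. It is only a sketch: the class name, the method names and the operation labels are all made up for illustration, and no real filesystem exposes an interface like this. Nothing is written when an operation is submitted; the only promise is that an operation never reaches the "disk" before the one it depends on.

```python
# Toy model only: WriteBackCache, submit() and commit() are hypothetical names,
# not a real filesystem or kernel API.

class WriteBackCache:
    def __init__(self):
        self.pending = []   # queued (name, must_follow) pairs, still in memory
        self.on_disk = []   # names in the order they reached the "disk"

    def submit(self, name, must_follow=None):
        """Queue an operation and return immediately; nothing is written yet."""
        self.pending.append((name, must_follow))

    def commit(self, limit=None):
        """Write up to `limit` pending operations, honoring ordering constraints.

        Passing a small `limit` models a crash part-way through a batch.
        """
        budget = len(self.pending) if limit is None else limit
        progress = True
        while budget > 0 and progress:
            progress = False
            # Scan newest-first to show the cache is free to reorder anything
            # that is not pinned by a constraint.
            for op in reversed(self.pending):
                name, dep = op
                if dep is None or dep in self.on_disk:
                    self.on_disk.append(name)
                    self.pending.remove(op)
                    budget -= 1
                    progress = True
                    break


cache = WriteBackCache()
cache.submit("data:baz.new")
cache.submit("rename:baz.new->baz", must_follow="data:baz.new")
cache.commit(limit=1)   # "crash" after a single operation was written
print(cache.on_disk)    # the rename is never on disk without the data
```

      Neither submit() call blocks, so the cache can sit on both operations for as long as it likes. That is the atomicity-without-durability case: a crash may lose the update, but it cannot leave the rename on disk ahead of the data.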

    168. Re:Bull by Simetrical · · Score: 1

      Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.

      RTFA -- that is exactly what programs are doing, and it's still considered broken. ext4 reorders the operations so that the new file's contents can be written to disk after the rename is. That means that there's an interval where the new, empty file has overwritten the old file's contents, before the new file's contents are written. This issue causes loss of old data, not just recent changes. According to Ted Ts'o, this sequence of operations is buggy:

      2.a) open and read file ~/.kde/foo/bar/baz
      2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
      2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
      2.d) close(fd)
      2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

      The correct solution, according to him, is to use fsync() if you don't want the old file to possibly be truncated on filesystem crash. But this leaves no option I can see for programs that want to say "I don't really care if the data gets to disk safely, just make sure you don't overwrite the old file until it does."
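
      For reference, the fsync() variant of steps 2.b to 2.e looks roughly like this in Python, whose os module maps almost one-to-one onto the system calls. This is a minimal sketch: the function name is invented, the ".new" suffix follows the listing above, and error handling is kept to a minimum. Note that it buys durability, which is more than the "just don't clobber the old file" guarantee asked for above.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes) via a temp file, fsync and rename."""
    tmp_path = path + ".new"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop until done.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # The extra step: force the new contents out before the rename.
        os.fsync(fd)
    finally:
        os.close(fd)
    # Atomically swap the new file into place; readers see old or new, not a mix.
    os.rename(tmp_path, path)
```

      Calling replace_file_durably("baz", b"new contents") blocks until the data has been handed to the storage device, which is exactly the cost being argued about in this thread.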

      --
      MediaWiki developer, Total War Center sysadmin
    169. Re:Bull by Simetrical · · Score: 1

      Does anyone else think that 150 seconds is a bit over the top in terms of writing to disk?

      I could understand one or two seconds as you speculate more data might come that needs to be written.

      5 seconds is a bit iffy, as with ext3.

      150 seconds? That's surely a bug.

      No, I don't think so at all. If your system doesn't normally crash, and in most cases it doesn't unless you use buggy binary-only drivers, it's not a significant risk. Losing 2.5 minutes of non-critical data once every couple of years, say, isn't a big deal. Good file editors, databases, and so on all use f*sync(), so you won't lose data from them.

      The problem is when you not only lose a few minutes of changes, but critical old data too (like an entire preferences file). That's what this bug is about.

      --
      MediaWiki developer, Total War Center sysadmin
    170. Re:Bull by QuoteMstr · · Score: 1

      The problem is that the file itself may not be synced at the time of the rename.

      Historically, it has worked that way, and for a good reason: it's sane behavior. That's my whole point. More precisely, the file's data blocks need to be committed before the rename operation itself is committed to disk. That's not saying that the syncing must occur before rename returns.

    171. Re:Bull by Richard_J_N · · Score: 1

      >> That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in
      >> which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk
      >> is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.

      > Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.
      > There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the
      > application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the
      > possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes
      > it less likely, but given enough time it'll happen.

      Thanks for your reply - I stand enlightened about NCQ. What I still don't get is, why not begin a write immediately? Batching makes sense if the system is loaded, but what is the advantage in waiting for a long interval when nothing else is happening?

      Also, if we are going to wait for long times, why not apply some kind of performance-training heuristic (like TCP does with retransmission intervals)?

    172. Re:Bull by Anonymous Coward · · Score: 0

      And your UPS protects against kernel crashes? Where can I get one?

    173. Re:Bull by swilver · · Score: 1

      5 seconds seemed reasonable. 2.5 minutes does not. I have serious doubts that this will have much performance benefit at all. Furthermore, the performance benefits are likely to evaporate when used on SSDs, which are likely to become popular soon.

      Even though you may wish that all applications treated filesystems as a transactional database, the reality is that almost none do, unless they actually are implementing a transactional database.

    174. Re:Bull by vadim_t · · Score: 1

      You'd probably miss chances to optimize.

      For instance, suppose Alice (wants the 9th floor), Bob (10th), Carol (8th) and Dave (15th) get into an elevator.

      A hard disk to my knowledge can't change its mind in the middle of a head movement, so suppose the elevator has a design that once it decides to go to a floor it can't stop anywhere in the middle.

      If Bob is the first to hit the button, and the command is immediate, under this design, the elevator will go to the 10th floor. Meanwhile the rest of the passengers input their choice. Bob gets out, now elevator goes down to the 9th floor, since it's the nearest, then 8th, then 15th.

      But if you had waited a bit for everybody to specify their floor, the optimal route could be calculated: straight up.
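
      Putting numbers on the elevator example, as a rough sketch: the assumption that everybody boards at floor 0 is mine, and the helper names are made up. "Immediate" commits to Bob's floor first and then serves the nearest remaining request; "batched" waits for all four requests and sorts them.

```python
def travel(start, stops):
    """Total number of floors travelled when visiting `stops` in order."""
    total, here = 0, start
    for floor in stops:
        total += abs(floor - here)
        here = floor
    return total


def nearest_first(start, pending):
    """Order requests by repeatedly picking the floor closest to the car."""
    order, here, pending = [], start, list(pending)
    while pending:
        nxt = min(pending, key=lambda f: abs(f - here))
        order.append(nxt)
        pending.remove(nxt)
        here = nxt
    return order


requests = [9, 10, 8, 15]   # Alice, Bob, Carol, Dave

# Bob presses first and the car commits to floor 10 before hearing the rest.
immediate = [10] + nearest_first(10, [9, 8, 15])
# Waiting for every request allows one sweep straight up.
batched = sorted(requests)

print(immediate, travel(0, immediate))   # [10, 9, 8, 15] 19
print(batched, travel(0, batched))       # [8, 9, 10, 15] 15
```

      With these numbers the immediate strategy travels 19 floors against 15 for the batched one. The gap is what a disk scheduler is trying to win back by delaying writes.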

      IMO, the "start immediately" idea doesn't make things a lot safer. There's nothing that stops a computer from crashing right in the middle of this immediate write, and it can be left off at any possible point, like stopping at truncation or writing half the file.

      But OK, suppose this goes on fine. But while this immediate write was in progress, the rest of those 40 files queued up, and the optimizer decided the most efficient way to do things is to first write all the truncations, then write the data. And if things crash somewhere in the middle of that you've got exactly the same issue again, immediate writes or not.

    175. Re:Bull by Just+Some+Guy · · Score: 1

      Hmm. I wonder why "sync" is not a default mount option?

      It is on the BSDs.

      --
      Dewey, what part of this looks like authorities should be involved?
    176. Re:Bull by belphegore · · Score: 1

      There's another "use case" too which is not a kernel crash or power outage (well, not in the sense you mean). I have seen many non-technical computer users power off their machines when they're done using them by just holding down the power button until the screen goes dark. Going through the whole "click menu -> shutdown -> wait 10 minutes" routine is too much trouble. And they're not wrong -- why should I sit there and wait for 10 minutes to turn my machine off when I'm done using it?

    177. Re:Bull by grumbel · · Score: 1

      How would using the sync option differ from every program being implemented correctly and using fsync()? Wouldn't both of them just result in 'dog-slow'?

    178. Re:Bull by Anonymous Coward · · Score: 0

      (Sigh). The code in KDE doesn't do that.

    179. Re:Bull by Anonymous Coward · · Score: 0

      Yeah, disks have gotten a lot faster these days, but the files being written to them have gotten bigger, not to mention more and more apps running simultaneously. If anything, I think the gap in speed/latency as you go up and down the tiers of memory (cache, main RAM, HDD) has actually gotten larger, meaning that buffered I/O has become *more* important for performance.

    180. Re:Bull by gweihir · · Score: 1

      Data that userland applications WRITES TO DISK is critical. If the filesystem takes its sweet time about actually doing the write, it's not the application's fault. And no, calling fsync() or fdatasync() constantly is no good, because that really does make your performance poor.

      Your argumentation is faulty: If the filesystem writes everything immediately, that is equivalent to calling fsync() constantly. You can either have it fast or reliable but not both.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    181. Re:Bull by gweihir · · Score: 1

      I said "start-up critical". If your system does not boot properly if a write fails, that is a different level than application data being lost.

      I think you did not understand my posting at all.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    182. Re:Bull by Anonymous Coward · · Score: 0

      Someone with at least a hint of logic rather than pure cognitive dissonance swilling round my brain.

    183. Re:Bull by Anonymous Coward · · Score: 0

      Well, there are a few fundamental differences between L4D and uTorrent on the one hand and Firefox and Ventrilo on the other.

      L4D and uTorrent are both designed to deal with fairly large chunks of data that, on most systems, cannot fit in the computer's memory all at once. Firefox and Vent (especially) deal with much less data that starts or ends up on your system's hard disk.

      L4D deals with lots of 3D and 2D assets. All those assets are on your hard disk. The source engine no doubt has its own internal logic to make sure that only the more important stuff is actually in memory at any given time. In addition, some of those assets may have been paged out and just caused a page fault. Whatever caused it to hang, it shouldn't be surprising that Valve didn't decide to damn the torpedoes and keep running when the assets they're trying to draw are not currently available.

      uTorrent deals with data of an unknown size (could be 40 kilobytes, could be 40 megabytes, could be 40 gigabytes) that will be accessed in a semi-random fashion. In addition, it tries to do so without consuming huge amounts of memory. This means a few buffers in memory that get written to disk fairly often, but otherwise it tries to keep most of that data on disk. How it failed so spectacularly, I couldn't say, but that it failed shouldn't be surprising.

      Firefox, in contrast, is usually only dealing with less than a few hundred megabytes of text and images. It also has the advantage that most of the resources it's dealing with are available from sources other than your hard drive if it comes down to it. And as you mentioned, you were using a version with extra thought put into working without disk access.

      Ventrilo should be the least surprising. It's a very light-weight application in terms of hard-disk access. It loads itself into memory when you start, and that's probably the end of it. The only on-disk resources are the executable and a few tiny GUI assets and default sounds.

      To go with a good old slashdot standby and make a car analogy, it's like you're comparing how a guy on a bicycle and a Mack truck handle an unexpected road hazard. You're criticizing the truck when it can't stop as fast as the bicycle.

    184. Re:Bull by Prof.Phreak · · Score: 1

      Depends.

      If my program wrote data, terminated, I did ls on the folder and saw file, and then computer lost power... and file was gone. Yes, it's a bug and is an OS issue. (ie: upon program termination, when all open files are closed, they should be flushed to disk, before program successfully "exits").

      If my program had a file open, writing, and computer lost power, I really wouldn't expect anything meaningful in the file after the crash, so... 2.5 minutes isn't an issue. (ie: database apps should be careful to do a flush [and flush should do just that] after a write to open files).

      --

      "If anything can go wrong, it will." - Murphy

    185. Re:Bull by Prof.Phreak · · Score: 1

      Eh. I always assumed that closing the file implicitly flushes disk. Is that not true?
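
      As far as I know it is not: close() only hands any remaining buffered data to the kernel, and the Linux close(2) man page says explicitly that a successful close does not guarantee the data has been saved to disk. A small sketch of the three layers involved, in Python (the function name and the `durable` flag are just for illustration):

```python
import os


def save(path, text, durable=False):
    """Write `text` to `path`; only force it to storage when `durable` is set."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)              # lands in Python's userspace buffer
        if durable:
            f.flush()              # userspace buffer -> kernel page cache
            os.fsync(f.fileno())   # kernel page cache -> storage device
    # Leaving the `with` block calls close(), which flushes the userspace
    # buffer to the kernel but does not by itself wait for the disk.
```

      With durable=False the file is visible to other processes right away (an ls will show it), yet its contents may still live only in RAM until the filesystem gets around to writing them.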

      --

      "If anything can go wrong, it will." - Murphy

    186. Re:Bull by spitzak · · Score: 1

      It certainly IS the intention of rename in any program I have ever seen that just wrote the file!

      I agree with the GP, a rename should force the file's contents to be up to date before the rename is committed. This is not going to impact the performance of anything that does not rename a file or anything that renames a file that is just sitting flushed on the disk.

    187. Re:Bull by spitzak · · Score: 1

      Do NFS and AFS support atomic renames?

      NFS does, but it may require the host to support atomic renames.

      Atomic rename is a HUGE part of POSIX so breaking it in a commonly used system like NFS would be bad and was certainly supported in original implementations. However I would not be surprised if NFS support on Windows or some filesystems just implements it with a non-atomic rename.

      It is a complete mystery why Windows refuses to add an atomic rename, since it is REALLY useful. My suspicion as to why this and a few other quite easy things are never fixed (like newlines) is that there is a concerted effort to make sure that code cannot be too easily ported between Windows and Unix.

    188. Re:Bull by jonadab · · Score: 1

      Personally, I still consider write cacheing to be a fundamentally bad idea. The process doing the writing should block until the data is ON THE DISK, physically (or until an error comes back saying it can't be written).

      Other applications, on the other hand, should be able to continue about their merry way as if nothing were happening. This is where most current operating systems completely fall down, IMO. If a background process is flogging the disk, the whole system becomes unresponsive. That shouldn't happen. Only the application that's doing all that heavy disk work should be slow. (Frankly, I don't care how slow that background process gets; let the thing that updates locatedb run for a month and a half, so what?) Other processes should still get their normal time slices AND should have access to the disk during that time.

      The other thing is that the window manager should have a way (a privileged way not available to other programs) to inform the kernel which process has keyboard focus at the moment, because that process should get a significantly larger portion of the available CPU time (if it needs it) than anything in the background.

      --
      Cut that out, or I will ship you to Norilsk in a box.
    189. Re:Bull by jonadab · · Score: 1

      > Why should synchronous writes be the default?

      Because it's what the user wants and expects 99.9987849782% of the time, and almost no applications bother with it every time they should.

      Of course, Unix systems generally have a sync mount option, which forces *all* writes to be done synchronously. But then that includes stuff like the cron job that updates locatedb, which SHOULD be asynchronous.
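
      There is a middle ground between the mount option and sprinkling fsync() everywhere: a program can ask for synchronous writes on a single file by opening it with O_SYNC. A minimal sketch, assuming a Unix-like system where os.O_SYNC is defined (the function name is invented):

```python
import os


def write_synchronously(path, data):
    """Write `data` (bytes) to `path`; each write() returns after the sync."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # With O_SYNC the kernel does not return from write() until the
            # data has been handed to the storage device.
            view = view[os.write(fd, view):]
    finally:
        os.close(fd)
```

      That way the word processor blocks on its own saves while the locatedb cron job stays asynchronous. Whether the bytes are on the platter or merely in the drive's own cache at that point is up to the hardware.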

      --
      Cut that out, or I will ship you to Norilsk in a box.
    190. Re:Bull by spitzak · · Score: 1

      That has been false for a very long time, and not just on Linux.

    191. Re:Bull by spitzak · · Score: 1

      The problem is not the speed of the write, it is the order.

      If a system with EXT3 crashed inside those 5 seconds, the result is NOT as bad as this.

      In EXT3, a program doing rename(A,B) while the system crashed would result in either B having its old contents, or the rename having completed and thus having A's contents.

      In EXT4 the crash could result in B being a partially-written version of A. Thus the old contents of B are lost, and so are the contents of A.

      The reason this is critical is that the usual method of changing one of the fields in file B is equivalent to this:

          sed -e s/Foo/Bar/ <B >A
          mv A B

      After a crash it would be nice if B was exactly the same as before, except maybe it contains "Bar" or maybe it contains "Foo". People are confusing this with some requirement that it always contain "Bar", which is impossible (you can always pick a point at which the system crashes where this won't happen). The problem with EXT4 is it can result in B containing *nothing*, not "Foo" or "Bar"!

      The correct solution, as pointed out by about 10 intelligent posters above, is for EXT4 to make rename force the blocks of A to be put on disk before the rename is committed. Note that this could happen a week later!

    192. Re:Bull by Abcd1234 · · Score: 1

      HOWEVER, unix application writers are NOT expected to do anything more, in particular, they are not expected to use fsync() on each file.

      Huh? Unix application writers who develop against the POSIX spec are *specifically* expected to call fsync() if they want to guarantee that the contents of their files are written out to disk.

      FURTHERMORE, fsync have performance issues, in particular, on ext3, fsync() will flush ALL the data for the specified file system, resulting in extraordinary bad performance.

      Sure. So I would strongly suggest only doing that for cases where it's really important to ensure the contents of the file are actually written out to disk. For example, the Firefox disk cache need not call fsync(). After all, who cares if a little data is lost, there? But OpenOffice, when saving a file, probably should if they want to provide a guarantee that a document written out to disk really is written out.

      Every coder of every unix application everywhere should start caring about the fact that ext4 is fragile ?

      No, unix application writers should stop writing shitty applications that don't handle failure cases properly. In short: is it Ext4's fault that stupid developers didn't code to spec? No, of course not. Ext4 just made it blatantly obvious that they were making invalid assumptions about the behaviour of the POSIX I/O functions. Their code could just as easily fail on a network filesystem where the ethernet cable popped out of the jack before the file contents could be written out.

    193. Re:Bull by spitzak · · Score: 1

      I respectfully disagree.

      The filesystem *IS* that "centralized database using a standardized API". The filename is the key, the file contents are the value.

      The Windows registry, INI files, and the majority of files in /etc that store more than one thing, are simply symptoms of the misdesign of file systems so that many small files are inefficient.

      The files don't have to "litter" anything. Put them in a damn subdirectory.

      This bug (and the Windows registry) indicates that people just don't get it and still don't get it. Even attempts for Unix to make many files keep doing stupid things like making the file contents be XML or have comments. The "value" should be the bytes in the file!

      Yea, one of the biggest proponents killed his wife, but it does not mean the idea is wrong. It is correct.

    194. Re:Bull by kithrup · · Score: 1

      Ah, I see, a communications failure -- see, most people don't call it a "null [...] implementation," when it actually does work. Work that is quite demonstrable.

      I shall add this new, unique to trolls, definition of "null implementation" to my dictionary.

    195. Re:Bull by makomk · · Score: 1

      The trouble is that, for ext3 with the default settings, fsync doesn't just flush the one file - it flushes all pending data for that filesystem, often in a particularly seek-heavy order. Under some fairly common circumstances, this kills I/O performance until the fsync finishes, glitches any sort of heavy streaming I/O such as video playback or recording, wrecks system responsiveness, and generally causes havoc. (It also often takes over a second, during which time your application is totally unresponsive.)

      As I recall, one Firefox release used fsync to protect its database from issues due to crashes. It was promptly disabled due to all the issues it caused.

    196. Re:Bull by harlows_monkeys · · Score: 1

      Blaming it on the applications is a cop-out

      These applications can lose data even on a filesystem that uses no caching at all, but immediately writes every change to disk. That means the apps are broken. All that's different here is that modern filesystems have a bigger window in which the application bug can be exposed.

    197. Re:Bull by makomk · · Score: 1

      Sure. So I would strongly suggest only doing that for cases where it's really important to ensure the contents of the file are actually written out to disk. For example, the Firefox disk cache need not call fsync(). After all, who cares if a little data is lost, there?

      The trouble is, there are circumstances in-between. For example, suppose you have a config file containing some relatively frequently-updated data. The correct approach is to create a new file, write the contents to that, call fsync, and atomically rename it over the old file. On ext3, you could skip the fsync - the worst that would happen is that, if the system crashed, the latest update would be lost, which isn't a big deal. On ext4, there's at least a 2 minute interval after doing so where any crash will result in the file contents being totally lost. It turns out that a lot of apps do this - hence all the trashed config files.

      What we really want is a guarantee that, after a crash, the rename won't go through unless the new file contents have been written OK. Unfortunately, file systems don't actually guarantee this.

    198. Re:Bull by 7+digits · · Score: 1

      > I think you did not understand my posting at all.

      I think I did, but we can agree to disagree. The point is that your "startup-critical files" are actually not critical at all, compared to the 50-page thesis that the user is working on. So, if you want to argue that fsync() should be used for gnome/kde configuration files, then it should also be used for cp...

    199. Re:Bull by makomk · · Score: 1

      No, it's worse than that. A common way of saving a file is to write the data to a new file, then atomically rename it over the original file. Under ext3, this is safe. Under ext4, the rename will still go through after the same 5 seconds or so, well before the data is actually written, so there's a window of up to 2 minutes where the file on disk has been replaced by an empty file. If the computer crashes within that window, you lose the file contents totally.

    200. Re:Bull by jammyd · · Score: 1

      It says I must fsync the directory, and nothing in Posix even says it's possible to open() or fsync() a directory; you have to use opendir().)

      *Bzzt*

      BigMac:~ james$ cat > open_dir.c
      #include <fcntl.h>
      int main() { return open(".", O_RDONLY); }
      BigMac:~ james$ gcc -o open_dir open_dir.c
      BigMac:~ james$ ./open_dir ; echo $?
      3
      BigMac:~ james$ uname -a
      Darwin BigMac.local 9.6.0 Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386 i386

      The "Single Unix Specification" you mentioned says you can open() a directory with O_SEARCH, too. MacOS doesn't have that.

      Either way, I expect you'll be able to fsync() that fd; that's all you need.
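
      The same experiment extended to the fsync() step, as a sketch in Python rather than C. The function name is made up, and since nothing guarantees that a directory can be opened or synced on every platform, it reports failure instead of assuming success:

```python
import os


def fsync_directory(dirpath):
    """Try to flush a directory's entries to disk; return True on success."""
    try:
        fd = os.open(dirpath, os.O_RDONLY)
    except OSError:
        return False    # this platform will not open() a directory
    try:
        os.fsync(fd)
        return True
    except OSError:
        return False    # opened fine, but fsync() on it was refused
    finally:
        os.close(fd)
```

      A caller would run fsync_directory(os.path.dirname(path)) after a rename if it wants the new directory entry itself to be durable, and fall back to something else when it returns False.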

    201. Re:Bull by Profane+MuthaFucka · · Score: 1

      OK, then on the Mac, what the fuck is this for then?

      ret = fcntl(file, F_FULLFSYNC, NULL);

      Besides, you already admitted that fsync does NOT write the fucking data to the disk. Getting the data anywhere but the disk is ALLOWABLE by POSIX, but it's fucking stupid anyway.

      Whatever work that fsync() on the Mac does, actually writing data to the disk is NOT PART OF IT.
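
      For anyone who wants the stronger behaviour where it exists, a hedged sketch in Python: use F_FULLFSYNC when the fcntl module exposes it (it does on Mac OS X) and fall back to plain fsync() elsewhere, or when the request is refused. The function name is invented.

```python
import fcntl
import os


def full_sync(fd):
    """Flush `fd` as far down the storage stack as the platform allows."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        try:
            # Mac OS X: also asks the drive to flush its own write cache.
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return
        except OSError:
            pass    # not supported for this file; fall through to fsync()
    os.fsync(fd)
```
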

      So, you call me a troll. I call you an ass because you're just like every other toe-picker assburger's syndrome programmer I've ever met.

      http://www.scribd.com/doc/4246806/eat-my-data

      Read that shit and learn something, infant.

      Oh, and I jerk off all over your face too. Listen to me grunt!

      --
      Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    202. Re:Bull by Eskarel · · Score: 1

      True, but a laptop is an even less appropriate place to not commit data for 2.5 minutes because it's much more likely to have an unexpected power fault.

    203. Re:Bull by gweihir · · Score: 1

      There are different ways to deal with it. Having written a 150 page thesis recently, I can assure you that I had it in a subversion repository and had several checkouts in different places.

      In addition, most text editors rename the old file and write to a new one. Data is not lost with this, as file rename is atomic in POSIX and with good reason. At most the last edit is lost, and that is acceptable in my book.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    204. Re:Bull by 7+digits · · Score: 1

      > There are different ways to deal with it. Having written a 150 page thesis recently, I can assure you that I had it in a subversion repository and had several checkouts in different places.

      I remember the good old ext2 days, where my linux box could just die when the power was cut during disk activity, while my freebsd one was rebooting without issues. Of course, the bsd box was journalled, but linux guys were basically saying "if you care about your data, use a UPS" (that was waaay before ext3).

      So, my point is that not everyone will use subversion and backup to different devices just to guard against a low occurrence event (power failure) that is generally harmless. But ext4 just changed from "generally harmless" to "potentially lethal" by growing the window of data loss from 5 seconds to 120 seconds (of course, Theodore Ts'o is already working on a patch, so I guess it means that, yes, it is a bug...)

      > In addition, most text editors rename the old file and write to a new one

      Renaming the old file to write a new one kills hard links, so I know at least some editors that used to avoid that...
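
      A quick way to see the hard link problem, sketched in Python. The function names, file names and the "~" backup suffix are just for the demonstration, and fsync is left out to keep it short:

```python
import os


def save_with_backup(path, text):
    """Editor-style save: move the old file aside, then write a fresh one."""
    os.rename(path, path + "~")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def demo(workdir):
    original = os.path.join(workdir, "notes.txt")
    alias = os.path.join(workdir, "alias.txt")
    with open(original, "w", encoding="utf-8") as f:
        f.write("old")
    os.link(original, alias)            # second name for the same inode
    save_with_backup(original, "new")
    with open(original, encoding="utf-8") as a, open(alias, encoding="utf-8") as b:
        # After the save the two names no longer refer to the same file.
        return a.read(), b.read(), os.path.samefile(original, alias)
```

      Running demo() on a scratch directory returns ("new", "old", False): the hard link keeps following the old inode, which is now the backup copy, and never sees the edit.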

      > At most the last edit is lost, and that is acceptable in my book.

      Unless you save twice in 120 seconds, of course...

      Have a nice day.

    205. Re:Bull by complete+loony · · Score: 1

      Yeah, it's a tough problem.

      While moving config to a database would help, it adds a dependency on the database driver and GUI wrapper if you ever want to change something.

      I guess what I was thinking of could be done using POSIX semantics with a journal, and you could write a friendly wrapper around it.

      It would be nice to have an optional kernel interface implemented in the filesystem, falling back to POSIX semantics when it isn't supported.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    206. Re:Bull by jonadab · · Score: 1

      > > Why should synchronous writes be the default?
      > Because it's what the user wants and expects

      I suppose I should elaborate on this a bit...

      Old Macs used to allow the user to drag a diskette icon to the trash and eject it even if there were files on that diskette that were currently open. Naturally, when you then tried to do anything at all with the windows in question, including close them, the Mac would ask for the disk back, because it wanted to cleanly close the file, write any pending changes, and so on. Well, guess what? The disk is gone. Why do you think the user ejected it in the first place? They were leaving the computer lab and they wanted to take their disk with them. The disk has left the building. You may as well ask for a billion dollars as ask for the disk back.

      Fast forward to 2002. Windows XP has built-in support for creating data CDs, if you've got a CD-R drive. You just open My Computer, and you drag the files you want to put on the data CD, and you drop them on the little CD icon in the My Computer window. And then, if you're the sort of person who understands the relevance of the number 9660 to this workflow and knows that the phrase "random access" does *not* apply, you double-click the CD icon to get a window with the "write these files to disk" option in the sidebar, which you click. But normal users don't do this. Normal users drag the files onto the CD icon and drop them there, and then when the operation appears to have completed they eject the disk, take it, and leave. This is why Windows XP computers always have "files waiting to be written to the CD". Later the users are absolutely mystified that the files they put on the disk are not there. Did they bring the wrong disk? They *thought* this was the right one...

      And while we're at it, we may as well talk about USB mass storage devices, because, again, as soon as the operation appears to complete, they yank the thing out of the port immediately. Always. "Remove safely"? I've NEVER seen a normal user bother with that. Ever. (*I* do it, but I have also been known to compile Emacs from source because binaries were not yet available of the version I wanted, so obviously I don't count as a normal user.)

      Users expect the data to be on the disk when the save or copy or whatever appears to complete. This is ALWAYS what they expect. It's always what they WANT. Under no circumstances would they ever prefer for the actual write to happen later in the background for improved performance.

      --
      Cut that out, or I will ship you to Norilsk in a box.
    207. Re:Bull by fl1ckmasterflex · · Score: 0

      The point is when there is a system crash or lockup, you expect the old version of the file, not 0 byte files.

      I suspect people beta testing a kernel would run into this often..

    208. Re:Bull by fnord_uk · · Score: 1

      The more distant the target, the more you have to lead

      Are you sure about this bit?

      If a target is 100m away and moving perpendicularly to my line of sight at 1m/s, and my bullet travels at, say, 100m/s, I'd have to lead him by 10 milli-radians, roughly.

      If the same target was 1km away, and moving in a similar fashion, my bullet would take 10 seconds to reach him, and he'd have travelled 10m in that time, subtending an angle of, wait for it ... 10 milli-radians.

      Now I suck at trigonometry, so I could be wrong, but I'd be willing to bet your life on it. Please stand next to the 100m marker (I'm not sure I'd make the shot at 1km) and get ready to run when I shout 'now'.

      --
      In theory, theory and practice are the same. In practice, they're not.
    209. Re:Bull by slamb · · Score: 1

      Yes, you can open() a directory on Mac OS X. That in no way contradicts my statement about the standard.

      The "Single Unix Specification" you mentioned says you can open() a directory with O_SEARCH, too. MacOS doesn't have that.

      Where did you get that idea? At least in the version I saw (http://www.unix.org/single_unix_specification/), the open documentation does not mention O_SEARCH or directories at all.

      Either way, I expect you'll be able to fsync() that fd; that's all you need.

      The Subversion code I linked to explicitly said that opening a directory fails on some platforms, so your expectation is incorrect.

    210. Re:Bull by gweihir · · Score: 1

      I basically agree with you. However, I think the 150 seconds are less of a bug and more bad judgement. Your "save twice" example is actually a very good argument against a commit interval this long: unlikely to happen within 5 seconds, relatively likely within 150 seconds.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    211. Re:Bull by moonbender · · Score: 1

      Err... it is? I had the impression that laptops have their own internal UPS.

      The only way to unexpectedly remove the power (apart from software problems) would be to physically remove the battery, so, uh, don't do that. Of course the battery could run down, but that's the opposite of unexpected -- I agree that you should reduce the commit delay if the battery is below a certain threshold (say, 10%).

      --
      Switch back to Slashdot's D1 system.
    212. Re:Bull by Galactic+Dominator · · Score: 1

      only for metadata, data IO is done asynchronously by default.

      --
      brandelf -t FreeBSD /brain
    213. Re:Bull by Nick+Ives · · Score: 1

      So you think it'd be better for KDE to be constantly telling your filesystem to sync? What about when you have lots of apps open too, some of them non-KDE? Then you'll have lots of apps telling the disk to sync all according to their own schedule.

      --
      Nick
    214. Re:Bull by Nick+Ives · · Score: 1

      Ignore what I just said, I completely misunderstood this issue. Ts'o is right on this; I use XFS (best performing FS IMO) and get this behaviour. Most application writers have decent sync behaviour because of XFS users like myself complaining when something goes wrong, time for the rest to catch up!

      --
      Nick
    215. Re:Bull by billcopc · · Score: 1

      A sync mount option would be good, for those scenarios where it is absolutely critical, but it is no replacement for proper I/O programming.

      There is no OS tweak that can magically solve all the problems idiotic developers foist upon the world on a daily basis.

      --
      -Billco, Fnarg.com
    216. Re:Bull by billcopc · · Score: 1

      Users expect the data to be on the disk when the save or copy or whatever appears to complete.

      Yes, for removable media. A hard drive is not removable media, unless you're me, and I'm pretty sure you're not me otherwise who the hell am I ?

      USB/portable drives should always be synchronous, because they're so easy to unplug. They're designed for it, so the OS should accommodate that usage pattern. But on a hard drive ? Normal people don't eject an internal drive, and if you're in the 0.1% who do (gung-ho techies), then you're responsible for telling the OS to safely unmount the damned thing, because you're doing something out-of-the-ordinary.

      The Windows CD burning issue is a particularly dumb one, because it is so easily solved by locking the drive. If the user tries to eject, it's trivial to pop up a dialog asking to finalize the disc (or cancel and ditch the files). The fact that it is not done is a design failure.

      --
      -Billco, Fnarg.com
    217. Re:Bull by Harik · · Score: 1

      I'm sorry, what exactly is so horrible about holding off the metadata write until after the data is on disk? The fact that the filesystem CAN reorder the transactions and stay within spec doesn't mean it SHOULD. It's pretty fucking stupid to write 'this data is here' into the log before, you know, actually putting the data there. And that includes blocking other updates - you may in memory rename the file over the original, but on disk the dirent shouldn't be overwritten until the replacement metadata is written, which shouldn't be written until the data is committed to disk.

      And hell, that's the exact reason why fsync in userspace DOESN'T work - you can't "just write one file", it has to flush ALL remaining metadata, in a bad seek-order. Hans and his "dancing trees" were supposed to do something about this, but reiserfs is one of the worst offenders in this regard.

      Per the spec - it's entirely possible that if I do the write/rename trick on two separate files, then the system crashes, the one I did second would have "taken", and the first would not have. THAT is where you use fsync, and it's an accepted part of design. What's not acceptable is doing a write/rename and having the filesystem trash both copies!

    218. Re:Bull by Harik · · Score: 1

      Yup, your math is right, it pretty much cancels out.

      World class runner can move at 4.5m/s, .22 round may move at say 400m/s.

      so the formula is atan(running * time / distance)

      atan(4.5 * .25 / 100) and
      atan(4.5 * 2.5 / 1000)

      Which are identical, obviously. So your lead would be identical for both shots.
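
      A quick way to check that arithmetic, as a rough sketch: it assumes a constant bullet speed and a target crossing exactly perpendicular to the line of sight, and ignores drag and gravity. The function name lead_angle is made up for this example.

```python
import math

def lead_angle(target_speed, bullet_speed, distance):
    """Lead angle in radians for a target crossing perpendicular to the line of sight."""
    flight_time = distance / bullet_speed           # seconds until the bullet arrives
    return math.atan(target_speed * flight_time / distance)

# Same runner (4.5 m/s) and same bullet (400 m/s) at two different ranges.
near = lead_angle(4.5, 400.0, 100.0)
far = lead_angle(4.5, 400.0, 1000.0)
print(near, far)
```

      The distance cancels out of target_speed * (distance / bullet_speed) / distance, so under these assumptions the angle depends only on the ratio of the two speeds.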

    219. Re:Bull by Harik · · Score: 1

      You're misunderstanding. Atomicity means what you SHOULD lose is the fact that you changed your desktop background, not everything.

      Either old Or New - atomicity. NOT random data, NOT gone entirely.

    220. Re:Bull by Harik · · Score: 1

      The problem IS NOT that the update was lost. The problem is that when you do a write/rename sequence, ext4 happily reorders that to rename/wait 2 minutes/write. That means ext4 is deleting the 'old' copy of data then waiting two minutes to write the new.

      And that's absurd. There's absolutely ZERO reason that rename (and the destruction of the original file) should happen on disk before the data is committed. It's just a horribly bad design decision that needs to be fixed. I don't care if it is "in the spec", filesystems have had such horrible fsync performance for so long that you can't blithely ignore the fact that userspace is doing write/rename to preserve files. ESPECIALLY when you STILL have the fsync performance issue!

      This is absolutely NOT a userland issue. If every application that updates a file has to use fsync or EXT4 will happily delete the old and new copies, well, then you might as well just mount the partition sync then. It's a ridiculous requirement.

    221. Re:Bull by DavidRawling · · Score: 1

      While the maths is correct, it supposes that humans determine angles rather than distances when estimating "how far" to lead. I submit this is not the case.

      We don't determine the angle by which we need to lead the target. We estimate the distance to the target, its speed, and the distance for the projectile to travel and ITS speed, and thusly determine the distance by which we must lead the target.

      This leads inexorably to the conclusion that we must lead the target by a greater distance - which, holy crap, we must. The target really does travel further in the greater time. The runner runs further away before getting shot. And the journal guarantees more transactions before the originals are committed.

    222. Re:Bull by Harik · · Score: 1

      KDE doesn't care that you may have to change your settings again.
      KDE cares that when given two files, one which was guaranteed to be on disk, and one that might be on disk, the kernel prefers to instead give you a zero byte file after a crash.

      The fact there's a 5 second window in EXT3 is a bug too, there should be ZERO TIME where the rename is committed to disk but the blocks are not. That's a faulty design by arrogant filesystem designers hiding behind "POSIX SAYS IT'S UNDEFINED SO I DEFINE IT THIS WAY."

      There's absolutely zero reason to ever use a journaled filesystem if it's DESIGNED to trash data like this - why? Because if the only alternative is to sync every write, why have the fucking journal at all?

      The state on disk and state in RAM need to be considered separately, and every write needs to leave the disk in a sane state. And no, deleting a file and replacing it with a zero-byte file, _THEN_ eventually getting around to filling in that file? That's not a sane state. Forcing every application to write sync? Dude, if you thought atime updates slowed down a filesystem...

    223. Re:Bull by Harik · · Score: 1

      I challenge you to try it yourself before spewing opinions out your gaping anus.

      Take a linux box with say ext2 or ext3. Run on it a while, get used to how it works.

      Now, mount -o sync your partitions. Enjoy the blazing speed of your modern hardware!

    224. Re:Bull by mysidia · · Score: 1

      Journalling and logging are filesystem integrity preservation features, not application transactional integrity assurance features.

      Even in Ext3, by default, the filesystem is in 'ordered' mode, and essentially, only the filesystem metadata has assured integrity.

      NTFS, XFS, and others, are the same way.

      There's nothing broken about Ext3 or Ext4.

      Your expectations aren't consistent with the basic requirements of filesystem design -- that is, acceptable performance is required.

      If you want the super-guarantee that all data is instantly on disk, you can tune your filesystem accordingly, there are advanced options in Ext3 and Ext4, that you can set, when creating the filesystem, when tuning it and setting your journal parameters with tune2fs, and when mounting it with data=journal,commit=1.

      And finally, with system sysctls, like

      sysctl vm.dirty_expire_centisecs=10
      sysctl vm.pagecache=5

      If integrity to the millisecond in case of power failure were really in much demand, there would of course be more precise options to achieve the best integrity (granted horrible filesystem performance).

    225. Re:Bull by mikiN · · Score: 1

      "Those who sacrifice reliability for responsiveness, deserve neither." 'Nuff said.

      --
      The Hacker's Guide To The Kernel: Don't panic()!
    226. Re:Bull by fnord_uk · · Score: 1

      Perhaps. But, you don't get up and move (strafe?) sideways to take the shot. You just point your weapon at a different angle.

      --
      In theory, theory and practice are the same. In practice, they're not.
    227. Re:Bull by Harik · · Score: 1

      You're still not getting it. What people are bitching about is the catastrophic data loss due to committing the metadata change before the data. If neither hits the disk, you get the old copy. If both hit the disk, you get the new copy. It's the middle that's the hairy case - if you commit the data first, then the metadata, everything is fine. If you commit the metadata first THEN the data, it means you're pointing at garbage if it crashes in a two minute period - AND it's already deleted the previous file.

      This promotes a simple 'regression of one change on a complete crash' to a 'wipe out of all configuration data'. THAT is the problem.

      POSIX being "unspecified" isn't good enough to absolutely guarantee dataloss on a system crash. If I don't fsync my changes, they may not go through, but if they don't go through they shouldn't destroy something else! That's just bad design.

  7. Works as expected... by gweihir · · Score: 5, Insightful

    The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.

    Anyways, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody that develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power-failure.

    Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:Works as expected... by Anonymous Coward · · Score: 0

      Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

      Sounds like the average Ubuntu user to me ...

    2. Re:Works as expected... by NorthWay · · Score: 1

      >Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

      Or go for some other tech that can detect it. Like end-to-end checksumming.

    3. Re:Works as expected... by girlintraining · · Score: 5, Insightful

      Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

      You're right, there really is nothing to see here. Or rather, there's nothing left. As the article says, a large number of configuration files are opened and written to as KDE starts up. If KDE crashes and takes the OS with it (as it apparently does), those configuration files may be truncated or deleted entirely -- the commands to re-create and write them having never been sync'd to disk. As the startup of KDE takes longer than the write delay, it's entirely possible for this to seriously screw with the user.

      The two problems are:

      1. Bad application development. Don't delete and then re-create the same file. Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System, whatever the four color glossy told you.

      2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (drivers and other code operating in privileged space is obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.

      --
      #fuckbeta #iamslashdot #dicemustdie
    4. Re:Works as expected... by Anonymous Coward · · Score: 0

      Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

      Like expectations that a filesystem not lose data just because you have small files?

    5. Re:Works as expected... by gweihir · · Score: 3, Insightful

      I agree on both counts. Some comments

      1) The right sequence of events is this: Rename old file to backup name (atomic). Write new file, sync new file and then delete the backup file. It is however better for anything critical to keep the backup. In any case an application should offer to recover from the backup if the main file is missing or broken. To this end, add a clear end-mark that allows to check whether the file was written completely. Nothing new or exciting, just stuff any good software developer knows.
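
      A minimal sketch of that sequence, assuming a single writer and a file small enough to read into memory. The ".bak" suffix, the END_MARK value and both function names are invented for illustration; they are not taken from KDE or any real library.

```python
import os

END_MARK = b"\n#END\n"  # hypothetical end-mark appended to every complete file

def _is_complete(path):
    """True if the file exists and ends with the end-mark."""
    try:
        with open(path, "rb") as f:
            return f.read().endswith(END_MARK)
    except FileNotFoundError:
        return False

def save_with_backup(path, data):
    backup = path + ".bak"
    # Only rotate a file that was written completely, so a broken
    # main file never overwrites a good backup.
    if _is_complete(path):
        os.replace(path, backup)          # atomic rename to the backup name
    with open(path, "wb") as f:
        f.write(data + END_MARK)
        f.flush()
        os.fsync(f.fileno())              # ask the OS to push the new file to disk
    if os.path.exists(backup):
        os.remove(backup)                 # or keep it, for anything critical

def load_with_recovery(path):
    """Return the payload from the main file, falling back to the backup."""
    for candidate in (path, path + ".bak"):
        try:
            with open(candidate, "rb") as f:
                raw = f.read()
        except FileNotFoundError:
            continue
        if raw.endswith(END_MARK):
            return raw[:-len(END_MARK)]
    raise FileNotFoundError("no complete copy of " + path)
```

      After a crash in the middle of save_with_backup, the main file is either missing or lacks the end-mark, and load_with_recovery falls back to the backup copy.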

      2) Yes, a kernel should not crash. Occasionally it happens nonetheless. It is important to notice that ext4 is blameless in the whole mess (unless it causes the crash).

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    6. Re:Works as expected... by hedwards · · Score: 1

      I had to reread that a few times, but it seems like a compelling argument for COW filesystems with a backend scrubber to fix any suboptimal file placements. I'm sure it isn't quite as fast, but unless the disk is constantly accessed it's probably not going to make much of a negative impact.

      And depending upon the type of files most handled one could probably optimize where the default placements are made.

    7. Re:Works as expected... by somenickname · · Score: 1

      2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (drivers and other code operating in privileged space is obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.

      I've been an Ubuntu user for years and I'm always amazed at the state of the kernel during the alpha and beta releases (and at times even the final release). I generally run the latest vanilla kernel from kernel.org. It's always just fine. I honestly don't know how they fuck it up. I've custom compiled kernels for years and never had a kernel panic except when I'm running an Ubuntu kernel.

    8. Re:Works as expected... by fishbowl · · Score: 1

      >add a clear end-mark that allows to check whether the file was written completely.

      Why do you assume the file was persisted in top-down fashion? Better make that "end-mark" a checksum.

      --
      -fb Everything not expressly forbidden is now mandatory.
    9. Re:Works as expected... by gweihir · · Score: 1

      Depends on the write-pattern, Unix is very good at keeping sequential writes sequential (as I understand the KDE writes are). If you do seeks, then you should indeed use a checksum.
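
      For the checksum variant, a rough sketch: SHA-256 is just one possible choice of checksum here, and seal/unseal are names made up for the example.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def seal(payload: bytes) -> bytes:
    """Append a checksum of the payload so truncation or corruption is detectable."""
    return payload + hashlib.sha256(payload).digest()

def unseal(blob: bytes):
    """Return the payload if the trailing checksum matches, otherwise None."""
    if len(blob) < DIGEST_LEN:
        return None
    payload, digest = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    return payload
```

      Unlike a plain end-mark, this also catches a file whose tail made it to disk while a block in the middle did not.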

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    10. Re:Works as expected... by Tokerat · · Score: 1

      1) The right sequence of events is this: Rename old file to backup name (atomic). Write new file, sync new file and then delete the backup file.

      Really? Much better: Write new file into a temp file, sync, whatever you need to do. When you're done, delete original and rename the temp to the original's name. This way, if you die with half a file, you've got a corrupted temp file and not a corrupted original.

      --
      CAn'T CompreHend SARcaSm?
    11. Re:Works as expected... by gweihir · · Score: 1

      I've been an Ubuntu user for years and I'm always amazed at the state of the kernel during the alpha and beta releases (and at times even the final release). I generally run the latest vanilla kernel from kernel.org. It's always just fine. I honestly don't know how they fuck it up. I've custom compiled kernels for years and never had a kernel panic except when I'm running an Ubuntu kernel.

      Interesting. Following the discussion, I have wondered how they managed to crash the kernel in the first place. I have almost no experience with Ubuntu kernels, but I have been using my own kernel.org compiles for almost a decade now, mostly with debian. I have had absolutely no crashes with the mainline kernels in quite some time. The only issues I had was extreme delays with faulty SATA hardware, but still no crash even then.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    12. Re:Works as expected... by kasperd · · Score: 4, Insightful

      Write new file into a temp file, sync, whatever you need to do. When you're done, delete original and rename the temp to the original's name.

      That's an improvement, but it can be made even safer by skipping the delete step. Once the new file is created just rename it on top of the original. The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
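
      A sketch of the rename-on-top approach. It keeps the fsync step, on the assumption (based on what others in this thread report about ext4) that the rename may otherwise reach the disk before the data does. The function name atomic_write and the ".tmp-" prefix are invented for the example.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data so the name refers to the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the
    # target, otherwise the final rename is not a simple atomic rename.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # data on disk before the rename is issued
        os.replace(tmp, path)         # rename over the original, no delete step
    except BaseException:
        try:
            os.unlink(tmp)            # clean up the temp file on failure
        except FileNotFoundError:
            pass
        raise
```

      Note that mkstemp creates the file with restrictive permissions, so a real implementation would probably want to copy the mode of the original file as well.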

      --

      Do you care about the security of your wireless mouse?
    13. Re:Works as expected... by gweihir · · Score: 1

      Really? Much better: Write new file into a temp file, sync, whatever you need to do. When you're done, delete original and rename the temp to the original's name. This way, if you die with half a file, you've got a corrupted temp file and not a corrupted original.

      Well, the difference is one very fast rename operation, that incidentally will likely be journalled. Whether having the new one or the old one is better depends on the application, especially if it is a lot of files affected. With your way you then get some old and some new files on crash in the middle, unless you can sync all these writes (difficult if different applications). With my way, you get a completely valid set of old files. That is not to say my way is better in general, both ways have merit, depending on the situation. Anyways, it was just an example.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    14. Re:Works as expected... by shutdown+-p+now · · Score: 1

      Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System

      Why? I mean, certainly, an OS cannot determine the correct boundaries for atomic operations. But there's no reason why it cannot have explicit transaction control API, where the programmer has a function to mark the beginning of the transaction and the end of it, and all disk operations in between are considered one single atomic operation that's either committed or rolled back. In fact, this isn't hard to do on top of a journalling FS, and Microsoft already did it with NTFS with Vista. TFA describes precisely the kind of problem this feature is designed to solve.

    15. Re:Works as expected... by moonbender · · Score: 2, Interesting

      The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.

      As I understand it, that is EXACTLY what happens. The move/relinking is committed, but the data isn't. If true, a real case of WTF. The relinking should only be executed AFTER the data has been committed to the drive.

      --
      Switch back to Slashdot's D1 system.
    16. Re:Works as expected... by Anonymous Coward · · Score: 0

      Right, people are making BAD assumptions about atomicity. Even with sync write, there is a crash window where the open O_TRUNC blanks the file and the write is still pending in the application.

      There is simply no excuse for defending fragile application code because it fails under a POSIX filesystem.
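
      The window is easy to see even without a crash. A small sketch (the function name is made up) that records the on-disk size right after the truncating open and before the write:

```python
import os

def size_during_rewrite(path, new_data):
    """Rewrite path in place and report its size between open and write."""
    # Mode "wb" opens with O_WRONLY|O_CREAT|O_TRUNC, so the old
    # contents are gone as soon as open() returns.
    with open(path, "wb") as f:
        size_after_open = os.path.getsize(path)
        f.write(new_data)
    return size_after_open
```

      The old contents are already gone at that point, whatever the filesystem does afterwards; a crash between the open and the write leaves an empty file.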

    17. Re:Works as expected... by Anonymous Coward · · Score: 0

      "...but it shouldn't perform the rename until the file has actually been written."

      That's not actually required unless you sync the new file first. The rename might be atomic (always points to new or old file) but the new file isn't required to be on disc, it can still be in cache. And that's what people are seeing here, the mechanism is basically the same if you truncate the existing file. Yes, it might be nice functionally if we could cache the rename along with the write and make it all atomic, but I'm guessing there is a reason that can't be done.

      If an app wants to use tons of small files constantly it needs to sync them if they're important, or keep backups of each file and recover if a crash occurs (ideally it would do this gracefully).

      I think this is a confluence of issues that will continue to become more apparent as we move forward with drive technology. We are realizing that HDD's are pathetically slow, and SSDs are not magic, they require various optimizations and you just *can't* commit at will without serious penalties.

    18. Re:Works as expected... by BZ · · Score: 1

      > The rename system call guarantees that at any point in time the name will refer to either
      > the old or the new file.

      If you read the article, this is precisely false with ext4. Or rather, the rename can be committed to disk before the data writes have been; if you crash in between, you lose.

    19. Re:Works as expected... by Anonymous Coward · · Score: 0

      Well, yeah, skipping the unlink() is certainly a must. However, by POSIX semantics, that still doesn't suffice. The guarantee of a rename() atomically renaming a file is just true with regard to a running system, not with regard to a crashing system - there could very well be two links to the same file afterwards, for example.

      Also, no, having to sync manually is not a design flaw. In general, the filesystem (and underlying layers) is allowed to reorder pretty much any writes with regard to one another, because that is beneficial for performance. If you need a certain ordering for your application, that is your responsibility.

      So, basically, the safe way to do it is this:

      - check for file.commit, if it exists:
          - fdatasync() it
          - unlink() file
          - hardlink file to file.commit
          - open . r/o
          - fdatasync() .
      - unlink file.new (ignoring ENOENT)
      - open file.new for writing
      - write new contents to file.new
      - fdatasync() file.new
      - link file.new to file.commit
      - open . r/o
      - fdatasync() .
      - unlink() file
      - hardlink file to file.commit
      - fdatasync() .
      - unlink file.commit

      All that under the assumption that you can guarantee that you won't have any concurrent accesses - otherwise, you'll need some locking in addition to that (and/or potentially change some of the steps). And yeah, if you can rely on having an ext3 or ext4 with journaling beneath your feet, you can skip some of those steps, as those FSes do provide you with some more guarantees.

    20. Re:Works as expected... by rastos1 · · Score: 1

      For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.

      Why can't there be a mount option for ext4 specifying the commit delay in seconds?

      As the article says, a large number of configuration files are opened and written to as KDE starts up.

      Why does KDE need to write configuration files on startup? I thought that it reads the configuration on startup.

    21. Re:Works as expected... by kasperd · · Score: 1

      If you read the article, this is precisely false with ext4.

      Well. The summary had multiple links, and the pages it pointed to had multiple links to other locations. And I haven't found out which one is supposed to explain exactly what happened.

      I went to read https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45 again, and though the wording there is quite confusing, by reading it again now a bit earlier in the day, I managed to figure out that it describes "workarounds" for two situations.

      1. A file is overwritten by being truncated to length zero and then new data is written.
      2. A new file is created filled with data and renamed over an existing file

      The first of those two is clearly a buggy application, and I'd rather the applications were fixed than the kernel have workarounds to minimize the risk of data loss. That case is one where data loss is to be expected, it could happen with any file system, the window for loss just happens to be larger on ext4.

      The second case is one where a newly created file is renamed on top of an existing file. A pretty normal procedure. And I wrote "workaround" in quotes because the kernel isn't really working around something inherently wrong in the application. I'd rather say one part of ext4 is working around a problem in another part of it.

      I have yet to find an authoritative source telling me which of the two "workarounds" apply to the KDE data loss. I did however find indications that a lot of other applications were affected as well. I can't imagine all of them have the bug of deleting the old version before creating the new one.

      Or rather, the rename can be committed to disk before the data writes have been; if you crash in between, you lose.

      I can understand that if the system crashes and you didn't sync, then what you find on disk after rebooting may not be entirely up to date. If it matches something that logically was on disk two minutes before the crash, that would still be valid. I still think the system should try to minimize this window. In particular if the disk is idle, there is no reason why the difference should be more than one second.

      However it sounds like what you get after reboot is something that didn't exist at any point in time. The name refers to a file of zero length after the reboot, and at no point in time did that name refer to an empty file, so it is not just an old version of some data, it is a corrupt version. Saying that it is conforming to the standard is a lame excuse, I'm pretty sure the standard does not say the system must cause data loss in this case.

      It is possible to make a fix to the file system that handles this case without data loss and without performance hit. If the disk is idle and you don't want to allocate sectors for the data right away to get a better layout, then just commit the data to the journal. I don't know if the journal format already permits committing data for which no location has been allocated yet, but it certainly could be modified to support it. It doesn't cost you any performance to write the data twice in this case because the disk was already idle, and it doesn't give you a worse layout that will give you performance problems later, because the allocation of the final location is still delayed. If the system crashed before the delayed allocation had happened, it would then happen during journal replay. If the system is busy, it is acceptable for data to take a while to make it to disk, as long as data on disk is consistent. That means don't commit to the metadata changes until after the data which had to go before it has been committed to disk. Of course in that case you still have to take care to avoid starvation.

      Saying that it should be fixed in the application is BS. Creating a new file and renaming that should be sufficie

      --

      Do you care about the security of your wireless mouse?
    22. Re:Works as expected... by Anonymous Coward · · Score: 0

      Yes, the sync step is required... you need to sync the metadata too, and that is also delayed unless you fsync some file in that directory, the directory itself, or call full-blown sync (stupid!).

      What the fs will guarantee is that you won't lose the directory, but you might find yourself with an older version of it after the crash, unless you did some sort of fsync operation that caused the metadata to be flushed.

      This is how it has been done for decades, it is how just about every journaling filesystem behaves when it is not in sync mode.

      What CAN be surprising is when file metadata sync is decoupled from the use of O_SYNC in the file itself (thus requiring a fsync on the directory to really be sure the data will be there after a crash). THAT is allowed by the spec too, AFAIK, but it is really not nice.
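      The directory fsync mentioned above can be sketched as follows. This is a minimal illustration, assuming a POSIX system where a directory can be opened read-only and passed to fsync (true on Linux); the helper name fsync_directory and the file name "settings" are made up for the example.

```python
import os
import tempfile

def fsync_directory(dirpath):
    """Ask the kernel to flush the directory's own metadata (its entries)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "settings")   # hypothetical file name

# First make the file's contents durable...
with open(path, "wb") as f:
    f.write(b"key=value\n")
    f.flush()
    os.fsync(f.fileno())

# ...then the directory entry that points at it.
fsync_directory(workdir)
directory_listing = os.listdir(workdir)
```

      Whether the second step is needed in practice depends on the filesystem; the sketch only shows how an application would ask for it.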

    23. Re:Works as expected... by BZ · · Score: 1

      Yep. Sounds like we're on the same page here; fully agreed with your characterization of the two numbered situations.

    24. Re:Works as expected... by Simetrical · · Score: 1

      That's an improvement, but it can be made even safer by skipping the delete step. Once the new file is created just rename it on top of the original. The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.

      That is exactly the problem. People who are saying that the issue is just longer times before sync have not RTFA. The problem is that applications that are just trying to update files can completely delete them. It is not just recent changes that are lost. Quoting Ted Ts'o:

      OK, so let me explain what's going on a bit more explicitly. There are application programmers who are rewriting application files like this:

      1.a) open and read file ~/.kde/foo/bar/baz
      1.b) fd = open("~/.kde/foo/bar/baz", O_WRONLY|O_TRUNC|O_CREAT) --- this truncates the file
      1.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
      1.d) close(fd)

      Slightly more sophisticated application writers will do this:

      2.a) open and read file ~/.kde/foo/bar/baz
      2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
      2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
      2.d) close(fd)
      2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

      ...

      The fact that series (1) and (2) works at all is an accident....

      Is that clear? The file system is not "truncating" files. The application is truncating the files, or is constantly overwriting the files using the rename system call....
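      The truncation in step 1.b is easy to observe directly. A small sketch, using a temporary directory in place of ~/.kde/foo/bar/baz (the path and helper name are stand-ins, not anything from KDE): the file is already empty the moment open() returns, before a single byte of new content has been written.

```python
import os
import tempfile

def size_after_truncating_open(path):
    """Open path the way step 1.b does and report its size before any write."""
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o644)
    try:
        return os.fstat(fd).st_size
    finally:
        os.close(fd)

workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "baz")   # stand-in for the real config file
with open(config_path, "wb") as f:
    f.write(b"old settings\n")

size_before = os.path.getsize(config_path)
size_during = size_after_truncating_open(config_path)
```

      Anything that interrupts the program between that open() and the write reaching disk leaves the empty file behind.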

      In other words, the application (in scenario 2) is copying the file, changing the copy, and renaming only after it writes the changes. But POSIX doesn't guarantee that the filesystem calls will be written to disk in order in the event of a crash. ext4 does not write them to disk in order: it writes the rename first and only then the new file's contents. So if the system crashes in between, the file will be truncated.

      This is much like ext3's behavior in the data=writeback mode; the default mode for ext3 is data=ordered, which ensures that operations like these are properly ordered (at some performance cost). You can emulate this with ext4 by using nodelalloc, but you'll lose most of the performance advantage.

      All of this is worked around in some patches that are queued for 2.6.30:

      http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=3bf3342f394d72ed2ec7e77b5b39e1b50fad8284
      http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=6645f8c3bc3cdaa7de4aaa3d34d40c2e8e5f09ae
      http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=dbc85aa9f11d8c13c15527d43a3def8d7beffdc8

      These basically add a forced fsync() at some point in the process in some common cases.

      As for this not being a filesystem bug . . . in the event of a crash, POSIX makes no guarantees at all unless you've fsync()ed. A filesystem that deletes things at random on crash would be POSIX-compliant. That doesn't mean that it's not buggy.

      I strongly encourage everyone to read the comments in

      --
      MediaWiki developer, Total War Center sysadmin
    25. Re:Works as expected... by spitzak · · Score: 0

      You are thinking in a Windows-centric way.

      The correct way to do it is:

        Write data to a NEW file.
        Close file
        Rename to the old filename (an atomic operation)

      After a crash the program should see either the old or new file. There is no need to worry about finding a backup file.

      EXT4 breaks the rename so that both files are lost.

      Yea, sticking a sync in there "fixes" the symptom but only by slowing everything down. Please try to understand that the old file (ie the "backup" in your solution) is OK!!!

    26. Re:Works as expected... by spitzak · · Score: 1

      You added a lot of unnecessary steps that are needed on Windows but not POSIX. Here are the only steps needed:

      - open file.new for writing
      - write new contents to file.new
      - fdatasync() file.new
      - link file.new to file
      - unlink file.new

      Big nasty thing is that if you want to be perfectly safe, "file.new" has to somehow be a unique name, just in case another program tries to update the file at the same time. It would be nice if the fs or operating system supported this directly.
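      One way to get the unique name is to let the library pick it. The sketch below is an assumption-laden illustration, not the steps above verbatim: it uses tempfile.mkstemp() in the target's own directory, fsync() rather than fdatasync(), and rename() rather than the link/unlink pair, because link() refuses to overwrite a name that already exists. The function name is hypothetical.

```python
import os
import tempfile

def replace_file_unique_temp(path, data):
    """Write data to a uniquely named temp file next to path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp picks a name no other process is using and creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", suffix=".new", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.rename(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "file")
with open(target, "wb") as f:
    f.write(b"old\n")

replace_file_unique_temp(target, b"new\n")
with open(target, "rb") as f:
    final_contents = f.read()
remaining_names = os.listdir(workdir)
```

      The temp file has to live in the same directory (or at least the same filesystem) as the target, otherwise the rename cannot be atomic.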

    27. Re:Works as expected... by spitzak · · Score: 1

      Atomic rename can be used for this for most uses, including the ones causing the problems.

      NTFS style transactions are useful if there are either many files that have to be in sync, or where several programs are writing portions of the same file. However, both these cases are quite rare compared to the "save the new version" API that atomic rename allows you to do.

    28. Re:Works as expected... by shutdown+-p+now · · Score: 1

      NTFS style transactions are useful if there are either many files that have to be in sync, or where several programs are writing portions of the same file.

      As far as I know, it doesn't allow for the latter scenario - when you open a file using CreateFileTransacted, it's locked for writing for all other transactions.

    29. Re:Works as expected... by Anonymous Coward · · Score: 0

      The sync step is needed with XFS because, to my knowledge, XFS is the only modern journaled filesystem that does not write in ordered fashion. Ordered means that metadata is only written after the data is written to disk. That's why renaming the file guarantees that the file is only renamed after the data is flushed to disk. With XFS, however, or even ext3 with some option or other that turns off ordered writes, there is no guarantee without fsync. That's why applications should use fsync rather than rely on file system behaviour.

      This problem with KDE overwriting configuration files and being left with nothing in case of crash dates back several years, nothing new to see here, just finally also affected ext? users more often.

    30. Re:Works as expected... by spitzak · · Score: 1

      That's acceptable I think. Single file update by many programs is not really used anymore except by databases, and they can implement their own locking.

      The ability to lock a whole lot of file operations into a single block is a great idea and it would be nice to see it elsewhere.

      On windows if it can be used to group a delete and rename together then we can finally have atomic rename, which would remove one of the big obstacles to making software port between Windows and Unix. Do you know if this works? I can imagine problems like unacceptable overhead, or that errors break the atomicity (and thus the need to detect if the file exists, which would make it non-atomic).

    31. Re:Works as expected... by shutdown+-p+now · · Score: 1

      On windows if it can be used to group a delete and rename together then we can finally have atomic rename, which would remove one of the big obstacles to making software port between Windows and Unix. Do you know if this works? I can imagine problems like unacceptable overhead, or that errors break the atomicity (and thus the need to detect if the file exists, which would make it non-atomic).

      Yes, it does. There are MoveFileTransacted and DeleteFileTransacted APIs for that purpose (effectively all Win32 APIs that deal with files have gotten transacted counterparts). I don't know about overhead, but I'm pretty sure that errors do not leak - you handle them the way you normally would, and the rest of the system just sees the snapshot of how things were when you started the transaction.

    32. Re:Works as expected... by spitzak · · Score: 1

      I'm a bit confused about the api. There must also be some call to indicate the break between the transactions, right? What are they?

      What I want to make sure is if I *ignore* the error from the delete, that the rename is still done. I don't want it to say "there was an error in this transacted set and therefore I should roll it all back", I want it to complete.

      I'm guessing you can make non-transacted calls in the middle of transacted ones, is this useful? It seems like a reasonable api would be to add start/stop calls but then reuse the existing calls, but it would prevent this mixing.

    33. Re:Works as expected... by shutdown+-p+now · · Score: 1

      I'm a bit confused about the api. There must also be some call to indicate the break between the transactions, right? What are they?

      CreateTransaction creates a new transaction and returns a handle to that, which all the new ...Transacted functions then take as an extra argument. CommitTransaction and RollbackTransaction also take the handle, and do what their name says.

      What I want to make sure is if I *ignore* the error from the delete, that the rename is still done. I don't want it to say "there was an error in this transacted set and therefore I should roll it all back", I want it to complete.

      There's no implicit rollback on error there (short of when the process kills itself and the OS has to clean up after it, of course). If DeleteFileTransacted fails, you just get a failure indicator return value from it, and an error code from GetLastError, as usual - but the transaction handle remains open and active.

      I'm guessing you can make non-transacted calls in the middle of transacted ones, is this useful?

      You definitely can, just as you can have several transactions running at once in a single process or thread, since they are all identified explicitly by handles. I guess it can be useful if you want to see the original state of the FS as it was before your changes, though I can't think of any specific use case for that.

    34. Re:Works as expected... by spitzak · · Score: 1

      That all makes sense. I should have looked at the msdn pages and seen that there was a transaction argument to the calls.

      You are right that a crash had better act like rollback for all the unfinished transactions! I would be reasonably confident they got that right.

    35. Re:Works as expected... by k8to · · Score: 1

      You are promoting the pattern which produces the problem.

      Your steps

      1 - open filename.temp for writing
      2 - write into filename.temp
      3 - rename filename.temp to filename

      The problem with this is that the operating system has never been asked to push the data into the actual filename.temp, just to rename it over filename. This can result in a zero byte filename after a crash, as reported in the article.

      Correct method:

      1 - open filename.temp for writing
      2 - write into filename.temp
      3 - fsync filename.temp and check return value for errors
      4 - close filename.temp
      5 - rename filename.temp to filename
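      The five steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name replace_file_durably and the ".temp" suffix are made up for the example, and a fixed temp name like this is not safe if two programs update the same file at once.

```python
import os
import tempfile

def replace_file_durably(path, data):
    """Write data to path + '.temp', fsync it, close it, then rename over path."""
    tmp_path = path + ".temp"
    # 1 - open filename.temp for writing
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # 2 - write into filename.temp (os.write may write fewer bytes than asked)
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # 3 - fsync; a failure raises OSError, so the rename below is skipped
        os.fsync(fd)
    finally:
        # 4 - close filename.temp
        os.close(fd)
    # 5 - rename filename.temp to filename
    os.rename(tmp_path, path)

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "filename")
with open(target, "wb") as f:
    f.write(b"old settings\n")

replace_file_durably(target, b"new settings\n")
with open(target, "rb") as f:
    contents_after = f.read()
temp_left_behind = os.path.exists(target + ".temp")
```

      Because the rename only runs after fsync() has returned without error, a crash should leave either the complete old file or the complete new one under the final name.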

      "BUT BUT BUT," everyone says "That's crazy, overriding a filename with an unwritten file who would want to do that?"

      The fact is, sometimes that is what you would want. Not all renames are in the case where you are attempting to do an atomic update of file state. The posix system calls don't express this intent, so you have to do it yourself with use of fsync.

      As others point out, there's a price here. fsync() will spin up the disk when it isn't necessary. Sometimes all you need is that you have the old data or the new data, and you don't care which, so fsync is heavier than you want. True, but your choices are heavier than you want, or risk no data at all.

      Basically, many people have been assuming POSIX guarantees something it doesn't. Maybe we can get a new system call out of the hoopla that allows people to request this behavior when it is needed. In the meantime, fix the damn applications.

      --
      -josh
  8. Firefox fix by Anonymous Coward · · Score: 0

    No problem, just run Firefox and it'll make sure your disks are synch'd all the time ;)

  9. my experience by Anonymous Coward · · Score: 0

    not with ext4, but with xfs. last month, I created an xfs logical volume and exported it with nfs (with fsync). I chose xfs because this was for large files (videos). After copying a couple files, the xfs volume developed errors and was unrecoverable. I've never seen a file system so fucked up so easily without hardware problems.

    1. Re:my experience by metamatic · · Score: 1

      I've never seen a file system so fucked up so easily without hardware problems.

      Yet another area in which the Amiga was first, but people nowadays don't know it.

      --
      GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
  10. Programmers at Microsoft suck by Xoc-S · · Score: 1, Funny

    If Microsoft hadn't written this crappy code, and they'd used Linux instead, this wouldn't have happened.

    1. Re:Programmers at Microsoft suck by spitzak · · Score: 1

      Considering that Microsoft has so far failed to make an atomic rename() call, I don't think they should feel very proud of their results. This bug is that EXT4 breaks a feature that Microsoft has failed to implement in the first place!

  11. Classic tradeoff by Otterley · · Score: 5, Insightful

    It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.

    Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.
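    Opening with O_SYNC looks roughly like this. A minimal sketch, assuming a Unix system where os.O_SYNC is available; the helper name and file name are invented for the example. With this flag each write() is supposed to return only once the data has been handed to the storage device, which is exactly why it is slow.

```python
import os
import tempfile

def write_synchronously(path, data):
    """Write data through a descriptor opened with O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may be a partial write
            view = view[written:]
    finally:
        os.close(fd)

workdir = tempfile.mkdtemp()
sync_path = os.path.join(workdir, "important.dat")   # hypothetical file name
write_synchronously(sync_path, b"must survive a crash\n")
with open(sync_path, "rb") as f:
    sync_contents = f.read()
```

    Note that this still truncates the old file first, so it addresses durability of the new data, not the old-or-new guarantee that the rename approach is after.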

    Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.

    1. Re:Classic tradeoff by imsabbel · · Score: 3, Informative

      It's even WORSE than just being asynchronous:

      EXT4 reproducibly delays write ops, but commits journal updates concerning this write.

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    2. Re:Classic tradeoff by slashdotmsiriv · · Score: 2, Interesting

      Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.

      You also have to disable HDD caching, e.g., using
        hdparm -W0 /dev/hda1

    3. Re:Classic tradeoff by gweihir · · Score: 2, Insightful


      Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.

      You also have to disable HDD caching, e.g., using
          hdparm -W0 /dev/hda1

      Well, yes, but unless you have an extreme write pattern, the disk will not take long to flush to platter. And this will only result in data loss on power failure. If that is really a concern, get a UPS.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    4. Re:Classic tradeoff by legirons · · Score: 2, Funny

      It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.

      Backups redirected to /dev/null, run much faster... ;)

    5. Re:Classic tradeoff by Anonymous Coward · · Score: 0

      That's the whole problem - it has become trendy in filesystem design to totally de-couple the meta data and file content data. Ted said this is unsafe:

      open(O_CREAT)
      write()
      close()
      rename(A,B)

      As it can result in a 0 length B on crash because the FS journaled all the meta data operations but tossed out the data.

      It's silly and hiding behind 'POSIX allows it' is not helping apps designs.

      A useful design in the case above would be to defer journaling the rename until the data is committed, but for some reason FS designers seem to view that as not worthwhile. (Too hard I expect..)

      Fsync isn't what the apps want here, what they want is transactional consistency where operations like rename serve as ordering barriers to synchronize meta-data and data.

      Ted's later comment that new patches are available that make rename() flush the data when it overwrites an existing file achieve this ordering in some cases, but at a performance cost of course.

      This applies to the whole storage stack, the useful behavior is that if the rename is observed after a crash then _all the data writes_ must also be preserved.

  12. Re:Exactly by TerranFury · · Score: 5, Insightful

    Meh, this is crap that happens only when the system crashes, and is pretty much unavoidable if you're doing a lot of caching in memory -- which, coincidentally, is what you need to do to maximize performance. This doesn't sound like the filesystem's "fault" or the application's "fault;" it's just the way things are. Everybody knows that if you don't cleanly unmount, most bets are off.

  13. Theory doesn't matter; practice does by microbee · · Score: 3, Interesting

    So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.

    But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.

    EXT4 aims to replace EXT3 and become the next gen de-facto filesystem on Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.

    It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.

    1. Re:Theory doesn't matter; practice does by MichaelSmith · · Score: 1

      So, POSIX never guarantees your data is safe unless you do fsync().

      I always had the impression that closing a file descriptor does an fsync(). Surely if KDE is writing multiple small files it will be closing each file in turn?

    2. Re:Theory doesn't matter; practice does by caerwyn · · Score: 5, Insightful

      This is the attitude that has the web stuck with IE.

      There's a standard out there called POSIX. It's just like an HTML or CSS standard. If everyone pays attention to it, everything works better. If you fail to pay attention to it for your bit (writing files or writing web pages), it's not *my* fault if my conforming implementation (implementing the writing or the rendering) doesn't magically fix your bugs.

      --
      The ringing of the division bell has begun... -PF
    3. Re:Theory doesn't matter; practice does by gweihir · · Score: 1

      I always had the impression that closing a file descriptor does an fsync(). Surely if KDE is writing multiple small files it will be closing each file in turn?

      No, a close is not enough. close() just flushes all application buffers to the filesystem and frees the filehandle. If you want the data on disk, you have to sync.

      From "man close":
                    A successful close does not guarantee that the data has been
                    successfully saved to disk, as the kernel defers writes.
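
      To make the difference concrete, here is a small sketch of saving a file so that the data is pushed to disk before close. The function name save_and_sync and the file name are just for illustration. There are two separate layers: flush() moves data from the program's own buffer to the kernel, and fsync() asks the kernel to write it to storage.

```python
import os
import tempfile

def save_and_sync(path, text):
    """Write text to path and fsync it before the file is closed."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        f.flush()                # application buffer -> kernel
        os.fsync(f.fileno())     # kernel page cache -> storage device
    # Leaving the with-block calls close(), which on its own promises neither step.

workdir = tempfile.mkdtemp()
rc_path = os.path.join(workdir, "kderc")   # hypothetical config file name
save_and_sync(rc_path, "[General]\nsaved=true\n")
with open(rc_path, encoding="utf-8") as f:
    saved_text = f.read()
```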

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    4. Re:Theory doesn't matter; practice does by somenickname · · Score: 2, Insightful

      "The machine crashed" isn't a common situation. In fact, it's a very, very rare situation.

    5. Re:Theory doesn't matter; practice does by Anonymous Coward · · Score: 0

      fsync() might guarantee your file is written to disk, but it won't guarantee the directory entry has been updated so you can find it again later.

    6. Re:Theory doesn't matter; practice does by Anonymous Coward · · Score: 0

      You can claim that your implementation is POSIX conformant, but that doesn't stop anyone from claiming that your implementation sucks.

      Sure, I understand the whole issue. Why delayed writes are necessary. Why data loss can't be avoided. But if ext3 behaves better in these cases than ext4 then that is a valid reason to complain.

      With this issue we have two competing features: reliability vs. performance.
      And here we have a case where performance is default and reliability (sync) is optional. Imho something should be reliable by default and performance should be a feature you have to consider carefully.

    7. Re:Theory doesn't matter; practice does by microbee · · Score: 3, Insightful

      Apparently, you don't know real life.

      Does POSIX tell you what happens if your OS crashes? That's right, it says "undefined". Oops, sorry, it's too hard a problem and we'll just leave it to you OS implementers.

      Asking everyone to use fsync() to ensure their data not being lost is insane. Nobody wants to pay that kind of performance penalty unless the data is very critical.

      Normal applications have a reasonable expectation that the OS doesn't crash, or doesn't crash too often for this to be a big problem. However, shit happens, and people scream loud if their data is lost BEYOND reasonable expectations.

      Forget POSIX. It's irrelevant in the real world. It's exactly this pragmatic attitude that brought Linux to its current state.

    8. Re:Theory doesn't matter; practice does by microbee · · Score: 1

      Yet the whole point of journaling filesystem is to protect against data loss.

      Seriously, folks, the number one priority of a filesystem is to protect data, even in the event of an unexpected crash. What's so hard to understand? CRASH IS NOT A RARE SCENARIO FOR FILESYSTEM DESIGNERS.

    9. Re:Theory doesn't matter; practice does by Anonymous Coward · · Score: 0

      That's ridiculous, are you actually saying the file system should automatically sync after every operation just because the KDE and other dev teams don't want to adhere to a standard that's been around for ages?? Do you realize how much slower that is, and how much more stress that puts on the system? If your application needs to guarantee data is instantly written, sync it.

      This is like complaining that UDP doesn't guarantee packets reach the other end...

      If you are using applications written in a crappy manner and can't adhere to the spec, and you're REALLY worried about it use a file system that's synchronous.

      This is a bug in applications that assume instant writing to disk, hell, even ext3 didn't instantly write to disk. I won't comment on other things to do with ext4, but in this case there's no issue, this function is working as intended.

    10. Re:Theory doesn't matter; practice does by somenickname · · Score: 1

      Yes but, the OP is about Ubuntu 9.04. It's an alpha version. It will crash. It will lose your data. It will violate your wife. That is the nature of an alpha release. I understand that we are fundamentally talking about ext4 but, the badness of ext4 is only being brought to light by using an alpha level OS.

    11. Re:Theory doesn't matter; practice does by Watson+Ladd · · Score: 1

      Dijkstra does not approve.

      --
      Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD
    12. Re:Theory doesn't matter; practice does by Anonymous Coward · · Score: 0

      Are you aware that this attitude is exactly why windows is the way it is? Popular app X doesn't read/ignores the API documentation so MS makes a hacky workaround in their next version to keep all the customers happy, at the expense of an overall fast, stable and reliable system.

      The bug is clearly in the app developers' court, the behaviour is as it has been specified for many years. If those files were important to kde it would have synced them.

    13. Re:Theory doesn't matter; practice does by gweihir · · Score: 1

      Yet the whole point of journaling filesystem is to protect against data loss.

      Wrong. That is for databases, but not for filesystems. The journal on a filesystem serves to protect against filesystem corruption on crash. That is protection against loss of data written earlier. Protecting against loss of data still buffered is not the task of a journalling filesystem, no matter how much you may wish it.

      If you need data to be on disk, then you need to fsync.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    14. Re:Theory doesn't matter; practice does by gweihir · · Score: 1

      Asking everyone to use fsync() to ensure their data not being lost is insane.

      There really is no way around that. That is why any sensible design will minimize critical data, keep backups and minimize writes to critical data.

      You may wish it were different, but ignoring reality will get you hurt.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    15. Re:Theory doesn't matter; practice does by caerwyn · · Score: 3, Insightful

      Apparently, you don't know how to *deal* with real life.

      POSIX *does* tell you what happens if your OS crashes. It says "as an application developer, you cannot rely on things in this instance." It also provides mechanisms for successfully dealing with this scenario.

      As for fsync() being a performance issue, you can't have your cake and eat it too. If you don't want to pay a performance penalty, you can lose data. Ext4 simply only imparts that penalty to those applications that say they need it, and thereby gives a performance boost to others who are, due to their code, effectively saying "I don't particularly care about this data" - or more specifically, "I can accept a loss risk with this data."

      Normal applications have a reasonable expectation that the OS doesn't crash, yes. And usually it doesn't. Out of all the installs out there... how often is this happening? Not very. They've made a performance-reliability tradeoff, and as with any risk... sometimes it's the bad outcome that occurs. If they don't want that to happen, they need to take steps to reduce that risk- and the correct way to do that has always been available in the API.

      As for forgetting POSIX... it's the basis of all unix cross-platform code. It's what allows code to run on linux, BSD, Solaris, MacOS X, embedded platforms, etc, without (mostly) caring which one they're on. It's *highly* relevant to the real world because it's the API that most programs not written for windows are written to. Pull up a man page for system calls and you'll see the POSIX standard referenced- that's where they all came from.

      Saying "Forget POSIX. It's irrelevant in the real world." is like people saying a few years ago "Forget CSS standards. It's irrelevant in the real world." And you know what? That's the attitude that's dying out in the web as everything moves toward standards compliance. So it is in this case with the filesystem.

      --
      The ringing of the division bell has begun... -PF
    16. Re:Theory doesn't matter; practice does by MichaelSmith · · Score: 1

      Once the application calls close() the data is out of its hands. The application shouldn't be required to take another step to ensure that the storage systems have correctly handled the data. Doing that would make implementations too hardware specific.

    17. Re:Theory doesn't matter; practice does by Anonymous Coward · · Score: 0

      This guy is a Micro$oft shill - M$ would love for everyone to "forget POSIX". If they can get everyone to write software that is not standards compliant, it will be no better than the garbage M$ writes, and then their advertising budget will be the only deciding factor in selecting a system to use. Ignore the troll...

    18. Re:Theory doesn't matter; practice does by ion.simon.c · · Score: 1

      As for forgetting POSIX... it's the basis of all unix cross-platform code. It's what allows code to run on linux, BSD, Solaris, MacOS X, embedded platforms, etc...

      _Awww! My Hero!_ :D

      ...it's the API that most programs not written for windows are written to.

      Whenever I write Windows code for work, I try to adhere as closely to POSIX as the OS and project specification allow. Srsly, doing this saves *sooo* much trouble in the long run... even if you never leave the Windows world!

    19. Re:Theory doesn't matter; practice does by mkcmkc · · Score: 1

      Once the application calls close() the data is out of its hands. The application shouldn't be required to take another step to ensure that the storage systems have correctly handled the data.

      I think calling close() should give me a piece of bacon, but unfortunately POSIX doesn't specify that either.

      As for "correct handling", though, it's far from clear that the correct thing to do on close() is to immediately flush the data to disk. There are a lot of times when this would be inappropriate.

      --
      "Not an actor, but he plays one on TV."
    20. Re:Theory doesn't matter; practice does by shutdown+-p+now · · Score: 1

      Asking everyone to use fsync() to ensure their data not being lost is insane. Nobody wants to pay that kind of performance penalty unless the data is very critical.

      There's nothing insane about it, and the performance penalty is only there for "lots of tiny files" case, where you have to fsync each and every one. Add a new function to sync a lot of handles in one go, and the penalty disappears.
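
      No such "sync many handles" call exists in POSIX as far as this thread establishes, so the following is only a userspace approximation of the idea, with a made-up helper name: write every file first, keep the descriptors open, and fsync them in a second pass. Whether that is actually cheaper than syncing after each write depends on the filesystem, and the sketch makes no claim either way.

```python
import os
import tempfile

def write_many_then_sync(directory, files):
    """Write every file first, then fsync all descriptors in a second pass."""
    fds = []
    try:
        for name, data in files.items():
            fd = os.open(os.path.join(directory, name),
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append(fd)
            view = memoryview(data)
            while view:
                view = view[os.write(fd, view):]
        # Second pass: by now all writes have been queued with the kernel.
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)

workdir = tempfile.mkdtemp()
tiny_files = {"a.conf": b"a=1\n", "b.conf": b"b=2\n", "c.conf": b"c=3\n"}
write_many_then_sync(workdir, tiny_files)
written_names = sorted(os.listdir(workdir))
```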

    21. Re:Theory doesn't matter; practice does by Random+Walk · · Score: 1

      There is no language called POSIX. It's an OS standard, and software authors writing in high-level languages should not need to care about it. There are languages like C, C++, Java, etc. If you're writing Java, you should pay attention to the Java standard. If you're writing C, you should pay attention to the C standard (which doesn't include OS specific things like fsync()).

      And if you adhere to the standard of the language you are using, you should have some reasonable expectation that things will work, instead of dying a horrible and gruesome death. A conforming HTML page should not fail. A conforming Java/C/C++/whatever program should not fail either, but the problem is, it will. This is like asking HTML authors to understand the low-level details of the browser rendering engine in order to write working HTML pages.

    22. Re:Theory doesn't matter; practice does by Hatta · · Score: 1

      Let's not forget POSIX, let's improve it. This is a bug in the POSIX spec. It should be fixed.

      --
      Give me Classic Slashdot or give me death!
    23. Re:Theory doesn't matter; practice does by skeeto · · Score: 1

      [...] It will violate your wife. That is the nature of an alpha release.

      And the nature of an alpha male.

    24. Re:Theory doesn't matter; practice does by spitzak · · Score: 1

      closing a file descriptor does an fsync()

      This has been false for decades, on every operating system in common use, not just Linux.

    25. Re:Theory doesn't matter; practice does by caerwyn · · Score: 1

      What does "language" have to do with it? Software authors have to pay attention to many standards beyond that of the language they're working in. If they're working with XML, they should pay attention to the XML standards. If they're interacting with an operating system, they need to pay attention to the standards that the OS is written to- in this case POSIX.

      The application authors don't need to know *how* the operating system implements the standard; they just need to know what they can and can't expect from those standardized functions.

      --
      The ringing of the division bell has begun... -PF
    26. Re:Theory doesn't matter; practice does by caerwyn · · Score: 1

      I'll agree wholeheartedly with that. In the meantime, though, people shouldn't ignore it simply because it has some flaws- time has shown that the advantages of POSIX far outweigh any such existing flaws.

      --
      The ringing of the division bell has begun... -PF
    27. Re:Theory doesn't matter; practice does by Anonymous Coward · · Score: 0

      The POSIX standard specifies some behaviour. It is not a complete specification of what must be done. In the areas not specified by the POSIX standard there are many options for behaviour - some of these result in more robust systems and others in less robust systems.

      The POSIX standard does not say that file systems must be written in the most unreliable way (from a total systems perspective) that does not violate the standard.

      I don't write most of the applications I use but I do need to be able to use them reliably. POSIX does not preclude this but much of what is being advocated for ext4 does.

      It will be good to improve applications. In the meantime, I need a system that doesn't wipe out data every time some fault causes a system halt.

      The more serious problem is that operations are reordered between the application and the disk. POSIX does not require that the order of operations be changed. What is needed is an option to ensure operations happen in a specified (robust) order without forcing everything to be flushed to disk after every file operation.

    28. Re:Theory doesn't matter; practice does by Trogre · · Score: 1

      ... and if such specs are found to be broken, they need to be either fixed or routed around.

      (sorry, I forgot - All Hail The Spec)

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
  14. Excuses are false. This is a severe flaw. by rpp3po · · Score: 3, Interesting

    There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
    This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
    ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!

    1. Re:Excuses are false. This is a severe flaw. by Anonymous Coward · · Score: 2, Informative

      > Delayed writes should lose at most any data between commit and actual write to disk.

      And that's exactly what ext4 does.

      Application decides to update some file:
      1) Reads the file
      2) Modifies the buffer as needed
      3) Truncates the file
      4) Writes the buffer to the file

      Now, if the filesystem commit happens right between 3 and 4, the truncation hits the disk, but the new content does not (yet). If a crash happens before the next commit, all that remains is the truncated file.
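
      A minimal C sketch of steps 3 and 4 above, to make the crash window concrete. The helper name rewrite_in_place_unsafe is made up for illustration; it is not taken from KDE or any other application, and it deliberately omits fsync() to mirror the pattern being discussed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper showing the risky pattern: truncate in place, then write.
 * Returns 0 on success, -1 on error. */
int rewrite_in_place_unsafe(const char *path, const char *data, size_t len)
{
    /* Step 3: O_TRUNC discards the old contents as soon as open() returns. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Crash window: if the truncation is committed and the system dies before
     * the data below reaches the disk, an empty file is what survives. */

    /* Step 4: hand the new contents to the kernel (no fsync, so no durability). */
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }
    return close(fd);
}
```

      Nothing in this function tells the filesystem that the truncate and the write belong together, which is why the outcome depends on when the commit happens.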

    2. Re:Excuses are false. This is a severe flaw. by ivoras · · Score: 1

      No, ZFS cannot do it also. I've checked :)

      --
      -- Sig down
    3. Re:Excuses are false. This is a severe flaw. by Anonymous Coward · · Score: 4, Informative

      Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).

      You seem to misunderstand that's *exactly* what is happening.

      KDE is *DELETING* all of its config files, then writing them back out again in two operations.

      Three states now exist, the 'old old' state, where the original file existed, the 'old' state, where it is empty, and the 'new' state where it is full again.

      The problem is getting caught between step #2 and step #3, which on ext3 was mostly mitigated by the write delay being only 5 seconds.

      KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.

    4. Re:Excuses are false. This is a severe flaw. by rpp3po · · Score: 2, Insightful

      That's not true. KDE is not "*DELETING*" any of its files. It's just opening them with the O_TRUNC flag (expressing an intent to overwrite their contents). That's perfectly safe on a copy-on-write filesystem (such as ZFS) but not on ext4. So calling all "modern" filesystems at risk is pure ignorance. Ext4 could delay content deletion of open files until write time and write both within a single transaction.

    5. Re:Excuses are false. This is a severe flaw. by fishbowl · · Score: 1

      >Now, if the filesystem commit happens right between 3 and 4, the truncation hits the disk, but the new content does not (yet).

      But it does write to the journal to say that it did.
      This isn't a subtle thing, but people are missing something fundamental about the nature of the bug.

      --
      -fb Everything not expressly forbidden is now mandatory.
    6. Re:Excuses are false. This is a severe flaw. by macshit · · Score: 2, Informative

      ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!

      I read the FA, and it actually really does look like the applications are simply using stupidly risky practices:

      These applications are truncating the file before writing (i.e., opening with O_TRUNC), and then assuming that the truncation and any following write are atomic. That's obviously not true -- what happens if your system is very busy (not surprising in the startup flurry which is apparently where this stuff happens), the process doesn't get scheduled for a while after the truncate (but before the write), and the system happens to crash in that interval?

      I'm as lazy as they get, but even I know enough not to do that kind of crap...

      There's probably some way the FS could finesse this issue -- e.g., don't actually schedule truncation until you see the first write or close -- but it would be a workaround for buggy applications, not a FS bugfix.

      --
      We live, as we dream -- alone....
    7. Re:Excuses are false. This is a severe flaw. by rpp3po · · Score: 1

      There's probably some way the FS could finesse this issue -- e.g., don't actually schedule truncation until you see the first write or close -- but it would be a workaround for buggy applications, not a FS bugfix.

      There's no benefit of NOT delaying deletion on disk until actual writes of new content. It's not too much to expect from a filesystem to behave reasonably.

    8. Re:Excuses are false. This is a severe flaw. by blueg3 · · Score: 1

      There's probably some way the FS could finesse this issue -- e.g., don't actually schedule truncation until you see the first write or close -- but it would be a workaround for buggy applications, not a FS bugfix.

      The general solution is to provide a guarantee of atomicity for arbitrary sequential collections of filesystem operations. So, if I do operations A, B, and C, the filesystem is in a state where either none or all of them are done. POSIX does *not* provide that at all. A well-done POSIX filesystem will guarantee atomicity at the level of individual filesystem calls, but not groups. (A filesystem more advanced than POSIX requires certainly could provide this feature.)

      Incidentally, what he recommends is using a database. What is a database? Oh, wait, it's very much like a filesystem, but with more features. What's one of these features? Oh, it's grouping multiple operations into a single atomic operation!

    9. Re:Excuses are false. This is a severe flaw. by cratermoon · · Score: 1

      Security. Consider the following scenario

      1. Super-secure process opens private.txt
      2. Super-secure process truncates private.txt
      3. Super-secure process closes the file.
      4. O/S re-allocates those disk blocks just freed by the truncate.
      5. Nosy process opens a new file using the recently-reallocated blocks.
      6. Nosy process reads through the undeleted data left by Super-secure process and sends them over a network connection to someplace bad.
      7. Nosy process writes some random noise to the blocks.
      8. O/S deletes the data on disk and then writes the data supplied by Nosy.

      See the problem? See why it's good to delete on disk at the time of truncation? Even if you include a step between 2 and 3 where Super-secure process writes back, what happens if the system crashes right after the truncate and before the write? Yep -- the blocks of private.txt are out there, on the disk, for anyone to read.

    10. Re:Excuses are false. This is a severe flaw. by rpp3po · · Score: 1

      Security. Consider the following scenario

      1. Super-secure process opens private.txt 2. Super-secure process truncates private.txt 3. Super-secure process closes the file. 4. O/S re-allocates those disk blocks just freed by the truncate. 5. Nosy process opens a new file using the recently-reallocated blocks. 6. Nosy process reads through the undeleted data left by Super-secure process and sends them over a network connection to someplace bad. 7. Nosy process writes some random noise to the blocks. 8. O/S deletes the data on disk and then writes the data supplied by Nosy.

      Ext4 does not make any guarantees about the erasure of file contents on disk. Even truncation as ext4 is doing it right now, doesn't actually overwrite truncated blocks with zeroes. So your whole point doesn't make sense at all.

    11. Re:Excuses are false. This is a severe flaw. by Tadu · · Score: 5, Informative

      KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.

      Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.
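
      A minimal sketch of the write-then-rename pattern described above, with an fsync() added before the rename, which is what other commenters in this thread say portable code needs. The helper name replace_file and the ".tmp" suffix are assumptions made for illustration; this is not KDE's actual code, and it skips details such as retrying on EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `data`.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    /* The scratch file sits next to the target so rename() stays on one filesystem. */
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() may be partial, so loop until everything is handed to the kernel. */
    size_t off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        off += (size_t)w;
    }

    /* Ask for the new contents to reach stable storage before the rename. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* Swap the name: readers see either the old file or the new one. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

      Without the fsync() call, the rename can become durable before the data does, which matches the "new file except for its data" outcome.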

    12. Re:Excuses are false. This is a severe flaw. by David+Gerard · · Score: 1

      ZFS has other exciting problems. Like the bug where if your file system gets too full, ZFS will start using 70-80% of CPU to try to allocate the blocks absolutely perfectly rather than just getting on with allocating the damn things.

      (Sun have acknowledged the bug. "Yeah, we'll have a fix next update, six months. Probably." Their workaround in the meantime? "Keep your disks below 70% full." Yeah, that's why we bought huge disks and believed your lies about the brilliance of ZFS. We are not best pleased.)

      All hardware sucks. All software sucks.

      --
      http://rocknerd.co.uk
    13. Re:Excuses are false. This is a severe flaw. by Qzukk · · Score: 1

      or rolls back in case of a crash

      Data loss!!1!1!

      (even their content before the write).

      That's because the application asked the filesystem to truncate the old file before it wrote the new file. The system crashed after the filesystem truncated the old file and before it wrote the new file.

      To really use ZFS's transaction stuff requires that the application developer do special-case magic to hook into the transaction code (and developers think a cross-platform, cross-filesystem fsync call is hard?). Otherwise...

      Each TXG is 5 sec long (in normal cases unless some operation forcefully closed it). So, it is quite possible that the 2 syscalls can end up in the same TXG. But, is not guaranteed.

      http://markmail.org/message/lbbgxu4huzczwh6g
      So, if your "truncate this file" ended up in transaction 1 and your "write this data" ended up in transaction 2, and the system crashes and rolls back transaction 2, you've still got an empty file.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    14. Re:Excuses are false. This is a severe flaw. by shutdown+-p+now · · Score: 1

      But it does write to the journal to say that it did.
      This isn't a subtle thing, but people are missing something fundamental about the nature of the bug.

      Can you explain how this makes any difference in the case GP has outlined? I'm genuinely interested because I can't find any flaw in his argument. So long as the programmer cannot make a bunch of file ops atomic for the purposes of commit/rollback - essentially, an FS transaction - I can't see how the problem can be fully dealt with.

    15. Re:Excuses are false. This is a severe flaw. by shutdown+-p+now · · Score: 1

      Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data.

      Thank you; finally, a concise explanation of the nature of the problem! So, in effect, in ext4 they keep data and metadata out of sync, which can show in the edge cases such as TFA?

      Anyway, please mod parent up. For all the pointless discussions about how one should call fsync() in the comments to the story (including some of my own), this is one of the few posts that are actually on-topic...

    16. Re:Excuses are false. This is a severe flaw. by Anonymous Coward · · Score: 0

      It's a shame that Unix doesn't have a registry like Windows, where this sort of thing doesn't happen.

      dom

    17. Re:Excuses are false. This is a severe flaw. by Anonymous Coward · · Score: 0

      Hooray for being deliberately obtuse!

    18. Re:Excuses are false. This is a severe flaw. by Anonymous Coward · · Score: 0

      Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.

      rename() guarantees that applications reading the name see the new file; it does not guarantee that the file has been written to disk.

      This is same as write() and close() then open() and read().

    19. Re:Excuses are false. This is a severe flaw. by k8to · · Score: 1

      This problem is not specific to ext4. Portable code (such as KDE aims to be) *must* call fsync to employ this pattern correctly and safely.

      The application has a bug. The end.

      ext4 is being changed to try to minimize the problem, but it isn't the one with the bug.

      --
      -josh
  15. Re:Exactly by gweihir · · Score: 5, Insightful

    The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.

    True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.

    What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  16. I use ntfs by Anonymous Coward · · Score: 0

    and I don't know what's going on behind the curtain, nor do I care. I can't recall losing any data in such a manner since, well, ever. even given the fat32 and fat16 days. there was that one time I managed to destroy someone's data with doublespace... anyways, the important thing is that I had an onion tied to my belt, which was the style at the time...

  17. Re:Exactly by Anonymous Coward · · Score: 0

    Lack of data loss during unexpected power outages or shutdowns was the primary reason people adopted ext3. Journaling was supposed to fix exactly this.

  18. not mounted sync,dirsync? by dltaylor · · Score: 4, Interesting

    When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion of the overall file system activity, or you might as well run on a ramdisk.

    While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.

    BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.

    1. Re:not mounted sync,dirsync? by rdoger6424 · · Score: 1

      Exactly. I think there should be a command to commit it to the media. Sync up an asymmetric file system. Something like fsync. The same fsync that has been around for decades.

      --
      "Hello 911? I just tried to toast some bread, and the toaster grew an arm and stabbed me in the face!"
    2. Re:not mounted sync,dirsync? by QuoteMstr · · Score: 1

      When I write data to a file (either through a descriptor or FILE*)...[delaying] 150 seconds is brain-damaged.

      You have a flawed and dangerous conception of how C's buffered IO functions work. Consider the following program:

      #include <stdio.h>
      #include <unistd.h>

      int main() {
              fprintf(stdout, "hi");
              sleep(150);
              _exit(0);
      }

      This program will output nothing. fprintf, fwrite, etc. write to C streams (a FILE*). Each stream can be under one of three buffering schemes, selected by setvbuf:

      1. unbuffered
      2. line-buffered
      3. fully buffered

      (By default, standard output is typically fully buffered unless it is connected to a terminal, in which case it is fully buffered. Incidentally, this default buffering is one reason for the pty(7) system.)

      When a stream is unbuffered, IO will be sent directly to the underlying system (in the POSIX case, write()) after each C stream operation. However, in the other two modes, data will be buffered forever unless certain conditions are met. These conditions are:

      • Forcing a flush explicitly with fflush
      • Closing a stream with fclose
      • An orderly shutdown of the process
      • In line-buffered mode, writing a newline
      • Filling up the internal stream buffer with data to be written

      There is no timed flush. If a program terminates (say, by pulling the plug) without these conditions having been met, buffered data are lost.
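
      A small sketch that makes the buffering visible, assuming a POSIX system where stat() reports the size the kernel knows about. The function names fs_size and buffered_write_demo are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Size of `path` as the file system currently sees it, or -1 on error. */
long fs_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/* Hypothetical demo: records the file size before and after fflush().
 * Returns 0 on success, -1 on error. */
int buffered_write_demo(const char *path, long *before, long *after)
{
    FILE *fp = fopen(path, "w");
    if (!fp)
        return -1;

    /* Request full buffering explicitly; this must precede any other I/O on fp. */
    if (setvbuf(fp, NULL, _IOFBF, BUFSIZ) != 0) {
        fclose(fp);
        return -1;
    }

    if (fputs("hi", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    *before = fs_size(path);  /* "hi" is still sitting in the stdio buffer */

    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    *after = fs_size(path);   /* handed to the kernel, not necessarily on disk */

    return fclose(fp) == 0 ? 0 : -1;
}
```

      On a typical libc the first size should be 0 and the second 2, since the two bytes only leave the process when fflush() runs.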

    3. Re:not mounted sync,dirsync? by Anonymous Coward · · Score: 0

      I expect it to be stored on media at the earliest practical moment.

      The problem is as MLC flash takes over the drive market this time is going to be pushed out farther and farther... 150 seconds may seem 'brain damaged' from a rotating media point of view but in a SSD this could save incredible amounts of wear on the flash.

    4. Re:not mounted sync,dirsync? by QuoteMstr · · Score: 1

      Grr. Standard output is line buffered when connected to a terminal, not fully buffered. That was a silly typo.

    5. Re:not mounted sync,dirsync? by dltaylor · · Score: 1

      Just a shorthand for fwrite()/fflush().

      But the fflush()/fclose()/... only write the stream data to the file system. None of them, nor filling the stream buffer, has any semantics to force the file system to actually store the data on media. fsync() does, but does not guarantee that the directory structure is flushed.
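
      A sketch of the three layers mentioned above, assuming a POSIX system where a directory can be opened read-only and passed to fsync() (this works on Linux). The helper name save_durably and its parameters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to `path` (which lives in `dir`) and
 * push it through each layer. Returns 0 on success, -1 on error. */
int save_durably(const char *dir, const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (!fp)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }

    /* Layer 1: stdio buffer -> kernel. */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    /* Layer 2: kernel page cache -> storage device. */
    if (fsync(fileno(fp)) != 0) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    /* Layer 3: sync the directory so the entry naming a new file is persisted. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

      Each of the three steps can block on the disk, which is where the performance cost comes from.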

      open() O_SYNC also requires flushing the data to media.

      However, forcing the application writer to put in flurries of fflush() or open() files O_SYNC, really does have a nasty effect on file system performance.

      150 seconds is simply too long. If we're going to force the application writers to have "magic" knowledge of the underlying kernel/file system/hardware (SSDs), except in the case of embedded systems, then we've defeated the purpose of having an operating system (abstracting those things) in the first place. At the very least, we should have a tuning parameter to put bounds on write latency, but, honestly, right now, the best tuning parameter is to exclude ext4 until it is more robust, or the system designer (which no general-purpose distribution can do) has verified that the potential for data loss is tolerable.

    6. Re:not mounted sync,dirsync? by Anonymous Coward · · Score: 0

      So writing data and then crashing is ok if it was only another 4 seconds until the data would be written, but not ok if it was 149 seconds?

    7. Re:not mounted sync,dirsync? by Dog-Cow · · Score: 2

      Right. It's the rant of a fucked-up asshole. If the developer does not use a mechanism that GUARANTEEs writes to disk, how the fuck is it anyone else's fault? It isn't, you brain-damaged idiot.

    8. Re:not mounted sync,dirsync? by Anonymous Coward · · Score: 0

      Sadly, what you expect is really quite irrelevant. What is relevant is what the spec says and whether or not your software adheres to the spec. While one might wish that the spec said something different, only a fool ignores the spec.

      P.S. 150 seconds does seem more than a tad on the long side if not borderline brain damaged.

    9. Re:not mounted sync,dirsync? by Anonymous Coward · · Score: 0

      150 is not a terribly long time. If my computer is likely to crash in the next 150 seconds then I have a problem with my computer/OS.

  19. ZFS isn't invulnerable either by feld · · Score: 0, Redundant

    News at 11.

  20. Why SHOULD applications have to assume bad FSs? by nweaver · · Score: 1

    True, POSIX says that unless you do an fsync(), the file might never be written to disk before the system crashes. But Whiskey-tango-Foxtrot?

    What's wrong with "After a file is closed, it's synced to disk"?!?

    --
    Test your net with Netalyzr
    1. Re:Why SHOULD applications have to assume bad FSs? by caerwyn · · Score: 2, Informative

      Nothing- except that it's not in the spec.

      POSIX is like a contract. KDE is breaking the contract and then whining about it to ext4- which isn't breaking the contract. Just as in a court, KDE here doesn't have much of a leg to stand on.

      --
      The ringing of the division bell has begun... -PF
    2. Re:Why SHOULD applications have to assume bad FSs? by gweihir · · Score: 3, Informative

      What's wrong with "After a file is closed, it's synced to disk"?!?

      What, you want people to have to delay/stagger/coordinate their file closes in order to avoid overloading the filesystem? That is the wrong approach. close() just means that the application is done with the file. The sync calls are not a joke; they are there precisely because close() already has entirely sensible but different semantics. Anybody who wants close to also sync can code it that way without a problem. Anybody else probably does not want this behaviour in the first place.

      This is not hidden in any way. A simple "man close" not only warns of this, it also refers the reader to the fsync call. Anybody getting bitten by this did not do their homework.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    3. Re:Why SHOULD applications have to assume bad FSs? by Anonymous Coward · · Score: 0

      What's wrong with it is that this operation takes a very long time. Period.

      Please note: it only happens if the application actually *crashes*! The only "bug" here is that the Ext4 developers used too much of a lease time on writing to disk. Someone will put in a new number (perhaps matching ext3) and the problem is solved.

    4. Re:Why SHOULD applications have to assume bad FSs? by Anonymous Coward · · Score: 0

      True, POSIX says that unless you do an fsync(), the file might never be written to disk before the system crashes. But Whiskey-tango-Foxtrot?

      What's wrong with "After a file is closed, it's synced to disk"?!?

      People don't fsync() all the time because it's SLOW. Not just a little slow, but RTFS's bug report for the link to the Firefox 3 bug due to performing 8 syncs per page load: if there's any IO going on, firefox ground to a halt to wait its turn to ensure that your bookmarks and history and cookies and everything else were really, really written to disk.

      What people are complaining about is that they want the OS to do a magical sync for them. One that happens as soon as they close the file, but that doesn't actually take any time to perform, even if their applications save 100 files in a row. They've already gotten their pony, now they want a unicorn.

    5. Re:Why SHOULD applications have to assume bad FSs? by mkcmkc · · Score: 1

      What's wrong with "After a file is closed, it's synced to disk"?!?

      That's what happens now--it's just that "after" might be "quite a while after". If you're doing this after the file is closed, though, it's all a race anyway--the only question is how wide the window of evil is.

      If you want it to be right, you need a synchronous call. That's what fsync and friends are for.

      --
      "Not an actor, but he plays one on TV."
    6. Re:Why SHOULD applications have to assume bad FSs? by Eunuchswear · · Score: 2, Informative

      People don't fsync() all the time because it's SLOW. Not just a little slow, but RTFS's bug report for the link to the Firefox 3 bug due to performing 8 syncs per page load: if there's any IO going on, firefox ground to a halt to wait its turn to ensure that your bookmarks and history and cookies and everything else were really, really written to disk.

      Well, it has to be said that fsync() on ext3 is slow because of an ext3 bug - fsync() is the same as sync() on ext3.

      --
      Watch this Heartland Institute video
    7. Re:Why SHOULD applications have to assume bad FSs? by swilver · · Score: 1

      That's all nice and stuff, but IMHO filesystems should at least preserve the order of the actions they sync. So I don't care that when I do A, B, C, D and then E I could end up with A + B + C or even just A, or nothing at all; that's fine.

      However, ending up with something like:

      A + D + E

      seems to be highly undesirable. If the filesystem decides to do a sync on its own, it should at least sync everything up to a certain point in time, not pick and choose whatever happens to be handy. Most filesystems in fact do this, and the only thing they do not guarantee is that actual content *during* a sync is 100% safe (which means you only corrupt a file that was being written at the time of the crash). EXT4 seems to take it to a new level and will sync meta-data ahead of time without even attempting to write out the data that was part of earlier steps...

    8. Re:Why SHOULD applications have to assume bad FSs? by Trogre · · Score: 1

      Please cite the article where KDE claimed POSIX compliance.

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
  21. Re:Exactly by Psychotria · · Score: 1

    If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples

    Yeah, but it's not just ext4, it's any modern filesystem. If the application writes thousands of individual files (without fsync()) and there is a power failure or system crash then data loss is possible. This isn't ext4's 'fault' any more than it's the application's 'fault'. It isn't a bug or a bad design decision either; it's just how things are.

  22. Re:Exactly by Psychotria · · Score: 1

    Your comment scares me. See my comment below. Are you sure you're not me?

  23. Re:Exactly by msuarezalvarez · · Score: 1

    No, not really. Journalling is done so that after a crash the filesystem is in a consistent state, and that does *not* include the no-data-loss requirement you are talking about.

  24. Translation by microbee · · Score: 3, Insightful

    We use techniques that show great performance so people can see we beat ext3 and other filesystems.

    Oh shit, as a tradeoff we lose more data in case of a crash. But it's not our fault.

    Honestly, you cannot eat your cake and have it too.

    1. Re:Translation by PhrostyMcByte · · Score: 1

      Honestly, you cannot eat your cake and have it too.

      Say what you will about Vista (as we all have), but it did get one exceptional feature: Transactional NTFS. The filesystem's default behavior is like it always was - journal enough to keep the FS correct, but screw the user's data. But there are also new APIs that let you use BEGIN/COMMIT/ROLLBACK for the filesystem. You can group file creations, deletions, moves, writes, etc. into this, just like people have been doing forever in databases. I think this is a good middle ground.

    2. Re:Translation by spitzak · · Score: 1

      Can you combine delete and rename into one of these?

      Congratulations, you have finally implemented atomic rename, which Unix has had for THIRTY YEARS on machines with 16K of memory. And it requires 4 calls instead of one. Wow I am so impressed.

      Transactions like this are pretty interesting but don't fool yourself into thinking the complicated solution is what is needed all the time.

  25. A Windows-like registry can not be the answer. by Anonymous Coward · · Score: 0

    Ts'o may argue that we can't "have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories", but isn't UNIX philosophy all about having just that? And isn't it the filesystem's job to handle the files? Fix EXT4!

    1. Re:A Windows-like registry can not be the answer. by billcopc · · Score: 2

      Unix philosophy is to make configuration files user- and script-editable. NOT to create hundreds of files per app making it utterly unmanageable.

      --
      -Billco, Fnarg.com
    2. Re:A Windows-like registry can not be the answer. by EsbenMoseHansen · · Score: 1

      Unix philosophy is to make configuration files user- and script-editable. NOT to create hundreds of files per app making it utterly unmanageable.

      Neither Gnome nor KDE creates hundreds of files per application. They create a few, usually one per application. However, when KDE or Gnome starts many applications are started simultaneously, and thus you get the "lots-of-small-files-at-once"-syndrome.

      Also note that KDE used to have lots of fsyncs as a workaround to XFS, which had a similar behaviour. But since XFS was allegedly fixed, and since especially laptop owners got hit pretty hard by those fsyncs(), they were removed. Dig through the kde-devel-core mailing list for the gory details.

      Looking at the code, I fail to see how it is the application's fault. The code in question is

      FILE *fp = KDE_fdopen(fd, "w");
      if (!fp) {
          close(fd);
          return false;
      }
      QFile f;
      if (!f.open(fp, QIODevice::WriteOnly)) {
          fclose(fp);
          return false;
      }
      writeEntries(locale, f, writeMap);
      f.close();
      fclose(fp);

      The only other reasonable approach I can see would be the old create-copy-and-rename trick, which would require a scratch-file.

      Perhaps the glibc interface could also be blamed for not providing a way of doing this the right way, whichever way that is. After all, rewriting a file with new contents in place is a fairly common operation, and if it is difficult and error-prone, it shouldn't be every application developer's responsibility to get right.
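
      For what it's worth, the create-copy-and-rename trick could be sketched roughly like this in Python. This is only an illustration, not what KDE does: `atomic_write` and the `.tmp-` prefix are made-up names, and it assumes the scratch file can be created in the same directory as the target.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The scratch file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to the device
        os.replace(tmp_path, path)   # rename the scratch file over the old one
    except BaseException:
        # Best-effort cleanup of the scratch file before re-raising.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

      The old file is never truncated; it is only swapped out once the new contents have been synced. Note that mkstemp() creates the scratch file readable by its owner only, so a real implementation would probably want to copy the original file's permissions as well.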

      --
      Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
  26. Re:Exactly by somenickname · · Score: 0, Flamebait

    Wait, are you saying the crashing of an alpha level OS could cause data loss? I find this unfathomable.

  27. Re:Exactly by pc486 · · Score: 1

    Lots of small files isn't bad on its own. In fact, it's downright common. Ext4's design does consider this case and makes these operations efficient.

    The problem with small files is data consistency. If the application requires a file hierarchy and associated buffers to be on disk before continuing, then a call to fsync() is required (even on ext3). Implicitly syncing on every small file will kill performance, so don't do that.

  28. Top down reliability? by oneofthose · · Score: 0

    So is this a new trend to design systems? Make them reliable from top to bottom? Designing an upper-layer part of the system to work around the flaws of a lower-layer system component is often necessary but is not the right thing to do. Telling application developers to change their applications because a new version of the file system breaks their stuff is madness. No matter what POSIX standards say: it worked before, it is broken now: go fix it.

    1. Re:Top down reliability? by un1xl0ser · · Score: 1

      Right, because this has never been done before....

      I can't imagine such a system that is widely used today where layers were added to build out functionality, and work around various issues below in the stack, hardware issues, et cetera.

      This truly is madness.

      http://en.wikipedia.org/wiki/OSI_reference_model

      --
      v4sw6PU$hw6ln6pr4F$ck 4/6$ma3+6u7LNS$w2m4l7U$i2e4+7en6a2X h
    2. Re:Top down reliability? by Qzukk · · Score: 2, Informative

      change their applications because a new version of the file system breaks their stuff is madness

      Their applications were already broken, committing everything every 5 seconds* regardless of what the applications had wanted was the workaround in ext3, but I guess it's only madness when street-makers demand that you drive with round wheels, not when you demand that street-makers accommodate your square ones.

      * Unless you increased the commit time to reduce power usage (eg laptop_mode)

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    3. Re:Top down reliability? by oneofthose · · Score: 1

      Those are valid points but I'm not sure how applicable they are in this situation. You have layers in a system to abstract functionality and hide problems or "difficult stuff" in a lower layer so that the engineers developing upper layer stuff don't have to think about it - they just use it and it works. In my opinion this is an example where the opposite happens. As an application developer I wouldn't want to think about the crazy internals of the underlying file system - I would simply use it and expect it to work.

  29. Actually, no. by Jane+Q.+Public · · Score: 2

    As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.

    1. Re:Actually, no. by muridae · · Score: 3, Insightful

      As a user of high-level languages, do not directly access the I/O API without knowing what it does. Use a higher level wrapper that properly interacts with the low level functions, and does all of the fsync and similar calls for you.

      If those high level wrappers do not exist, then do not blame the API developers for you not knowing how they work.

    2. Re:Actually, no. by msuarezalvarez · · Score: 1

      This is not a minor detail. POSIX file system semantics have *never* implied that writing to a file means the data is actually on the device. Are you saying that APIs should be resilient to developers using them under whatever unfounded expectations they may have about them?

    3. Re:Actually, no. by TheRaven64 · · Score: 2, Interesting

      As a user of a framework that doesn't suck, I don't have to worry about this problem. When I need to write a file in such a way that the entire operation either succeeds, or the entire operation fails (a common requirement), the framework I use provides a flag that I can set on the write operation to do all of the write/rename juggling that needs to happen, according to POSIX, to make it work. As such, my code will work happily on any filesystem that doesn't break the spec.

      If you are using a high-level language with a low-level framework, you might want to reconsider your approach.

      --
      I am TheRaven on Soylent News
    4. Re:Actually, no. by Anonymous Coward · · Score: 0

      As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.

      You're not being expected to know the ext4 API, you're expected to understand the C library, if you're working in C.

      I had the same misunderstanding that you did, so let me try to explain what is going on here:

      When you make a call to write (int filedes, const void *buffer, size_t size), its return doesn't guarantee the file is written to disk (in fact it doesn't even guarantee that the whole buffer you sent in is going to be written; it returns the number of bytes that actually were). Even the bytes that will be written to disk aren't written immediately; they sit in a buffer somewhere until enough data accumulates to make committing it to disk an efficient operation. You can witness the same thing by making printf calls and noticing how sometimes nothing gets printed to stdout even after the call is made. If you want to force this, you can call fflush(). If you want to force a write to disk, you can call fsync().

      What the KDE developers were doing was not bothering to call fsync at all. So if the computer crashes before enough data accumulates to actually commit the files to disk, they're screwed. It was incorrect behavior in ext3 too, except that with ext3, the data would be committed after 5 seconds even before the buffer is full. With ext4, that may be up to 60 seconds. If the computer doesn't crash, it doesn't matter, but if they require those settings to be committed to disk immediately while using write(), the C API requires them to force a sync.
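
      To make the two points above concrete (short writes, and write() not meaning "on disk"), here is a minimal sketch using Python's thin wrappers around the same POSIX calls. The function names `write_all` and `write_durably` are invented for this example.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Keep calling os.write() until every byte has been accepted."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)  # may accept fewer bytes than offered
        view = view[written:]


def write_durably(path: str, data: bytes) -> None:
    """Write `data` to `path` and do not return until fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)  # without this, the data may sit in kernel buffers for a while
    finally:
        os.close(fd)
```

      Leaving out the os.fsync() line gives the behavior described above: the program works fine as long as the machine stays up.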

    5. Re:Actually, no. by amirulbahr · · Score: 1
      You are a very prolific poster but you still have no point. What high-level language are you referring to?

      Java? See java.io.FileDescriptor.sync(). You might also want to read about java.io.Writer.flush() and how that doesn't mean data is written to the underlying device.

      Python? See os.fsync.

      No one is asking you to learn assembly, but at least understand the API of the language you are dealing with.

      If you're not too worried about the rare event of a crash or power-loss, then you don't even need to bother with any of that. Just write as you normally would and know that the system will deal with it in an efficient manner.

      So please get a clue before making a thousand posts on a subject you have clearly failed to comprehend.

    6. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      No... where did you get that idea?

      I was simply saying that not all "competent developers" are aware of this situation. In this case it is a matter that is properly handled by those who write the lower-level code, such as the compiler and interpreter.

      Not everybody programs in C, or Assembler. I can and have, but prefer not to most of the time (it would not be appropriate for the things I am doing).

    7. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      That was precisely my point. Thanks for reinforcing it.

    8. Re:Actually, no. by joe_bruin · · Score: 1

      As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter.

      If you need very specific behavior, as a high level developer you should be calling your environment's DoExactlyWhatIWantNotJustWhatIAssume() implementation (which in this case would be something like SyncFileDataToDisk() or TransactionCommit()). Implementing this function is for the authors of the interpreter or library set so that you don't have to understand the disk IO API. If your environment does not provide one of these, you're probably using the wrong tool for the job and you better make friends with some low level programmers.

    9. Re:Actually, no. by msuarezalvarez · · Score: 2, Insightful

      If a programmer is using a file API with POSIX semantics in any non-trivial way, and is not aware of the fact that POSIX does not specify any assurances that data will be written to the device unless fsync is called or another similar action is taken, then that programmer is *not* competent.

    10. Re:Actually, no. by grumbel · · Score: 1

      What about C99 (FILE, fopen, fclose, ...) or ISO-C++ (std::ofstream)?

    11. Re:Actually, no. by Anonymous Coward · · Score: 0

      Yes, damnit yes! You should know assembly of the chip for your calendar program. In *some* languages, like C, you need to know this stuff because that's where errors can happen (buffer overflows, for example).

      As a programmer, you are expected to know the language (and in some cases this requires knowing more than one), the ins and the outs. You are also expected to know the API you are using. You don't like the API? Tough. You can always make your own OS (yea..haha), or make a wrapper API.

      All this damn "people shouldn't have to know" crap this week is really starting to bug me. Learn something damnit and stop complaining when you don't know something you should. Even a damn school-kid would know this simple lesson!

    12. Re:Actually, no. by LWATCDR · · Score: 1

      Well for one thing this IS part of the language. I don't think there is a single clib that doesn't have fsync but I could be wrong.
      Second, then BLAME your language. If the language is supposed to work that way then they should code it so every write and close function also does an fsync.
      If you are making calls to the API or using a library YES YOU BLOODY WELL SHOULD KNOW HOW THEY WORK.
      And what does any of this have to do with knowing assembly?

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    13. Re:Actually, no. by gweihir · · Score: 1

      I was simply saying that not all "competent developers" are aware of this situation. In this case it is a matter that is properly handled by those who write the lower-level code, such as the compiler and interpreter.

      And here you are dead wrong. Unless you think higher-level languages deserve a dog-slow filesystem by default?

      There are details of the system you can (and should) hide from developers in higher level languages. This is not one of them. It cannot be hidden without extreme negative side-effects. Any competent developer needs at least to know the issue exists. And there needs to be a choice, as synchronous writes are unacceptable in many applications.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    14. Re:Actually, no. by LWATCDR · · Score: 1

      "I was simply saying that not all "competent developers" are aware of this situation."
      And you would be totally wrong.
      This is the way Posix works. If you are making Posix calls then you better know how they work. If you really think it should be handled by the interpreter or compiler then you should feel the bug is in the compiler or interpreter but not in the filesystem.
      But no, a developer who uses a language that DOESN'T document that IO is synced, and who doesn't use the library calls to do that sync, is incompetent.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    15. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      I disagree, to an extent.

      One of the reasons for the existence of high-level languages in the first place (such as Java) is that the programmer is not supposed to have to worry about OS-specific details. I dispute whether most Java programmers -- even the competent ones -- necessarily know about the sync issue. The existence of a sync function in the FileDescriptor class does not mean that everybody is familiar with it... or needs to be in most cases. I could be wrong about that, but I am not convinced that I am. Further, as someone else brought up, other languages also have their version of sync(). Ruby and Python for example.

      However, I will concede that if you are doing non-trivial disk I/O, it does behoove one to know about it. The question is: how many programmers (Ruby and Python for example) are doing non-trivial disk I/O today? Most Ruby and Python programmers -- even the competent ones -- do very little disk I/O at all, much less the non-trivial variety.

    16. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      I hereby retract that statement, clearly it was wrong. I should have stated "for simple I/O tasks" or something to that effect. One should not have to know about syncing to do something like a few simple file writes, but as someone else pointed out, if what you are doing is non-trivial then that is not sufficient.

    17. Re:Actually, no. by Anonymous Coward · · Score: 0

      As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.

      Then you better pick a high-level language developed by programmers that know the spec.

    18. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      Please see my reply just above.

    19. Re:Actually, no. by LWATCDR · · Score: 1

      I did. I just don't understand how you could call a developer competent and not expect them to read the documentation of the file system calls they are using.
      I am not the greatest developer of all time but I know how fsync works. Everybody makes mistakes but this was 100% the mistake of the application programmer that didn't call fsync. It was and is in no way a problem with or a limitation of the file system.
      Maybe competent means something different in your language than in English.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    20. Re:Actually, no. by gweihir · · Score: 1

      One of the reasons for the existence of high-level languages in the first place (such as Java) is that the programmer is not supposed to have to worry about OS-specific details.

      I agree. However, delayed writes are not a detail. They are a fundamental design characteristic of all modern OSes and filesystems. In addition, it is a detail that cannot be hidden without dramatic negative impact on performance. The problem I see here is how this gets communicated. Obviously there are deeply defective programming courses that teach you file I/O but do not warn you of this. On the other hand, maybe the documentation for file I/O in high-level languages should just follow the Unix system call example and warn you. After all, a C programmer does not understand delayed write better than, say, a Java programmer. The difference is that the C programmer gets a warning when looking at "man close".

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    21. Re:Actually, no. by Anonymous Coward · · Score: 0

      What high-level? I just looked up all those recommended manpages (i do not program). And even i understand now what fsync, syncer, sync etc. mean.
      To cite man sync:
      "The sync utility can be called to ensure, that all disk writes have been completed before the processor is halted in a way not suitably done by reboot or halt." Maybe a syllogism is to assume that fsync is the same in C. Maybe not.

      I supposed that stuff is standard curriculum or at least self explanatory for everyone who fills files with ones and zeroes.

      If you dig a hole you have to worry about cables even if you only dig 20cm deep.

    22. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      What I was saying in my example (Ruby and Python) is that probably most programmers in those languages today do very little file I/O at all -- if any -- and would probably never even need to know about the sync issue. Don't misunderstand me, many would... but if it's something they (almost) never do, and when they do, they do trivially, then it's simply not an issue. I have worked with Ruby full-time for over 3 years and in all that time I have needed to write code to access the filesystem exactly twice: one file read and one file write, and in neither case was I concerned with sync although I know it's there.

      Would you expect all competent Java Programmers to be fluent in calls to Graphics, even if they work on back-end database operations full time? That's all I'm saying.

      But as for the limitation thing: I think you and some others have mistaken my claim of "limitation" for a claim of "fault". The need to do a sync before data is guaranteed to be written is a limitation of the filesystem. It might be a reasonable limitation, and it may be a limitation that is common to several filesystems, but it is still a limitation.

      Maybe I am off-base here, but in my ideal filesystem, the disk I/O operations of any given execution thread would be queued chronologically: data that had been written but not yet flushed to disk would not be accessible until it was. (If it could be reliably read from the buffer before being committed to disk so much the better, but I don't think that would be practical.) Since -- as has been clearly demonstrated here -- it is not possible to properly access that data anyway, that should introduce no delays in the overall disk operation. Then any data areas for which there are access requests in the queue are written out to disk first. So only files with pending requests are synched right away; others are free to be optimized as the FS sees fit. To me, that would seem to be the right balance. The cost would be the overhead of managing the queues in this way. Is it practical? I don't know.

    23. Re:Actually, no. by Anonymous Coward · · Score: 0

      As a user of a high-level language, you should know how your high-level language's framework API handles files. For example, check out Python's docs on the file object:

      file.write(str)

              Write a string to the file. There is no return value. Due to buffering, the string may not actually show up in the file until the flush() or close() method is called.

      http://docs.python.org/library/stdtypes.html#file-objects
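
      A small sketch of what that documentation note means in practice, written for Python 3 (the quoted docs are for the older file object). The file name and contents are made up. It also shows the extra step the docs don't mention: flush() only hands the bytes to the operating system, and os.fsync() is the separate call that asks the OS to commit them.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "settings.txt")

f = open(path, "wb")                        # buffered binary file object
f.write(b"colour=blue\n")
size_before_flush = os.path.getsize(path)   # bytes are still in Python's own buffer

f.flush()                                   # hand the bytes to the operating system
size_after_flush = os.path.getsize(path)    # now visible to other processes

os.fsync(f.fileno())                        # ask the OS to commit them to the device
f.close()
```

      Here size_before_flush is 0 and size_after_flush is 12 (the length of the string), because a write this small stays in the file object's buffer until it is flushed.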

    24. Re:Actually, no. by LWATCDR · · Score: 1

      You did say it was a flaw. Not in that post but you did say it was. It has been a long time since I took my OS class but even then that is not how file systems worked.
      First file systems do work the way you want in a file handle. Once that handle is closed then if you are going to do anything else with that file you must call fsync.
      The reason is performance. Back in the day good filesystems used what is called elevator seeking. The head of the drive would travel up and down the drive like an elevator. If the writes and reads were always sequential that poor drive head would be darting all over the place and the drive performance would be terrible.
      Now, you could add a check so that before any new operation was done, an fsync is called if needed. The problem with that is that the check would be called thousands or even millions of times for no good reason just to allow a programmer to not make a call to fsync.
      It isn't worth it and it is documented.
      Trust me the exact same issue can happen on many file systems and it is up to programmers to read the docs and follow them.
      I can not imagine writing a lot of software without ever having to do disk io even if it is just to save configuration information.
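
      For anyone who hasn't seen elevator seeking, here is a toy sketch of the idea. It is a simplification, not how any particular kernel or drive schedules I/O, and `elevator_order` and `total_travel` are names invented for this example: pending block numbers are served in one upward sweep and then one downward sweep, instead of in arrival order.

```python
def elevator_order(head: int, requests: list) -> list:
    """Order block numbers as one upward sweep followed by one downward sweep."""
    upward = sorted(r for r in requests if r >= head)
    downward = sorted((r for r in requests if r < head), reverse=True)
    return upward + downward


def total_travel(head: int, order: list) -> int:
    """Sum of head movements needed to visit `order` starting from `head`."""
    distance = 0
    for block in order:
        distance += abs(block - head)
        head = block
    return distance
```

      Comparing total_travel() for the arrival order against the elevator order on a scattered request list shows why the filesystem wants the freedom to delay and reorder writes.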

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    25. Re:Actually, no. by Anonymous Coward · · Score: 0

      As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter. Do not expect me to know Assembly language for a given chip, for example, in order to implement a calendar program. The very idea is ridiculous.

      Stop saying bullshit.

      The kernel, while the fs lives, says that until you have done a fsync you aren't guaranteed anything. Do whatever you want in user space.

      If the alleged high level language does not present you with a proper interface for disk I/O (which includes reads and writes btw, not just fsync) you will either pay the price of the fsync everywhere or you won't have any guarantees.

      Perl has a sync, ruby has a sync, python has a sync. Even java has a sync (FileDescriptor.sync() for example). So quit spewing bullshit when you don't know what you're talking about.

    26. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      Yes I did. I have revised my opinion somewhat but this is my current opinion: yes the programmers messed up, but yes the filesystem has some serious limitations.

      I understand why delayed writes exist and what the performance issues are. But a limitation is still a limitation, regardless of the reason for its existence.

      If you cannot imagine doing a lot of programming without doing a lot of disk IO then you are not familiar with the current state of Web development.

    27. Re:Actually, no. by shutdown+-p+now · · Score: 1

      One of the reasons for the existence of high-level languages in the first place (such as Java) is that the programmer is not supposed to have to worry about OS-specific details. I dispute whether most Java programmers -- even the competent ones -- necessarily know about the sync issue. The existence of a sync function in the FileDescriptor class does not mean that everybody is familiar with it... or needs to be in most cases. I could be wrong about that, but I am not convinced that I am.

      Doesn't Java call fsync() when closing the stream associated with the file, however? This covers 99% of cases where you'd use fsync() on lower level.

    28. Re:Actually, no. by ChienAndalu · · Score: 1

      Out of curiosity, I checked how Python handles fsync() stuff - turns out that it's the same as in C.

      I never heard about this and often coded with the assumption that a flush() writes to disk, but I was wrong.

    29. Re:Actually, no. by swillden · · Score: 1

      I dispute whether most Java programmers -- even the competent ones -- necessarily know about the sync issue.

      From the java.io.OutputStream.flush() Javadoc:

      If the intended destination of this stream is an abstraction provided by the underlying operating system, for example a file, then flushing the stream guarantees only that bytes previously written to the stream are passed to the operating system for writing; it does not guarantee that they are actually written to a physical device such as a disk drive.

      Is it your contention that "competent" programmers are not in the habit of reading the API documentation?

      I agree that hiding the sync() function in the FileDescriptor class is poor design, but the documentation of flush() makes quite clear that if programmers wish to ensure that their data actually gets to the disk, they have to do something more.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    30. Re:Actually, no. by snemarch · · Score: 1

      > The need to do a sync before data is guaranteed
      > to be written is a limitation of the filesystem.
      > It might be a reasonable limitation, and it may
      > be a limitation that is common to several
      > filesystems, but it is still a limitation.

      It's not a limitation, it's a feature - seriously. This way, application developers get *some* control over performance aspects of their code.

      sync should be seen as an expensive operation, and only used when necessary. At the same time, you shouldn't *depend* on a low-level call like write() to do buffering - doing lots of 1-byte writes is a bad idea, buffer your stuff.
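
      A rough sketch of the "buffer your stuff" advice, assuming the records are small and fit comfortably in memory. `write_records` is a made-up helper; it returns the number of write() calls it needed so the difference is visible.

```python
import os


def write_records(fd: int, records: list) -> int:
    """Join many small records in memory, then hand them over in as few writes as possible."""
    payload = b"".join(records)       # cheap: happens entirely in user space
    view = memoryview(payload)
    calls = 0
    while len(view) > 0:
        written = os.write(fd, view)  # usually a single system call for the lot
        view = view[written:]
        calls += 1
    return calls
```

      Writing the same records one os.write() at a time would cost one system call per record, whether or not anything is ever synced.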

      --
      Coffee-driven development.
    31. Re:Actually, no. by gweihir · · Score: 1

      Well, I agree to that.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    32. Re:Actually, no. by Jane+Q.+Public · · Score: 1

      It's only a feature in the sense that it gives the programmer a workaround for the limitation. Seriously: an ideal filesystem would not require manual synching; that would be handled behind the scenes. Practical considerations today say that manual synching is required. Fine. BUT THAT DOES NOT MEAN THAT IT IS NOT A LIMITATION!!! I honestly do not see where the failure in communication has been here.

    33. Re:Actually, no. by Anonymous Coward · · Score: 0

      There is no such thing as a "simple I/O task". I/O tasks are all of the same complexity in application space. read data in, write data out, handle conditions. If you prefer to do something specific at certain conditions, build yourself a wrapper.

      "One should not have to know about syncing [...]"

      Says who? And how do you know you don't want to sync when you don't know about it? You obviously have a couple misconceptions in that head of yours.

    34. Re:Actually, no. by Anonymous Coward · · Score: 0

      No, that was not your point. There is nothing that limits these frameworks to interpreted languages. Furthermore, that they exist in interpreted languages actually means the language has to handle traditional I/O semantics.

      As I said above write a wrapper (or use one such as this framework). Nothing to do with languages. But not knowing your language (including what happens in the framework) will make your code bad and undoubtedly bring in similar bugs in other subsystems.

      Ignorance is no excuse.

    35. Re:Actually, no. by Peeteriz · · Score: 1

      This problem doesn't affect at all "Ruby and Python developers who rarely use files".

      This problem affects developers who call directly the POSIX API write() without properly calling the fsync().

      If you are using a high-level language, then its high-level method .WriteItAllToDiskAndBeSureItsOK() will either do these tasks properly (and won't be affected), or has the same bug in its implementation as the mentioned Gnome configuration file handling.

    36. Re:Actually, no. by EsbenMoseHansen · · Score: 1

      No. What KDE and Gnome do is overwrite small files, and be surprised when this results in a truncated (0-sized) file for a lot of files. As I gather, what happens is that ext4 immediately truncates the files when they are opened, but delays the writing of the new data for 1-2 minutes. That is certainly within the standard, but still a rather broken behavior, that will needlessly break a lot of applications. Had the truncation been delayed till the actual write to disk, everything would have been fine, and the data loss minimal, which is really the best you can hope for in this situation.
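
      The overwrite pattern can be illustrated with a few lines of Python (file name and contents are made up). This only shows what a running system sees; what ends up on disk after a crash depends on the filesystem, which is the whole argument here.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "kderc")

with open(path, "w") as f:
    f.write("[General]\nTheme=oxygen\n")
old_size = os.path.getsize(path)

f = open(path, "w")                   # opens with truncation: old contents are gone right here
size_after_open = os.path.getsize(path)
f.write("[General]\nTheme=air\n")     # new contents are only queued so far
f.close()
new_size = os.path.getsize(path)
```

      Between the second open() and the moment the new data really reaches the disk, the only thing that is certain is that the old contents are gone (size_after_open is 0).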

      --
      Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
    37. Re:Actually, no. by EsbenMoseHansen · · Score: 1

      Yes, damnit yes! You should know assembly of the chip for your calendar program. In *some* languages, like C, you need to know this stuff because that's where errors can happen (buffer overflows, for example).

      That is not really true. Only a C-program that does something to cause "undefined" or "implementation-defined" behavior will cause errors due to how the chip is working, except perhaps for longer execution times.

      As a programmer, you are expected to know the language (and in some cases this requires knowing more than one), the ins and the outs. You are also expected to know the API you are using. You don't like the API? Tough. You can always make your own OS (yea..haha), or make a wrapper API.

      All this damn "people shouldn't have to know" crap this week is really starting to bug me. Learn something damnit and stop complaining when you don't know something you should. Even a damn school-kid would know this simple lesson!

      The real complaint is that requiring people to remember to do stuff the right way every time is inviting disaster, and is rather pointless. Instead, such details should be written in a few places and everywhere else should just point to those few places. And lo and behold! That is exactly what Gnome and KDE do. As a bonus, should the ext4 people actually be right that their behavior is good and desirable, we only have to fix a few lines of code.

      And don't scoff at the notion of choosing the best API. In this case, candidates could be e.g. boost::filesystem or Qt for the C++ crowd.

      --
      Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
    38. Re:Actually, no. by LWATCDR · · Score: 1

      I have done a good amount of Web development. I used a good amount of file io for configuration and such. I am not fond of the set-variables-in-the-source-file method of setting up configuration. Yes I know everybody uses it but I don't. But yes, Web programming seems to tend to go with MySQL as a file system replacement.
      If you want to call it a limitation well then fine except that EVERYTHING has some limitation. There is just no way that this can be blamed on the file system at all. The cause is simply in this case application programmer IQ error.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    39. Re:Actually, no. by snemarch · · Score: 1

      I - as many others - disagree. The filesystem does keep things synced, when it decides to. It's documented pretty well, and you have manual override if you need more precise control. If you need everything synced, you can specify that as a filesystem mount option.

      If the filesystem always synced after every operation, THAT would be a limitation, since it would be a major performance hit with no way to avoid it.

      The current commit policy of EXT4 might be dangerously long, but if that helps getting rid of *bugs* in other software, then all is good.

      --
      Coffee-driven development.
  30. aww shucks... by Anonymous Coward · · Score: 0

    I thought everybody knew that you don't use a new filesystem until it's stable enough for even Debian to use it as their default.

  31. Alarmist and ignorant article - not a "problem" by ivoras · · Score: 4, Insightful

    *No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out except if the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason" - actually, most guarantee only that the file system itself will recover, not the data). It is universally true - for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.

    The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.
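
    A minimal sketch of the ordering point, assuming a POSIX-like system and Python's standard os module. The helper names (write_durably, write_in_order) and the "data file, then marker file" scenario are hypothetical, chosen only for illustration. The first file is fsync()ed before the second write is even issued, so the second cannot reach the disk ahead of the first:

```python
import os


def write_durably(path: str, payload: bytes) -> None:
    """Write payload to path and return only after fsync() has returned."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than asked; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)


def write_in_order(first_path: str, first: bytes,
                   second_path: str, second: bytes) -> None:
    """Force `first` to be flushed before `second` is written at all."""
    write_durably(first_path, first)
    write_durably(second_path, second)
```

    Whether the data survives a power cut still depends on what the drive or controller does with its own cache; the sketch only covers the ordering that fsync() gives between the application and the kernel.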

    This is an issue of great sensitivity for databases. See for example:

    That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.

    --
    -- Sig down
    1. Re:Alarmist and ignorant article - not a "problem" by Anonymous Coward · · Score: 0

      that's why I never write anything to disk but only to the network

    2. Re:Alarmist and ignorant article - not a "problem" by rpp3po · · Score: 1

      You did not understand the bug. It's not that people expect actual writes without calling fsync(). It's that ext4 decouples file deletions caused by opening files with the O_TRUNC flag from the actual writes of the files' new contents. This is not necessary. Ext4 could delay deletion on disk until it actually writes any changed contents to disk.
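
      A small sketch of the window being described, assuming Python on a POSIX-like system; the function name is hypothetical. It only shows what the application can observe: opening with mode "w" (which uses O_TRUNC) empties the file immediately, before any new contents have been written. It does not demonstrate what ext4 puts on disk after a crash.

```python
import os


def size_after_truncating_open(path: str) -> int:
    """Open path for rewriting and report its size before anything is written."""
    f = open(path, "w")  # mode "w" truncates the existing file at open time
    try:
        return os.path.getsize(path)
    finally:
        f.close()
```

      If the machine goes down between this point and the moment the new contents are flushed, an empty file is one possible result.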

    3. Re:Alarmist and ignorant article - not a "problem" by grumbel · · Score: 1

      *No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out

      They might not guarantee it 100%, but some at least are still pretty good at keeping your files safe, while others are completely unusable piles of garbage, unfit for desktop use. Little anecdote: I never lost a single file due to a crash on reiserfs or ext3 in a decade, yet after I switched to XFS I lost files on the *first* day and after almost every crash after that (mostly gconf files and the Wine registry got nullified). Completely unusable.

      Unlike servers, desktop systems tend to run unstable graphics drivers, or are laptops that run out of battery. So crashes are quite common on desktop systems and are something a file system has to deal with, not just ignore while deleting the user's files.

    4. Re:Alarmist and ignorant article - not a "problem" by ivoras · · Score: 1

      Ext4 could delay deletion on disk until it actually writes any changed contents to disk.

      I think the problem is that every heuristic has a pathological case. In this case it boils down to "how long should it wait?". Of course, there are more heuristics that can be stacked on top of that like "use a timer", "wait until the file descriptor is closed", etc. but in the extreme this leads to an explosion of possibilities and the code starts resembling an expert system or an AI :) Better to educate developers to know how the system really behaves, and what is guaranteed and what isn't.

      For an explanation, imagine you're running a file system by hand, on a piece of paper or in your head. What you receive are certain simple instructions like "create file F", "write this data to position X in file F", "read data from position X, length Y in file F", "delete file", "truncate file", etc. (there are many more but these are the obvious ones). What you don't get is any kind of knowledge of what future requests will be - you must work with requests as they come in and have only two freedoms: 1) freedom to actually write or read the data when you decide it will be most appropriate (i.e. fastest) and 2) freedom to choose when to acknowledge to the application that the operation is done (obviously this is more important for writing than for reading). And you do need to actually come up with some clever way to use these freedoms because doing synchronous requests is *very* slow.

      AFAIK some historically used heuristics are:

      • "Optimize for small or quickly deleted files - like in a busy mail server or a compiler" : when a "create file" request comes, acknowledge it but don't write anything; wait until some data arrives (some threshold), and if the file is closed then decide where to put it (so small files don't get fragmented, which is the worst case), or don't write it at all if it's unlinked (so small temporary files don't ever touch the physical drive - very desirable). Of course, while this is optimal for placement of small files, it will lose files like crazy on a busy file server.
      • "Log synchronously and periodically checkpoint everything" : can also be fast (at least for writing), but if power goes down, ops between checkpoints could all be lost. Also, checkpoints can physically separate what could be logically close operations - like in this case O_TRUNC and writing something.

      I think ext4 uses a combination of these.

      An interesting choice of a set of heuristics that strongly relies on file system request ordering is the BSD UFS's soft-updates but note that even with it the user data is not guaranteed to be preserved.

      --
      -- Sig down
    5. Re:Alarmist and ignorant article - not a "problem" by Eunuchswear · · Score: 1

      *No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out except if the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason" - actually, most guarantee only that the file system itself will recover, not the data). It is universally true - for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.

      Untrue. With VXFS the default mount option, "-o mincache=closesync" guarantees that the data will be on the disk when the close() finishes.

      --
      Watch this Heartland Institute video
    6. Re:Alarmist and ignorant article - not a "problem" by ivoras · · Score: 1

      And how many desktops are using Veritas software? :)

      --
      -- Sig down
    7. Re:Alarmist and ignorant article - not a "problem" by Eunuchswear · · Score: 1

      Mine?

      --
      Watch this Heartland Institute video
  32. Re:Bull... by caerwyn · · Score: 1

    They are.

    The only calls that say data is written are the fsync() family (or files opened with O_SYNC).

    Those calls do not lie.

    Unfortunately, application developers are assuming that other calls say something they do not, in fact, say. This is where the problem comes in. close() does not guarantee that data has been written yet. fsync();close(); does.
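
    A minimal sketch of the "fsync(); close();" sequence, assuming Python's buffered file objects on a POSIX-like system; save_and_sync is a hypothetical name. Note that there are two buffers to get through: the language runtime's own buffer and the kernel's page cache.

```python
import os


def save_and_sync(path: str, text: str) -> None:
    """Write text to path, flushing it to the device before the file is closed."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        f.flush()             # runtime buffer -> kernel page cache
        os.fsync(f.fileno())  # kernel page cache -> storage device
    # Leaving the with-block calls close(); by then fsync() has already returned.
```

    Dropping the two middle calls leaves a plain write-then-close, which is the pattern the comment says carries no guarantee.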

    --
    The ringing of the division bell has begun... -PF
  33. Nobody ever said ZFS is fast.... by Anonymous Coward · · Score: 0

    There's no magic in ZFS Intent Log. It slows down writes greatly, but give back data integrity. Google "zil_disable" to find a bunch of people that are surprised by the slow write performance of ZFS.

  34. In other words, the Windows Registry by Anonymous Coward · · Score: 0

    No thanks. Been there. Suffered through that. Give me lots of little files.

    DON'T hide them in an all-or-nothing database!

  35. Re:Exactly by Anonymous Coward · · Score: 0

    Can you please buy a brain and then return here? I wonder how you even get modded up; the amount of crap you are posting here is just amazing.

    Go read the specs or even just man fsync
    It has not changed in *ages*.

    The fact is that with ext3 delayed writes were only 5 seconds apart, so by *SHEER LUCK* any application that didn't use fsync *MOST OF THE TIME* did not have problems.

    Now if you think that properly written applications should keep relying on *LUCK* instead of properly using the POSIX interfaces Linux relies on, then go troll elsewhere. Probably a Visual Basic forum is right about your level of knowledge ...

  36. man 2 fsync by Nicolas+MONNET · · Score: 5, Informative

    The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.

    1. Re:man 2 fsync by setagllib · · Score: 2, Interesting

      No, disk caching is now considered the default. Nothing is written until the disk decides it is time, and this is completely up to them. It doesn't even have to occur in the same order the writes were issued in, especially with TCQ.

      --
      Sam ty sig.
    2. Re:man 2 fsync by SaDan · · Score: 1

      All of which can be disabled, or should be disabled on critical systems.

    3. Re:man 2 fsync by setagllib · · Score: 1

      Not my point - the point is the filesystem doesn't guarantee anything, fsync just gets it one layer further (which may or may not be enough depending on your controller, disk, etc.)

      --
      Sam ty sig.
    4. Re:man 2 fsync by gnasher719 · · Score: 1

      The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.

      Which may be true, but is missing the point. Your computer can crash just before fsync is called, so relying on fsync doesn't really help if you want to ensure that the data has been written. It only helps if you need to _know_ that it is written (for example, if you want to delete the original after the copy has been safely written).
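
      The "delete the original only after the copy is known to be written" case could be sketched as follows, assuming Python on a POSIX-like system; move_after_sync is a hypothetical helper, not an existing library function.

```python
import os
import shutil


def move_after_sync(src: str, dst: str) -> None:
    """Copy src to dst, wait for fsync() on the copy, then delete src."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()             # runtime buffer -> kernel page cache
        os.fsync(fout.fileno())  # returns once the kernel reports the data flushed
    # Only now is it reasonable to drop the original.
    os.remove(src)
```

      A more cautious version would also fsync() the directory containing dst, since on Linux fsync() on the file alone does not necessarily make the new directory entry durable.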

    5. Re:man 2 fsync by SaDan · · Score: 1

      Not arguing with you, just stating that if something like relying on the hardware cache on a hard drive concerns someone, they should be able to disable that feature.

    6. Re:man 2 fsync by davecb · · Score: 1

      Non-journaled filesystems had that behavior, and could become internally inconsistent. So we created journaled ones, so that one wouldn't have to suffer that behavior except in cases which break the guarantees of that filesystem. The debate is really about the proper guarantees on metadata and data write, deep in the filesystem, specifically the updating of the metadata before the data, as Mr. Tso pointed out.

      --dave

      --
      davecb@spamcop.net
  37. Re:Exactly by doshell · · Score: 1

    This is a design decision, and it is a problem of the filesystem, no matter how much they try to blame it on "poorly written applications". Applications should be able to do whatever they want. It is the job of the filesystem to accurately record it. Period.

    The job of the filesystem is to provide system calls whose behavior has been clearly specified. Now point to me where SUS or POSIX or your favorite Unix standard says, e.g., that write(2) ensures data has been flushed to disk upon return.

    The write(2) manpage on my Linux system says

    A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.

    And the situation is similar for many other filesystem-related system calls.

    If the applications are making wrong assumptions about what the system calls they use provide, they are indeed "poorly written". And the filesystem shouldn't get the blame for that.

    --
    Score: i, Imaginary
  38. raid controllers don't fake it by Nicolas+MONNET · · Score: 1

    They use battery-backed cache. The data is written and stored, just not on disk yet. The battery is supposed to last a couple days. If you need to shut a server down for longer than that ... well just don't yank the power cord, perform a clean shutdown.

    1. Re:raid controllers don't fake it by DigiShaman · · Score: 1

      I'm not sure if it's a fake sync or not, but RAID controllers must act as the mediator as the OS will not directly access the individual drives themselves. So in order to prevent buffer under/overruns of the RAID cache, I can imagine it faking syncs.

      As for the battery. Yes, it's used for when the OS hasn't made a call to perform a shutdown. Things like a power outage or kernel panic will cause data loss without it. Data of course will be flushed from cache back to disk upon system power up.

      FYI (and OT), I just learned a few days ago that I could pull the cache ECC DIMM from a Dell PERC card and install it on the motherboard. From there, memory diagnostics can be run. Now if only they had this diag embedded in the RAID card itself, but I digress...

      --
      Life is not for the lazy.
    2. Re:raid controllers don't fake it by greg1104 · · Score: 2, Informative

      If your battery-backed RAID controller ever fakes a fsync it is fundamentally broken or misconfigured. When the cache is filled with a write backlog and you try to write something else, that write will block until there is free space. Same as any other write cache that fills up.

      When cache space is available to cache the write again, the data goes into there, and then a fsync request after it can then return success.

  39. Re:Exactly by Jane+Q.+Public · · Score: 1

    That is the point: Ext4 greatly increases the delays, and thereby increases the risk of something going wrong. Sure, it is a tradeoff... but it is beginning to appear that Ext4 traded off a bit too much.

    "And let's face it: fsync/fdatasync are not really a secret to any competent developer."

    I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.

  40. Re:Exactly by xenocide2 · · Score: 1

    AFAIK, nobody's calling for ext4 as default. Just calling for people to test it. This is a report in a development version of Ubuntu, after all.

    --
    I Browse at +4 Flamebait

    Open Source Sysadmin

  41. small files, fsync and journal=data by Nicolas+MONNET · · Score: 1

    If I understand things correctly, you take a significant hit when writing lots of small files and fsyncing after each of them, except when you're journalling data. But in that case you take a hit when writing big files, since data has to be written twice (first in the journal, then when the journal is flushed).

  42. Bad defaults by Big_Mamma · · Score: 1

    This is a classic case of bad defaults. Yes, you will always have a trade off between performance and security, but going for either extreme is bad usability!

    People expect that, without explicit syncing, the data is safe after a short period of time, measured in seconds. The old defaults were: 5 seconds in ext3; in NTFS, metadata is always flushed and data is flushed ASAP, but with no guarantees. In practice, people don't lose huge amounts of work.

    What happened is that the ext4 team thought waiting up to a *minute* to reorder writes is a good idea - choosing the extreme end of performance.

    My question is: WHY? Does it really matter to home users that KDE or Firefox starts 0.005 seconds faster? Apparently, the wait period is long enough to have real-life consequences even with a limited number of testers; imagine what happens when it gets rolled out to everyone. On servers, it's redundant. Data is worth much, much more than anything you hope to gain, and SSDs, battery-backed write caches on controllers and SANs have taken care of fsync()s already. If you run databases, those sync their disks anyway, so you just traded a huge chunk of reliability for "performance" on stuff like /home, /var/mail and /etc.

    The "solution" of mounting the volume with the sync-everything flag is just stupid. Yay, let's go for the other extreme - sync every bit moving to the disk. Isn't it already obvious that either extreme is silly?

    Just set innodb^W ext4_flush_log_at_trx_commit on something less stupid already, flushing once every second shouldn't kill any disk. Copy Microsoft for config options:
    * Disable flush metadata on write -> "This setting improves disk performance, but a power outage or equipment failure might result in data loss".
    * Enable "advanced performance" disk write cache -> "Recommended only for disks with a battery backup power supply" etc etc.
    * Enable cache stuff in RAM for 60s -> "Just don't do it okay, it's stupid."

    1. Re:Bad defaults by 0123456 · · Score: 2, Interesting

      The old defaults were: 5 seconds in ext3; in NTFS, metadata is always flushed and data is flushed ASAP, but with no guarantees. In practice, people don't lose huge amounts of work.

      Actually, I've lost multi-gigabyte files on NTFS; in one particular case I left IE downloading a game installer overnight, heard it beep around 8am to tell me it had completed, and then the power went out a couple of hours later before I got up. The file system was magically 'consistent' after the power came back and it rebooted, but it achieved that by deleting over two gigabytes of my data.

      Modern file systems may be a bit faster than FAT32, but they're shit when it comes to reliably storing data.

      In this case, yes, the KDE developers are retarded, but if the ext4 developers want ext4 to become the default filesystem for Linux, they need to make it work with retarded developers. 'But POSIX says we can do this' is worthless if it loses large amounts of user data; heck, you can easily guarantee 'file system consistency' by simply reformatting the disk on every reboot, but your users would be pretty damn pissed.

    2. Re:Bad defaults by snemarch · · Score: 1

      "Advanced performance" doesn't mean enabling disk write cache, it means making FlushFileBuffers do nothing (stu-pi-di-ty). Disk write caching is called... *drumroll*... "Enabled write caching on the disk".

      IIRC, the "Optimize for performance" vs. "Optimize for quick removal" settings determine whether to use the filesystem cache or not.

      --
      Coffee-driven development.
    3. Re:Bad defaults by Anonymous Coward · · Score: 0

      This is clearly an application bug.
      However, cutting down the power of your PC right after a move operation on NTFS will result in the file vanishing 100% of the time.

    4. Re:Bad defaults by Trogre · · Score: 1

      Clearly you should have gotten up at 8:30am :p

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
  43. Re:excuses don't get you world domination by Anonymous Coward · · Score: 0

    This AC is spot on! Well said sir/madam!

  44. To Anonymous Coward: by Jane+Q.+Public · · Score: 1, Insightful

    "... the standards are particularly clear about what is guaranteed and what is not."

    That still does not make it any less of a filesystem limitation! Are we speaking the same language?

    1. Re:To Anonymous Coward: by Bronster · · Score: 4, Informative

      mount -o sync. Enjoy your slow returns and strictly ordered writes.

    2. Re:To Anonymous Coward: by gweihir · · Score: 1

      It is a general limitation for any modern filesystem, yes. And therefore any good developer knows it and knows how to deal with it. Face it: The KDE people messed up.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    3. Re:To Anonymous Coward: by LWATCDR · · Score: 2, Insightful

      It isn't a file system limitation. And here is why.
      1. The POSIX standard specifies that writes may be delayed. Every modern file system may delay writes.
      2. The POSIX standard then gives you a way to flush the buffer at the time of the program's choosing. It is called fsync(). If the programmer had called that well-documented function then all would have been well.
      You have the best performance possible and you can ensure that the file is flushed before you do something else.
      The file system didn't cause this bug. The posix spec didn't cause this bug. The programmer that didn't use the tools as documented caused his own bug.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    4. Re:To Anonymous Coward: by Jane+Q.+Public · · Score: 1

      It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault.

      Why is the data that was not yet flushed to disk still accessible (albeit incorrectly) via the filesystem? That IS a limitation, and a rather large one. No matter whether it is in the specification or not, it is still a limitation. The programmers might have screwed up... but this is still a filesystem limitation.

      If my airplane is only designed to pull 3 Gs before the wings come off, and I try to do a 4 G maneuver... I may have screwed up, but it's still a limitation of the airplane.

    5. Re:To Anonymous Coward: by LWATCDR · · Score: 1

      "It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault. "
      Funny that isn't what you said.
      "Blaming it on the applications is a cop-out. The filesystem is flawed, plain and simple. "

      But to use your airplane analogy. If you fill the planes tanks with water then ram it into a brick wall and it doesn't come out the other side without scratching the paint then that is a limitation.
      The problem is that to make a file system work the way you want it to you would pay a big price in performance to make up for programmer incompetence.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    6. Re:To Anonymous Coward: by Jane+Q.+Public · · Score: 1

      Yes, that is what I wrote, and after having discussed the issue with several people here I would qualify that to some degree, but not much. Clearly the programmers did mess up, but also clearly the filesystem has some serious limitations.

      But even though you seem determined to use ridiculous examples, the answer is yes, the failure of the plane to come out the other side would indeed be a limitation of the plane. However, it is a limitation that any reasonably sane adult would already know without having to ask anybody or to look it up in the API documentation.

      "The problem is that to make a file system work the way you want it to you would pay a big price in performance to make up for programmer incompetence."

      Not necessarily. For example, how much overhead would it take to flag buffered data so that it could not be accessed by an application again until it had been written? It would take some fairly sophisticated mapping. Enough to make it impractical? I don't know. But that would be a partial solution to the problem that introduces no delays other than the aforementioned overhead. Then the writes could be prioritized such that those areas that have access requests waiting in the queue are written out first. More overhead. Too much? I don't know. I don't design filesystems. But I am not convinced that this is an insurmountable problem.

      I mean, the issue here is actually pretty simple: somebody makes changes to some files, then tries to access the changed data before it is written to disk. The filesystem allows them to do this. Why? It does not seem to me that it would be very difficult to just lock those areas until the changes are written, pretty much the way databases lock records that are being accessed. Databases do this efficiently; why not filesystems?

    7. Re:To Anonymous Coward: by Jane+Q.+Public · · Score: 2, Insightful

      As it turns out, the point is probably moot. As someone else has pointed out, the bug report itself (not TFA) makes it clear that the trashed data was, in fact, caused by a system crash and not by filesystem access per se. TFA and the headline both strongly implied otherwise, but as it turns out, this is a non-issue.

    8. Re:To Anonymous Coward: by swillden · · Score: 2, Informative

      It most definitely is a filesystem limitation.

      No, it's not. The file system is perfectly capable of making sure all your writes hit the disk as soon as possible.

      Just mount it with the 'sync' option.

      If you want the significant performance benefits of delayed writes, however, you should not use 'sync' and accept that, with Ext4, write() works the way the documentation says it does.
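
      For code that wants synchronous behaviour on one file rather than on a whole mount, a per-file sketch using the O_SYNC flag mentioned elsewhere in this thread might look like this. It assumes a Unix-like system where Python exposes os.O_SYNC; append_synchronously is a hypothetical name.

```python
import os


def append_synchronously(path: str, data: bytes) -> None:
    """Append data to path using a descriptor opened with O_SYNC."""
    # With O_SYNC, each write() is meant to return only after the data
    # has been handed to the underlying storage (assumption: Unix-like OS).
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

      The cost is the same kind of slowdown discussed above, just confined to the files that are opened this way.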

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    9. Re:To Anonymous Coward: by Antique+Geekmeister · · Score: 1

      Why not send it out to be engraved on stone tablets while you're at it? I overstate, but for filesystems that get a lot of small writes, the '-o sync' introduces a very serious performance hit.

    10. Re:To Anonymous Coward: by scotch · · Score: 1

      You keep saying it's a filesystem limitation in some completely meaningless way in which all things are limited. Congratulations, filesystem, you are among the set of things which have limitations, of which all things are a member. What are you trying to prove, other than that you can't admit that you are wrong?

      --
      XML causes global warming.
    11. Re:To Anonymous Coward: by gzipped_tar · · Score: 1

      Call me retarded, but following your argument I can't grasp what you mean by "limitation" and/or your use of the genitive ("filesystem's"). Suppose I designed a filesystem that is heavily hyped to be "Teh Filesystem for Dummies!" and as a result this filesystem (RetardFS, let's call it ;) has the highest percentage of n00bs in its user demographics among all filesystems. Do I call its "n00biness" a "filesystem's limitation"?

      OK forget about my poor analogy -- it's for the lulz. Let's call a spade a spade: the stuff we're discussing here is an application limitation. Whether it's *also* implying a so-called "filesystem limitation" (whatever it means) may not be as clearly cut as you think it is. Maybe I could say it depends on your definition of "'s".

      BTW your airplane analogy sucks IMNSHO. It doesn't sound like saying "it's still a limitation of the airplane" to me. It implies *everything* has some inherent limitations, which is right but pretty irrelevant to the particular problem here.

      --
      Colorless green Cthulhu waits dreaming furiously.
    12. Re:To Anonymous Coward: by gzipped_tar · · Score: 1

      *Everything* has limitations. That's why engineering is necessarily an art. But by no means do the inherent limitations of something justify the misuse of it. Your car has a limited maximum load of 4t. You overload it to 5t and it stops working. Your car has a limitation. But that's something everyone knows and it's not the car that is supposed to be blamed for the failure.

      I replied to another of your posts, basically saying the same thing but in a harsher language. I apologize for that.

      As for your suggestion of improving the filesystem, I'm no expert either, but I guess the problem could boil down to "filesystem is not database" and "read/write is not query/commit". Filesystems are built to support the very basic IO operations. It's up to the app developers to implement proper cache control for their specific applications, whether it's DB-like or not.

      --
      Colorless green Cthulhu waits dreaming furiously.
    13. Re:To Anonymous Coward: by Bronster · · Score: 1

      See my parent who was claiming that it's a filesystem limitation that you don't get sync behaviour.

      Dur. That's slow, so you don't do that if you want speed - but don't come whining that you can't have your cake and eat it too.

    14. Re:To Anonymous Coward: by Antique+Geekmeister · · Score: 1

      ext3 used to work reasonably well for such short-term operations. The change is profound, and noticeable. They're whining because Ted, understandably, didn't give them fries with their Happy Meal, and gave them the "fruit pouch" of better write performance.

    15. Re:To Anonymous Coward: by Anonymous Coward · · Score: 0

      Wtf? We are arguing about a feature that speeds the whole system up by several orders of magnitude, by using RAM as write cache, so that applications don't have to wait for slow hard drives, and you suggest "fixing" it by having the system lock the file, forcing applications to wait for the same slow hard drive.

      That's a much more complicated attempt at achieving the same thing as -o sync does in the first place. Except, it won't write things out right away, so you'll still lose data you thought were saved, and it will be even slower, because the data will be locked for much longer.

    16. Re:To Anonymous Coward: by LWATCDR · · Score: 1

      Well, as for how much overhead, I would have to say that it would be too much for what you gain.
      If you really want ACID then use a database; that is what they are for.
      In this case the programmer messed up on a few levels, IMHO.
      First, he should have written to a temp file and then called fsync, then renamed the original to a .bak and renamed the temp into place.
      Second, he should have had sanity checks on the data in the file. If the data fails the sanity checks then you fall back to your bak file.
      Third, if the bak fails sanity checks then you use some safe defaults and allow the user to reconfigure.
      You really don't want an end user locked out of their GUI.
      What I find so interesting is the evolution of how we store data files.
      When I started programming we used to use text files for configuration. Then we found out that we could just write out blocks of memory containing data structures to a file. It was fast and it worked until something went wrong.
      Now we have gone back to text/xml files because we can do sanity checks and prevent crashes. Plus it is easier to document and convert to new versions.
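
      The three steps above can be sketched roughly as follows, assuming Python on a POSIX-like system and a JSON config file. Everything named here (save_config, load_config, DEFAULTS, the ".tmp" and ".bak" suffixes, and "is a JSON object" as the sanity check) is a hypothetical stand-in, not how KDE or GNOME actually store settings.

```python
import json
import os

# Hypothetical safe defaults used when both files fail the sanity check.
DEFAULTS = {"theme": "default"}


def save_config(path: str, config: dict) -> None:
    """Write to a temp file, fsync it, keep the old file as .bak, then rename."""
    tmp, bak = path + ".tmp", path + ".bak"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(config, f)
        f.flush()
        os.fsync(f.fileno())   # new contents are flushed before any rename
    if os.path.exists(path):
        os.replace(path, bak)  # original -> .bak
    os.replace(tmp, path)      # temp -> real name


def _load_checked(path: str):
    """Return the parsed config, or None if it is missing or fails the check."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):  # missing, unreadable, empty, or not JSON
        return None
    return data if isinstance(data, dict) else None


def load_config(path: str) -> dict:
    """Try the main file, then the .bak file, then fall back to DEFAULTS."""
    for candidate in (path, path + ".bak"):
        data = _load_checked(candidate)
        if data is not None:
            return data
    return dict(DEFAULTS)
```

      Between the two renames there is a short window in which only the .bak exists, which is why load_config tries it second. Renaming the temp file directly over the original is a common alternative; the .bak step is kept here because it is what the comment describes.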

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
  45. Re:Exactly by gweihir · · Score: 4, Insightful

    "And let's face it: fsync/fdatasync are not really a secret to any competent developer."

    I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.

    And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.

    I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  46. Re:Exactly by gweihir · · Score: 1

    AFAIK, nobody's calling for ext4 as default. Just calling for people to test it. This is a report in a development version of Ubuntu, after all.

    Ah, ok. Then the story should perhaps have been called "Experimental version of Ubuntu is not totally reliable"?

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  47. No, it isn't! by gbutler69 · · Score: 1

    Read the POSIX specification. Any developer of applications that need on-disk data integrity already knows this! The KDE/GNOME developers (and many others) have just gotten lazy about doing things properly. They now need to fix the Gnome/KDE libraries and applications to NOT do stupid things!

    If you don't understand this, you really need to refrain from talking about things you don't know anything about.

    --
    Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
  48. Things that should be improved ... by hattig · · Score: 2, Interesting

    Bah. Maybe all computers should come with a single-cell battery, for a couple of minutes of backup power.

    As soon as power fails to the system and it resorts to battery, all calls to write() should also call fsync(), even if that slows the system down.

    Never mind an option that implicitly calls fsync() if it hasn't been called in the past 3 seconds, for a minimal performance hit. If you have a specific application that doesn't want fsync() then you can disable that feature, but clearly on a consumer box, no UPS, potentially dodgy hardware and drivers, it makes sense. 150 seconds without a sync, just dumping into a buffer for writing ... sheesh.

  49. Everyone keeps missing the point by Anonymous Coward · · Score: 0

    The problem isn't that a pending file UPDATE is lost, it's that the ORIGINAL is lost, too.

    As the 2nd post points out (and I think that's about the only post so far that got it right) the FS shouldn't record the journal update before the actual file update -- else the original file is lost!

    1. Re:Everyone keeps missing the point by Anonymous Coward · · Score: 0

      Wrong. RTFA.
      The file contents are lost because the application truncated the file. The journal has nothing to do with this issue (actually, setting journal=data will completely fix this issue).
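
      A minimal Python sketch of the truncation window described above. The file name app.conf and both helper functions are made up for illustration; the point is only that opening a file for writing the naive way empties it before any new data exists.

```python
import os
import tempfile


def size_after_truncating_open(path):
    """Open path the naive way and report its size before anything is written."""
    with open(path, "w") as f:
        # Mode "w" truncates on open: the old contents are already gone here,
        # and the new contents have not been written yet.
        size_in_window = os.path.getsize(path)
        f.write("new settings\n")
    return size_in_window


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.conf")  # hypothetical config file
        with open(path, "w") as f:
            f.write("old settings\n")
        return size_after_truncating_open(path)
```

      A crash inside that window, or before the delayed write reaches the disk, is how a config file can end up at 0 bytes.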

  50. Why XFS was never an option by dtfinch · · Score: 1

    Ext3's commit interval was one of its best features.

    Sure, it doesn't have to make guarantees when the app doesn't explicitly sync, but losing data 1% of the time in an outage is better than losing data 99% of the time in those cases.

    Whenever I saw people complaining of losses in XFS that wouldn't have happened in ext3, the "doesn't have to guarantee unless synced" thing was brought up as an excuse.

    1. Re:Why XFS was never an option by Zan+Lynx · · Score: 1

      XFS is tons faster than Ext3. Ext3 is nearly the slowest file system on the planet. Defaulting to ordered mode made it worse.

      Using anything that called fsync could cause Ext3 up to 30 second delays because all of your much loved ordering guarantees forced a complete flush of the entire journal and all related data.

      It's not an excuse for XFS, it is just the choice made between perfectly slow and safe or ridiculously fast and not safe at all. XFS is quite fast and mostly safe. Ext3 is safe and slow.

    2. Re:Why XFS was never an option by QuoteMstr · · Score: 1

      So what if ext3 is relatively slow? It's not the fucking bottleneck . If you're tuning a mailserver, a compile box, or a database machine, and you show with fucking benchmarks that filesystem metadata updates are causing significant performance problems, then use a different filesystem. But for most people, filesystem writes are not the bottleneck, and the data integrity lost by giving up ordered writes just isn't worth it.

      With apologies to Benjamin Franklin, those who would give up essential data integrity for a little unimportant performance deserve neither.

  51. amirulbahr: by Jane+Q.+Public · · Score: 1

    Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.

    Perhaps you would benefit from reading TFA? Then maybe you would know what we are discussing and see the point after all.

    1. Re:amirulbahr: by amirulbahr · · Score: 1
      FTFA:

      The KDE and GNOME desktop applications often read and write a large number of small files (for example, the configuration files for your personal settings). If the system crashes there may not be enough time for the data to be allocated and written to the hard drive - under ext4, the files may be truncated.

      Clearly you must have read TFA if you were advising me to do so. So I can only conclude that you are in fact entirely clueless on the subject matter if you did RTFA but completely failed to grasp what is going on.

      Of course, I mean you no personal offence, and you don't have to take my word for it, but you really don't know what you're talking about. You should not be posting so much about this topic. If you spend a few hours, reading up on and perhaps even implementing a basic file-system, you will personally gain far more than this incessant spam-posting.

    2. Re:amirulbahr: by Zakabog · · Score: 1

      Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue.

      From TFA:

      "If the system crashes there may not be enough time for the data to be allocated and written to the hard drive"

      From the bug report in TFA:

      "Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop."

      The filesystem behaved exactly as it should have. It was the application that wrote to the files without an fsync that caused the files to be lost when the system crashed.

      Perhaps you would benefit from reading TFA?

    3. Re:amirulbahr: by Jane+Q.+Public · · Score: 1

      "... under ext4, the files may be truncated. This is because of delayed allocation. When a new file is created, the change is noted in the journal, but the data isn't written to the disk for a new file for anything between 45 and 150 seconds. The file system then catches up, allocating space for the file and writing the data."

      Further:

      "The report describes a crash occurring shortly after the KDE 4 desktop files had been loaded, resulting in the loss of all of the data that had been created, including many KDE configuration files."

      This was a crash that did NOT occur due to power failure or some other system problem! As has been discussed already in this thread multiple times, the problem was due to a program that wrote lots of small files and then tried to access them again before they were written to disk.

      I don't mean any personal offense either, but if you were to read the WHOLE thing, in context, you would see that I am correct in this, and that you are the one who is off-base. Go ahead... start at the top and read what people are writing about.

    4. Re:amirulbahr: by Jane+Q.+Public · · Score: 1

      See my comment to amirulbahr just above.

      I did in fact read TFA, but I had not read the bug report. As you point out, the bug report itself does mention this. However, the article itself, as well as the headline of this topic (before it was changed), both strongly implied otherwise.

      The context of most of this whole topic assumes that the problem was caused by the programs behaving as I stated... and I only stated it that way because others had. I shall not apologize for discussing the topic in the prevailing context.

      As it stands, this is now a non-issue, because the problem was in fact caused by a system crash unrelated to the supposed "software problem". Other cached filesystems will also inevitably trash data under these circumstances, sometimes whether sync was used or not. However, I must point out that most journaling filesystems will not cause multiple 0-byte files; changes are either written or they are not. There does in fact appear to be a problem here between the "journaling" done by Ext4 and its writes.

    5. Re:amirulbahr: by amirulbahr · · Score: 2, Informative
      I assure you it is you who has mis-understood the situation. From the bug report referenced in the summary:

      Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes. For example Plasma and some of the KDE core config files were reset. Also some of my MySQL databases were killed...

      My EXT4 partitions all use the default settings with no performance tweaks. Barriers on, extents on, ordered data mode..

      I used Ext3 for 2 years and I never had any problems after power losses or system crashes.

      The crash was not caused by ext4 but by something else. The file system was in a consistent state because of the journal. Some data had not yet been written to disk, because of the delayed write and was thus lost.

      Maybe you need to take a break, or have a coffee, or get some sleep or something. But you really are way off and posting way too much on this topic that you are not well informed of.

      This is not a bug, not a flaw, not a limitation. You can write and then read regardless of whether or not actual disk commits take place. The file system takes care of that for you. If you're doing file I/O, and you want to call yourself half-way competent, then you should have some clue about the possibility that the underlying file-system will be doing delayed writes. If you are writing critical applications for which this may cause issues then you might decide to throw in some fsync calls (or their equivalent in whatever platform you are using).
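
      A small Python sketch of the write-then-read point, assuming a throwaway file named scratch.dat (a made-up name). It uses os.write so the data goes straight to the kernel rather than sitting in Python's own buffer, and it never calls fsync.

```python
import os
import tempfile


def read_back_without_sync(payload):
    """Write payload with no fsync, then read it back through a second handle."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "scratch.dat")  # hypothetical scratch file
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, payload)  # handed to the kernel; nothing forces it to disk
            with open(path, "rb") as reader:
                return reader.read()  # served by the kernel, committed or not
        finally:
            os.close(fd)
```

      The read succeeds because the kernel answers it from its own buffers. That says nothing about what would survive a crash.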

      I know you have learnt something today. Glad to help out.

    6. Re:amirulbahr: by Jane+Q.+Public · · Score: 1

      Please see my reply below. I do understand this now, having read the bug report. However, I must point out that the article itself, as well as the headline for this topic (before it was changed), both strongly implied that the situation was different from what it actually is.

      So there was indeed a misunderstanding on my part, but I believe it was understandable, given the wording of the article and the context of much of the discussions here.

      In any case, I did indeed learn something: I should have looked at the original item (the bug report), and not a poorly-worded and misleading article.

      Given the situation as it is, most of your other comments become obvious facts and somewhat redundant.

      I do not dispute that a programmer who is doing non-trivial disk I/O should sync when appropriate. However that *IS* still a limitation of the filesystem, and no amount of arguing on your part will make it otherwise. As I have stated elsewhere, several times now: it may be a common limitation, and it may even be a reasonable limitation, but for all that it is still no less a limitation.

  52. Re:Exactly by Dog-Cow · · Score: 2, Insightful

    You are an idiot. The design of the POSIX API dictates that fsync (or equivalent) is required to ensure data is flushed to disk. This has been true forever. If an abstraction in an i/o library is not using the API correctly, it is the fault of the library.

    You are correct that the user of the abstraction should not care, but you are putting the blame in the wrong place. The whole point of using an abstraction is to hide details such as this. If the library author is too stupid to learn the API he is abstracting that is HIS fault.

  53. Re:Exactly by asretfroodle · · Score: 1

    If you can't be bothered learning how the API works, then how about you use a library which takes care of it for you?

    Just because you're using a high level language doesn't mean you can ignore learning its API.

    For example, from the Python docs:

    os.fsync(fd)
    Force write of file with filedescriptor fd to disk. On Unix, this calls the native fsync function; on Windows, the MS _commit function. If you're starting with a Python file object f, first do f.flush(), and then do os.fsync(f.fileno()), to ensure that all internal buffers associated with f are written to disk. Availability: Unix, and Windows starting in 2.2.3.

    If you're writing applications robustly, this is something you need to be aware of.
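
    The docs quote above boils down to two calls. A minimal sketch, where write_durably and settings.ini are illustrative names and not part of any library:

```python
import os
import tempfile


def write_durably(path, text):
    """Write text to path and ask the OS to push it to the storage device."""
    with open(path, "w") as f:
        f.write(text)
        f.flush()             # Python's user-space buffer -> kernel
        os.fsync(f.fileno())  # kernel buffers -> disk, as far as the OS can tell


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "settings.ini")  # hypothetical config file
        write_durably(path, "theme=dark\n")
        with open(path) as f:
            return f.read()
```

    Calling os.fsync without the preceding flush would sync a file that may not yet contain the data still sitting in Python's buffer, which is why the docs list both steps.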

  54. Re:Bull... by Anonymous Coward · · Score: 2, Insightful

    Optimize the reads all you want, but those writes better damn well happen before the calls that say data is written return.

    And this is where most of the confusion comes from. There is a difference between a logical write and a physical write. When your write call completes, it says the logical write has completed. It says nothing about the physical write. Depending on file system semantics, your physical write may have already completed too - or shortly after. If you must explicitly ensure the physical write is complete then you must explicitly ensure it via code - otherwise the physical write can only be assumed. And this is where the less informed seem confused by their own poor expectations and ignorance. Unless they are actually following their write with some sort of file system synchronization call, ignoring their ignorant expectation, they have no right what-so-ever to assume the data will still be there in the face of a system crash. It's a very poor coder who falls into that trap.

    Good programmers know this and have known it for tens of years. Good database programmers know this. Good file system developers know this. Those that are outraged by their own ignorance are either not programmers or are not good programmers.

    And lastly, I'll point out, which is exactly why Ts'o pointed it out - use a solution where its foundation is built by coders who already understand the proper way to ensure data is safe on the file system - for example, use a database. While I don't consider the use of a database to be an ideal solution here, it does a wonderful job of highlighting the crappy design both KDE and GNOME have used to store configuration data - and how unconcerned they are about data loss and data corruption. If the developers of KDE and GNOME don't give a crap about your configuration data then how on earth can you possibly be upset at the file system for doing what it's supposed to do?

    In short, both KDE and GNOME need to give a crap about how, when, and why they write configuration data. Since they don't care about data integrity, you now know who you should be angry at. Here's a hint, and it doesn't have anything to do with the file system.

  55. Sounds Familiar by slyn · · Score: 1

    So what you're saying is: It's not a bug, it's a feature?

    Where have I heard that before? Hmm.....

    1. Re:Sounds Familiar by Lucid+3ntr0py · · Score: 1

      So what you're saying is: It's not a bug, it's a feature?

      Where have I heard that before? Hmm.....

      Everyday when I support Lotus Notes

  56. Wow. I learned something useful out of this... by rickb928 · · Score: 1

    And I asked my buddy who writes *nix disk drivers at a very well-known outfit. He was a little shocked that someone would measure commit time in minutes. He writes mostly RAID drivers now, for server hardware, and thinks even single-digit seconds is chancing it, even with battery-backed cache (which his hardware does NOT have, BTW). He is of the opinion that this is a terrible mistake, and someone should change these defaults and issue the patch, quietly, so no one gets hurt more than they already have. He says he wouldn't have done what was done, but then again, he spends his days troubleshooting race conditions and interrupt conflicts, what does he know... And he is getting old before his time. I tell him he oughta go into display drivers and save his life...

    But this reminds me of the problems of networked drives - delayed writes on Windows servers often lead to corruption and lost data if the network connection broke and then the server borked. Some legendary fiascos I presided over, and very unhappy people who didn't understand the concepts of networking and Microsoft's brain dead implementations. Lots of lost sleep.

    So does this also potentially affect NFS and SAMBA shares? Add in the possibility of network connection dropouts, and this sounds worse than ever.

    Are we making progress yet?

    --
    deleting the extra space after periods so i can stay relevant, yeah.
    1. Re:Wow. I learned something useful out of this... by larry+bagina · · Score: 1

      NFS has an option (on the server side) to be sync or async.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

  57. Mod parent up by betterunixthanunix · · Score: 2, Informative

    Much as I love to fall back on the "POSIX says that this could be the case so it is OK that it is the case" excuse, it really does not fly in this case. POSIX doesn't allow this sort of behavior because it is a "good" thing to do, it allows it because there are systems where this is an OK thing to do -- systems intended to manage databases, systems that are heavily verified and have backup power supplies, etc. This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth. EXT4 should not be used in a desktop system if it can cause data loss when the unexpected happens, regardless of the technical merits of writing to small configuration files.

    --
    Palm trees and 8
    1. Re:Mod parent up by Anonymous Coward · · Score: 0

      I bet the writers of the POSIX spec never even considered that their spec allows this kind of behaviour.

  58. Bzzzt! by gbutler69 · · Score: 1

    Thank you for playing, but, unfortunately, you are the idiot. You do not understand the purpose of the file-system, what journaling is, nor do you understand proper use of an API.

    Please, SHUT THE FUCK UP!

    --
    Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
  59. Re:Exactly by xenocide2 · · Score: 1

    Well, the dev version's slated for release shortly. But even then, ext4 will not be default. On the other hand, I have heard calls to make ext4 default for the next Fedora, whenever that is. Please direct your anger and abuse their way, please ;-)

    --
    I Browse at +4 Flamebait

    Open Source Sysadmin

  60. The file system is at fault here by Anonymous Coward · · Score: 0

    at least partially. The commit interval of ext4 seems too long to me. The developers seem to have sacrificed reliability for speed.

    If you want a quick file system that does not write to disk, use a RAM disk.

    If you want persistent data, write to a disk. This should not be a matter of how the apps are written.

  61. Learned something today by drolli · · Score: 2, Informative

    Citing from the message Ts'o posted:

    ----
    So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.
    ----

    And indeed, the NOTES section of "man -S2 close" explicitly notes what is not mentioned in the other sections. Up to this day I also lived under the assumption that a close implies an fsync. Now I have to change my programs where it matters. All the idiots who scream here that the OS is doing something wrong: no, it's not. AFAIU it's following the defined behaviour, which is what I expect an OS to do. It should NOT try to magically guess where I forgot to fsync my files.
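
    For the fsync()/fdatasync() choice mentioned in the quote, a minimal Python sketch. The names append_record and events.log are made up; os.fdatasync is a real call but is not present on every platform, so the sketch falls back to os.fsync where it is missing.

```python
import os
import tempfile

# os.fdatasync syncs file data without forcing every metadata update.
# It does not exist on some platforms, so fall back to os.fsync there.
_sync_data = getattr(os, "fdatasync", os.fsync)


def append_record(path, line):
    """Append one line and sync it before returning; close() alone would not."""
    with open(path, "a") as f:
        f.write(line)
        f.flush()               # user-space buffer -> kernel
        _sync_data(f.fileno())  # kernel -> disk


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "events.log")  # hypothetical log file
        append_record(path, "first\n")
        append_record(path, "second\n")
        with open(path) as f:
            return f.read()
```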

  62. Re:Bull... by Qzukk · · Score: 1

    those writes better damn well happen before the calls that say data is written return.

    A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.

    Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a previous write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.

    A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2).

    The manpages agree with what you are saying, the problem is that the application developers who forgot to/don't want to use fsync() don't, because the *sync() functions are the only ones that say anything has been written, and they do in fact stop and wait until the data really is written.
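
    A rough Python sketch of what the quoted manpage text asks for: handle short writes, sync, and let a failing close() be seen. save_checked and blob.bin are illustrative names. In Python the os-level calls raise OSError instead of returning -1, so "checking the return value" becomes "not swallowing the exception".

```python
import os
import tempfile


def save_checked(path, data):
    """Write data, fsync it, and surface any error instead of ignoring it.

    Returns the number of bytes written. An OSError from write, fsync or
    close propagates to the caller.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    total = 0
    try:
        while total < len(data):
            # os.write may accept fewer bytes than offered, so loop on its result.
            total += os.write(fd, data[total:])
        os.fsync(fd)
    except BaseException:
        # Already failing: close quietly so the original error reaches the caller.
        try:
            os.close(fd)
        except OSError:
            pass
        raise
    os.close(fd)  # an error reported only at close is raised, not dropped
    return total


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "blob.bin")  # hypothetical data file
        written = save_checked(path, b"x" * 10000)
        return written, os.path.getsize(path)
```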

    --
    If I have been able to see further than others, it is because I bought a pair of binoculars.
  63. MOD PARENT DOWN by mkcmkc · · Score: 1

    and definitely don't hire to write safety-critical software...

    --
    "Not an actor, but he plays one on TV."
  64. Ok... Two things by Skurge357 · · Score: 1

    1) You can adjust your commit interval and 2) Look into PC-BSD/Solaris, ZFS is fairly solid, from what experimentation I've done. Wish Linux could use it properly. Loving the BSD implementation of ZFS.

    1. Re:Ok... Two things by Conley+Index · · Score: 1

      Look into PC-BSD/Solaris, ZFS is fairly solid, from what experimentation I've done.

      You have not tested too much. Have a look at the FreeBSD mailing lists. ZFS is not stable (and that is not only for the BSD interpretation of "stable"). It needs major tweaking to work on any RELEASE or STABLE. CURRENT has got a newer version that supposedly does much better, but recommending anyone new to the BSD world to go straight to CURRENT is insane. PC-BSD is not based on CURRENT. Moreover, you should be on amd64 or maybe sparc64 and have lots of RAM.

      FreeBSD 8.0 is scheduled for summer, which means that it is likely to come out this year...

    2. Re:Ok... Two things by Skurge357 · · Score: 0

      Well, I WAS running it on a laptop, not a server, so it is true that I didn't push it at all, but it was fast and stable while I was testing it. But I do have the 64-bit processor and 2GB of ram, so that may also have helped it look better than it is. I still think it's got a lot of potential.

  65. Hmm...think again! by gbutler69 · · Score: 1

    Yet the whole point of journaling filesystem is to protect against data loss.

    No, it isn't!

    --
    Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
  66. Don't write to files and your app will be fine by presidenteloco · · Score: 0

    According to TFA:
    Ts'o says that the application should be fixed so it does not write and rewrite small files. He advises that "this is really more of an application design problem more than anything else."

    Unbelievable that this guy is the main author of a file system. To paraphrase him:
    "My file system is awesome. Just don't expect that when you issue a file write command that the file system will ensure that the file will be written."

    Cool.

    --

    Where are we going and why are we in a handbasket?
    1. Re:Don't write to files and your app will be fine by Anonymous Coward · · Score: 1, Insightful

      Just don't expect that when you issue a file write command that the file system will ensure that the file will be written.

      Glad to know that someone's reading the manpages, since you aren't. Go back and read your write() and close() manpages, then come back and tell us that write() is supposed to ensure that the file will be written.

      Now, remount your filesystem -o sync, and come back and tell us WHY write() does not ensure that the file will be written by default.

    2. Re:Don't write to files and your app will be fine by swilver · · Score: 1

      -o sync is overkill. A filesystem should respect the order of commands given to it. I don't care when it syncs, as long as there's no gaps in things I told it to do. Syncing later steps before having synced steps before it is ludicrous. Either sync them all in the same big batch or none at all. Syncing just half of them at random is gonna wreak havoc. So, to put it more clearly: OpenTempFile -> Write -> CloseTempFile -> RenameTempFileOverOriginal should never result in: OpenTempFile -> CloseTempFile -> RenameTempFileOverOriginal. During a bad crash, I accept that the contents of one of these files may become corrupted, as writes may have been done in place and it would be excessive to have to log all content to the journal as well. I however do not accept that this can happen to dozens of files at the same time because the filesystem decided to sync future actions before syncing earlier steps.

    3. Re:Don't write to files and your app will be fine by swilver · · Score: 1

      (with formatting this time)

      -o sync is overkill.

      A filesystem should respect the order of commands given to it. I don't care when it syncs, as long as there's no gaps in things I told it to do. Syncing later steps before having synced steps before it is ludicrous. Either sync them all in the same big batch or none at all. Syncing just half of them at random is gonna wreak havoc.

      So, to put it more clearly:

      OpenTempFile -> Write -> CloseTempFile -> RenameTempFileOverOriginal should never result in: OpenTempFile -> CloseTempFile -> RenameTempFileOverOriginal.

      During a bad crash, I accept that the contents of one of these files may become corrupted, as writes may have been done in place and it would be excessive to have to log all content to the journal as well. I however do not accept that this can happen to dozens of files at the same time because the filesystem decided to sync future actions before syncing earlier steps.
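
      The OpenTempFile -> Write -> CloseTempFile -> RenameTempFileOverOriginal sequence above, sketched in Python with the one step the thread keeps arguing about added: an fsync before the rename. atomic_write and app.conf are made-up names, and the directory sync at the end is an assumption that only applies on POSIX systems.

```python
import os
import tempfile


def atomic_write(path, text):
    """Replace path with text, aiming to leave either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # data reaches disk before the rename does
        os.replace(tmp_path, path)  # rename over the original
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Sync the containing directory so the rename itself is persisted.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.conf")  # hypothetical config file
        with open(path, "w") as f:
            f.write("old\n")
        atomic_write(path, "new\n")
        with open(path) as f:
            return f.read(), sorted(os.listdir(d))
```

      Without the fsync, this is the pattern that left renamed files at 0 bytes in the bug report: the rename reached the journal before the data reached the disk.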

  67. Exactly. by aussersterne · · Score: 4, Insightful

    People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.

    Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:

    Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.

    It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.

    It's the bad outcomes that we're trying to mitigate by having a spec in the first place!

    So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.

    The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.

    --
    STOP . AMERICA . NOW
    1. Re:Exactly. by Ash+Vince · · Score: 0

      The most astounding thing from the original bug report is this:

      Also some of my MySQL databases were killed...

      This pretty much rules out me using Ext4 in a production environment. I have to admin some servers that have some damn big databases on them made up of some equally large tables. If I have to rebuild this from nightly backups or the replicated database just because the harddisk did not unmount properly or the box crashed then I am not going to risk it. Not when I know Ext3 handles the same situation flawlessly.

      I know I could switch to make the back up the main server but this will also take too much of my too precious time to consider as we have various stuff that needs to run against the replicated DB. I would still then have to spend too much of my time making sure the old master was back in the system as the new slave to make sure we still have the same level of resilience going forward.

      If I wanted a server crash during a disk write to cause the entire file to be corrupt, I would never have upgraded from Ext2.

      If I can choose between Performance and Reliability, I will choose Reliability every time.

      --
      I dont read /. to RTFA, I read /. to offend people in ignorance.
    2. Re:Exactly. by RAMMS+EIN · · Score: 3, Insightful

      ``It's not enough just to be true to spec;''

      Yes, it is. That way, you get what the spec says you get.

      It can even be argued that doing better than the spec is dangerous. After all, that is what got us this riot: things doing more than the spec said, people relying on that, and then getting angry when another implementation of the spec didn't have the same additional features.

      You can only assume that you get what the spec says you get. If you assume more, it's your problem if your assumptions are wrong. If you want more than the spec gives you, you either need to implement it yourself or get a new spec implemented.

      ``the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.''

      I don't think anyone jumped through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes. I think they jumped through hoops to get the best possible performance, while still being conformant to the spec. If this breaks applications that rely on behavior that isn't in the spec, it's because those applications are buggy.

      ``It's the bad outcomes that we're trying to mitigate by having a spec in the first place!''

      I agree completely. But we seem to differ in how this is supposed to work.

      I say that specifications can be used to avoid bad results by specifying exactly what can be relied on. Everything that is not in the specification is unspecified and thus cannot be relied on. Knowing this helps you write better software, because you know what you can assume, and what you have to write code for.

      You seem to be saying that having a specification means we want to avoid bad results, so whoever implements the specification must do their best to avoid bad results, no matter what it says in the specification. I find that completely unreasonable.

      --
      Please correct me if I got my facts wrong.
    3. Re:Exactly. by cheater512 · · Score: 2, Insightful

      And is that the woosh from what actually went wrong going over your head?

    4. Re:Exactly. by Anonymous Coward · · Score: 0

      I dont read /. to RTFA, I read /. to offend people in ignorance.

      Your sig matches your post wonderfully, as you're offending us with your ignorance in this matter.

    5. Re:Exactly. by davecb · · Score: 1

      POSIX is the minimum standard one must achieve to be called unix-like. In general, one wants to do better than that in Linux (;-))

      --dave

      --
      davecb@spamcop.net
    6. Re:Exactly. by Ash+Vince · · Score: 1

      Who cares what went wrong? If I can avoid this happening by using Ext3 I will stick with Ext3.

      I am not interested in blame, just a resilient system.

      --
      I dont read /. to RTFA, I read /. to offend people in ignorance.
    7. Re:Exactly. by afidel · · Score: 1

      The spec is a set of necessary conditions but for many people it will be a sufficient set, they expect a filesystem to be as bulletproof as possible in every situation.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    8. Re:Exactly. by TemporalBeing · · Score: 2, Insightful

      The spec is a set of necessary conditions but for many people it will be a sufficient set, they expect a filesystem to be as bulletproof as possible in every situation.

      The spec in any design is the final authority.

      For example, if the spec for a bridge crossing a river says that the bridge ought to hold 20 tons of weight, then it must do at least that. If the bridge collapses because you put 20 tons on and a speck of dust landed on top of it, then it doesn't matter - it still held to spec. If you were able to get 40 tons on before it collapsed all the better, but you were only ever guaranteed by the spec (and thus the designers) 20 tons.

      If the spec for an engine said it could handle 8k RPM and it blew up at 8001 RPM, it was in spec. If you managed to get it to 9k RPM great, but you were only guaranteed 8k RPM.

      That doesn't mean you don't build tolerance into the spec - e.g. 8k RPM +/- 5% - or try to exceed it where it makes sense, e.g. delivering 25 tons to ensure you have 20 tons and some leeway for safety. (After all stupid is as stupid does.)

      However, you can't fault the designers or engineers when the device lives up to spec and breaks because you (as the user) tried to exceed the spec and it failed.

      Same goes for software. If the software spec says "provides A at rate B" then you better expect that and nothing more. If you need something different, then find a device (or API or file system, etc) that meets your requirements.

      Pushing something beyond spec is not the problem of the spec designers - but of the users of the spec that expect it to exceed the spec.

      And, btw, specs that supposedly are "minimum" standard specs are still specs just the same. They allow a certain minimum that (with software) allows portability; if you want to do better you still need to find another spec that supports what you want to do. For example: POSIX guarantees a portability between Unix and Unix-like OS's; but if you want to do better than POSIX then you use the Linux POSIX spec or the Solaris POSIX spec (or BSD POSIX spec, etc.). You get what you want, but at the cost of some portability. Failing to do that is the failure of the user of the spec, not the writers of the spec.

      And just to be clear - by "user of the spec" I do not mean the people implementing the spec but the people using the software (or device) that implements the spec. In this case, not the implementors of ext3 or ext4, but the implementors of the software going above the ext3/4 spec to do something else.

      Furthermore the spec exists as a measurement to be able to tell when you've completed your job. If the spec says 10 tons and you get 11 tons you've finished the job; if you're only getting 9.99 tons you're not done. If you get 10.00001 tons you've got it. If it says 10 tons +/- 5%, then you might be done at 9.99 tons, but you really should go for the 10 tons + 5% just to be safe. Either way - once you've met the spec you're done. That doesn't mean you don't try to improve the spec and then make a better product; but there's no guarantee that will happen - the spec is the spec, and that's all you have to do - it's all you agreed to do to start with. (Think of it like a contract.)

      --
      Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    9. Re:Exactly. by rtfa-troll · · Score: 1

      The thing being missed here is that EXT3FS is essentially turning a problem that happens regularly into an intermittent, rare but still present problem. That's much worse because it means that the problem is less likely to be noticed in testing and fixed. For "robustness" (e.g. three to five nines reliability) we need most things to be predictable to something of the order of at least 10 nines. Ts'o is making a mistake here (hold on... excuse me whilst I perform my ablutions and sacrifice a beer bottle to the great kernel hacker god...): instead of fixing this so-called bug, what he should have done is put in a patch to break EXT3FS.

      --
      =~ s,(.*),<sarcasm>$1</sarcasm>,g if any_point_you_wish();
    10. Re:Exactly. by Anonymous Coward · · Score: 0

      Hear Hear!!

      And note that the POSIX standard is not a well designed and complete standard - it is merely as much as a bunch of fractiously competitive companies more interested in the superiority and uniqueness of their own products would agree to.

    11. Re:Exactly. by xous · · Score: 1

      Hi,

      First of all, if you're running database servers in a production environment any competent engineer would have a UPS in place on each power supply connected to a different circuit.

      This avoids the scenario of data not getting written due to a power failure.

      If you're running database servers in a production environment any competent engineer would be running stable software which would avoid 99% of crashes.

      EXT3 is not a replacement for a UPS or using stable software.

    12. Re:Exactly. by NoahsMyBro · · Score: 1

      In the real world, multiple UPSes, multiple power supplies, connected to different power circuits, don't always exist.

      UPSes are inexpensive enough, but redundant power supplies and electricians are costlier, and not always an option for an IT guy in a small business.

      Some businesses face financial constraints (amazing, I know!), and a decision to allocate money to having multiple power circuits run to the servers doesn't always happen.

      Your post comes across to me as very haughty and holier-than-thou, and doesn't seem to consider real-world situations.

    13. Re:Exactly. by Ash+Vince · · Score: 1

      Wow, a new sensible reply to a comment I had long since forgotten :)

      Just thought I would mention the servers in question are in a datacentre with 3 levels of UPS redundancy. Including one of ours in the cage with them. This is necessary since they host government data and we got audited when we were shortlisted for the contract. This included them sending out some experts to inspect the datacentre. They even went so far as to get the datacentre staff to pull up a few floor tiles so they could inspect the underfloor cable routing to our cage.

      Not sure what you mean by stable software, are you referring to Ext4? The point of my post was to say I would NEVER use it until they changed this "feature". Even if they do change it I am not likely to migrate any servers to Ext4 in the next 2 or 3 years. I am not likely to deploy new servers using it for a similar length of time.

      I generally follow the "If it is not broken, do not fix it rule" very strictly.

      Last year I managed 99.9999% uptime, so I think I am ok. The contract in question stipulates 99.999% uptime but that is too easy. Please note these uptime figures are for the entire web application. I think the database servers actually hit the big 100% since I cannot remember any issues since some time in 2007. I could look in our monitoring data but I am on a train at the moment and can not be bothered connecting to our office VPN over a sketchy mobile broadband connection.

      --
      I dont read /. to RTFA, I read /. to offend people in ignorance.
    14. Re:Exactly. by Codifex+Maximus · · Score: 1

      aussersterne said:
      "Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:

      Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other."

      Applications DEPEND on the filesystem and other OS API's and services. Application programmers should NOT have to do anything but depend on the reliability of those OS components. The programmers of applications have enough to worry about without having to play hopscotch around OS irregularities - they've got enough to worry about working with the user's irregularities.

      An implementation written according to the spec should inter-operate with any other app or implementation written according to that same spec. If problems with interoperability develop by any misinterpretation or stretching of the spec then the spec is not specific enough.

      There is a solution lurking somewhere.

      --
      Codifex Maximus ~ In search of... a shorter sig.
    15. Re:Exactly. by xous · · Score: 1

      I guess I got a bit carried away and detailed a more ideal environment than most can afford. Still if you have a database server and you don't even have a cheap UPS you deserve what you get.

      IMHO if a business can't afford a UPS for the server they should reconsider their business plan. I've seen businesses that run core business operations on a $8.95 shared account come in and whine about losing $100/h when downtime occurs. I have no sympathy for them.

      By stable software I meant stable kernel, stable drivers, mysql, etc. Obviously Ext4 is a little new and so I might not consider it stable enough for a production environment. (This has nothing to do with the data-loss 'bug'.)

      I'm not a C/C++ programmer but from the explanation in the article and the bug reports it seems to me that this issue is nothing more than idiots not knowing the proper function to call when writing files.

      To me this seems like writing a network application and then whining that some of your data didn't get sent because you didn't flush the buffer. Just plain stupid.

      The only thing that has changed here is the window to lose data went from 5 seconds to 150 seconds.
             

  68. Memory is very cheap these days. by aussersterne · · Score: 1

    Why not just accumulate all disk changes in cached RAM and wait until the next shutdown to sync it all? The maximum time spent writing to the hard disk no matter what the computer did while it was on or how long it was on would then be O(1) (no more than the total size of the disk) and write performance would be astronomical!

    Of course, reliability would suffer...

    --
    STOP . AMERICA . NOW
  69. Re:Exactly by Anonymous Coward · · Score: 0

    Those that do not are simply not good developers.

    Because blaming the user is a strategy that has been used with complete success for generations?

    Seriously - who cares whose fault it is? Quit attributing blame, and think about ways to mitigate the symptom of said widespread failure to code nicely.

  70. You misread TFA - Apps vs. Power Loss etc. by billstewart · · Score: 1

    No, it was about the user's data getting trashed because the application wrote the data to disk in orders that weren't stable if the system crashed for whatever reason, including power loss, and about the change in file system behaviour increasing the potential delay between writes and therefore increasing the risk that the badly written application would lose data by about 30x.

    TFA shows examples of applications that work by

    • 1. Trash the old version of the file.
    • 2. Write the new version.
    • 3. ...(oops, system crashed, not getting the new version written to disk.)
    • 4. Lossage!

    and comparing them to applications that rename the old version, write the new version, and do their fsyncs in orders that will always leave the disk with a correct old version, at least until the new version is stably written. In the latter case, if the new version didn't get written, the application can use the old version, and it'll be fine.
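
    A minimal sketch of the safe ordering, assuming a POSIX-like system. This is the write-to-a-temporary-file-then-rename variant rather than the rename-the-old-version variant described above, and the function name replace_file_safely is made up for illustration. It does not fsync the containing directory, which a stricter version might also do.

```python
import os
import tempfile


def replace_file_safely(path, data):
    """Replace `path` so that a crash leaves either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new version next to the old one, so the final rename
    # stays inside a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # user-space buffer -> kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the disk
        # Only now swap names; the old version was intact until this point.
        os.replace(tmp_path, path)
    except BaseException:
        # The old version is untouched; just drop the half-written temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

    The point of the ordering is that the old file is never truncated or removed before the new data has been handed to the disk.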

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  71. Re:Exactly by ozphx · · Score: 1

    High-level languages usually provide buffered stream APIs. All of these APIs contain sync/flush calls. It's bloody obvious that if you have buffers then shit is in memory, and not necessarily where you sent it. That should be the first hint to an incompetent monkey who thinks "Hey my high-level language abstracts me from knowing what the fuck I'm doing!"
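
    As a rough illustration of the buffering point, here is a small Python sketch (the function name sizes_before_and_after_flush is hypothetical). It assumes the 5-byte write fits in the 8192-byte stream buffer, so nothing reaches the kernel until flush() is called; os.fsync() is then the separate step that asks the kernel to write to the disk.

```python
import os


def sizes_before_and_after_flush(path):
    """Return the file size the OS reports before and after flushing the stream."""
    with open(path, "wb", buffering=8192) as f:
        f.write(b"hello")
        before = os.path.getsize(path)  # data is still in the stream's own buffer
        f.flush()                       # hand the buffer to the kernel
        after = os.path.getsize(path)
        os.fsync(f.fileno())            # separately ask the kernel to hit the disk
    return before, after
```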

    --
    3laws: No freebies, no backsies, GTFO.
  72. Re:Bull Huh??? by davidsyes · · Score: 1

    And i was going to ask if this issue had any relation to why KDE 4 (in my Mandriva 2009 Free system) NEVER remembers what opened apps and folders i had open. I NEVER had KDE 3.x "forget" to remember my settings across sessions once i checked the box for it to do so. KDE 4, no matter what i try, keeps returning me to a blank/no previous session items desktop. Making changes in KDE 3 messes around with KDE 4, and that's a shame. Certain settings in KDE 4 are grayed out, and that's annoying.

    But, i suppose someone will say my comment is off-topic, or not related. But, thought I'd mention this anyway...

    --
    Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
  73. Re:Works as expected...No! by sam0737 · · Score: 1

    No!...it does not work as expected, or did I misunderstand something?

    I don't mind the file system losing the last 2 minutes of work, as long as it's in a consistent state. But if you mean the metadata and data can be committed at different times, which means the file size could change without actual data going in or vice versa, no it's a big no!

    Data and metadata being committed to disk out of order would also be a big concern.

    It has to be consistent please. Losing 5 minutes of work is ok, but if it loses 50% of the work while keeping 50% of the state, then it's really bad.

    Could someone please clarify if the above is not the case? Otherwise I wonder if database applications around are doing the right thing?

  74. Re:Bull... by ozphx · · Score: 1

    And you fix it in GNOME/KDE. All these advocates of ext4 shooting itself in the foot to work around bad practice on the part of the UI developers need to fuck off. They are advocating MS-style back-compatibility kludges... for the fkn filesystem.

    Even MS has managed to stop their UI people kludging up their core shit like NTFS.

    --
    3laws: No freebies, no backsies, GTFO.
  75. Compromise? by Tubal-Cain · · Score: 1

    Is there a way to simply change the delay to what it had been in ext3?

  76. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  77. All API's Have This by coryking · · Score: 1

    If you are a .NET developer, FileOptions.WriteThrough is what you are looking for if you need your shit to get written out to the filesystem right away.

    // Needs: using System.IO; using System.Security.AccessControl;
    using (FileStream fWrite = new FileStream("test.txt", FileMode.Create, FileSystemRights.Modify, FileShare.None, 8, FileOptions.WriteThrough))
    {
        // do shit.... no need to Flush();
    } // this will be written out

    using (FileStream fWrite = new FileStream("test2.txt", FileMode.Create, FileSystemRights.Modify, FileShare.None, 8, FileOptions.None))
    {
        // do shit....
        fWrite.Flush(); // flush the stream's buffer, kinda like fsync(); Flush(true) on .NET 4+ also flushes the OS file buffers
    }

    1. Re:All API's Have This by Anonymous Coward · · Score: 0

      Clearly you do not want to "writethrough" dozens of small files. It would be unnecessarily slow. If you really don't want to use a database, the right way is to write all files with new filenames, then fsync() once, then rename them all over the old files. That way the files with the normal name either have the old contents or the new contents, but are never empty.
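
      A rough sketch of that batch approach, assuming a Unix-like system. fsync() works on one file descriptor at a time, so this sketch swaps in os.sync() as the single "flush everything" step; POSIX lets sync() return before the writes finish, although Linux waits. The function name replace_many and the ".new" suffix are invented for illustration.

```python
import os


def replace_many(updates):
    """Replace several files at once; `updates` maps final path -> new bytes."""
    staged = []
    for path, data in updates.items():
        tmp_path = path + ".new"  # hypothetical naming scheme for the new copies
        with open(tmp_path, "wb") as f:
            f.write(data)
        staged.append((tmp_path, path))
    # One flush for the whole batch instead of one per file.
    os.sync()
    # Only after the flush, rename each new copy over its old file.
    for tmp_path, path in staged:
        os.replace(tmp_path, path)
```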

    2. Re:All API's Have This by Hal_Porter · · Score: 1

      CreateFile in Win32 has always allowed you to force write through if you want writes to be synchronous.

      http://msdn.microsoft.com/en-us/library/cc644950(VS.85).aspx

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  78. Nothing new by mmu_man · · Score: 1

    It's been discussed for about an hour in the corridor at FOSDEM after Ted Ts'o's talk about ext4...
    Basically he said app writers are to blame for abusing fs-specific behaviour :P

    1. Re:Nothing new by SaDan · · Score: 1

      It's a shame Reiser turned out to be such a whackjob. We could have had a really sweet filesystem under Linux.

    2. Re:Nothing new by Anonymous Coward · · Score: 0

      By abusing FS-specific behavior, does he mean working around performance deficiencies of ext3?

  79. But can you do it without looking? by clarkn0va · · Score: 2, Funny

    My core2quad machine with 3 SATA disk RAID runs for about 20 minutes on a tiny APC UPS I bought from newegg for less than $100.

    Sure, but that's assuming you can save your work in all open applications without power to your display. Me, I like a UPS with a little more juice so I can reap the fullness of my 52" plasma while cleaning up and shutting down.

    --
    I am literally 3000 tokens away from the chaotic crossbow --Stephen
    1. Re:But can you do it without looking? by theapeman · · Score: 1

      Saving data ought to be automatic, without requiring the user to do anything.

  80. If it's not a bug... by Anonymous Coward · · Score: 0

    it's at least a very poor implementation decision.
    They basically chose performance (and/or factually correct) over safe.
    Create a new fopen flag for delayed write, create a fsync'er daemon, add tunable parameters to (disable) fsync on file close or on app close, I don't care, but as is the "new" behavior is pretty disturbing.

  81. rename and fsync by DragonHawk · · Score: 3, Insightful

    "Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. "

    Two things are happening:
    (1) KDE is writing a new inode.
    (2) KDE is renaming the directory entry for the inode, replacing an existing inode in the process.

    KDE never calls fsync(2), so the data from step one is not committed to be on disk. Thus, KDE is atomically replacing the old file with an uncommitted file. If the system crashes before it gets around to writing the data, too bad.

    EXT4 isn't "broken" for doing this, as endless people have pointed out. The spec says if you don't call fsync(2) you're taking your chances. In this case, you gambled and lost.

    KDE isn't "broken" for doing this unless KDE promised never to leave the disk in an inconsistent state during a crash. That's a hard promise to keep, so I doubt KDE ever made it.

    A system crash means loss of data not committed to disk. A system crash frequently means loss of lots of other things, too. Unsaved application data in memory which never even made it to write(2). Process state. Service availability. Jobs. Money. System crashes are bad; this should not be news.

    The database suggestion some are making comes from the fact that if you want on-disk consistency *and* good performance, you have to do a lot of implementation work, and do things like batching your updates into calls to write(2) and fsync(2). Otherwise, performance will stink. This is a big part of what databases do.

    As someone else suggested, it's perfectly easy to make writes atomic in most filesystems. Mount with the "sync" option. Write performance will absolutely suck, but you get a never-loses-uncommitted-data filesystem.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
    1. Re:rename and fsync by QuoteMstr · · Score: 2, Informative

      Telling application developers to use a database is bullshit. The filesystem is a database, albeit not a relational one. An open-write-close-rename sequence merely asks for atomicity without durability, something that's perfectly reasonable. As other posters have mentioned in vain, all the application wants is for either the old version of a file or the entire new version to appear on a reboot. He doesn't care at the instant of the rename whether that replacement has been recorded on disk, just that eventually, when the filesystem does record that replacement, that it's recorded atomically.

      You might want the open-write-fsync-close-rename behavior for a mailserver, in which you must acknowledge receipt (i.e., you need durability), but asking for that same durability in a multi-file configuration setup is just stupidly degrading performance.

      open-write-close-rename is saying something fundamentally different from open-write-fsync-close-rename, and it's perfectly reasonable for a filesystem to act sanely in response to both kinds of request.

    2. Re:rename and fsync by swilver · · Score: 1

      KDE never calls fsync(2), so the data from step one is not committed to be on disk. Thus, KDE is atomically replacing the old file with an uncommitted file. If the system crashes before it gets around to writing the data, too bad.

      POSIX may allow it, but I was under the impression that filesystems should try and remain in a sane state. That means if my application does A, B, C and then D, that I expect the following to be on disk no matter what happens, either:

      1) Nothing

      2) A

      3) A + B

      4) A + B + C

      5) A + B + C + D

      It seems that EXT4 will deem that some of these are not as equal as others (if C for example is a data write), and can give you a new result like:

      6) A + B + D

      That's certainly unexpected, and highly undesirable if D happens to be something like "delete backup file" or a rename over an old file.
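
      The expectation can be written down as a toy model (not a description of any real filesystem): if operations reach the disk strictly in the order they were issued, the only possible post-crash states are prefixes of the sequence. The function names here are made up for illustration.

```python
def ordered_crash_states(ops):
    """All on-disk states possible if operations are committed in issue order."""
    return [tuple(ops[:n]) for n in range(len(ops) + 1)]


def is_ordered_state(ops, on_disk):
    """True if `on_disk` is one of the prefix states of `ops`."""
    return tuple(on_disk) in ordered_crash_states(ops)
```

      Under this model, is_ordered_state(["A", "B", "C", "D"], ["A", "B", "D"]) is False, which is exactly the state 6 complained about above.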

  82. fsync() is no substitute for ordered disk writes by Cassini2 · · Score: 1

    Having file system metadata not match file system data is a pretty big bug. Ext3 defaulted to having everything mounted such that the writes to the disk were "ordered" ie: (data=ordered). Ext4 does not force "ordered".

    Userland can solve this problem by calling fsync() all over the place, like before every close. However, that completely defeats the purpose of having a buffered write-back file system. If the new rule is to change every userland program to force all data to be flushed to disk after every close, then we might as well mount the filesystem with "-o sync", and flush our performance down the toilet. (Pun intended.)

    The problem here is that no call exists to force writes to disk to be "ordered". fsync() is not a substitute for ordered writes to disk. There are just too many ways an application can get into trouble if writes to disk aren't ordered. Having situations where neither the backup file nor the new file are valid is just the beginning of the problems.

    I write data acquisition applications that write lots of data in many files to disk. I don't care if my newest file is blank. This "bug" could mean that I have a pseudo-random number of blank files, and they might not even be ordered. My only solution, fsync(), will tank the application's performance, by causing huge amounts of disk activity. fsync() is not a substitute for ordered writes.

  83. Not acceptable... by Anonymous Coward · · Score: 0

    POSIX or no, this kind of crap won't cut it. Losing a file or two every few years was acceptable, but zeroing or orphaning hundreds of the most recently written files every time there's a crash is NOT, I don't care whether you're in a home or production environment.

    1. Re:Not acceptable... by SaDan · · Score: 1

      This is exactly why I don't trust production data to ext-based filesystems. The developers all have a huge chip on their shoulders, and a pipe up their ass.

      It all looks great in theory, but ext has been getting worse and worse with performance AND reliability since ext2. If Linux can't get a decent native filesystem in the kernel, I'm going to switch back to Solaris.

  84. Ya it's pretty unreasonable by Sycraft-fu · · Score: 1

    Whenever you are talking anything with some kind of delay, you need to think about what is reasonable for the situation. For something like disk writes, a few seconds is probably the most that is reasonable. It is ok to say that no, a write isn't going to happen right away and put everything on hold, but the expectation is that it will be serviced ASAP, not put off for minutes. That is just waaay too long.

    To me this would be like depositing a check at the bank and then not having it show up in your account for two months. Sure, there's always a delay between the deposit and when it posts since it needs to clear, but for that days are reasonable, not months.

    Performance for file systems is great and all, but not if it comes at the severe expense of reliability. I just don't see minutes as being an acceptable delay for writes, no matter what the case. You can argue all you like about the theory, the fact of the matter is I don't know of any widely in use OS/FS combo that does this. Windows/NTFS doesn't, Linux/EXT3 doesn't, etc.

    1. Re:Ya it's pretty unreasonable by davecb · · Score: 1

      It's ok to have some pretty huge delays, if and only if you delay the metadata update from the (atomic) rename as well as the write. Doing the metadata change and then delaying the write opens a huge window for a crash to render the filesystem inconsistent.

      The delay should be sufficient to allow all the operations to be topologically sorted into an order which both preserves correctness and allows the data to be written in a single pass of the elevator.

      Research over ten years ago showed the necessary and sufficient time to vary with the size of the commit cache, and to be otherwise on the order of seconds. If I understand correctly, ZFS uses 5 seconds.

      --dave

      --
      davecb@spamcop.net
  85. How to express "atomic replace, defer ok" by cnewman · · Score: 1

    As an application programmer, one of the more common filesystem operations I want to do is "replace this file atomically; and feel free to delay commit of the replace for power/performance reasons as long as it happens atomically." The POSIX API provides no documented way to express this, so a common POSIX call sequence is used to express this semantic (write-new-file, rename on top of old-file).

    The problem is that EXT4 now interprets that common calling sequence which traditionally has useful semantics on most filesystems in a way that is both useless and harmful to data integrity. And furthermore it leaves application programmers no way to express the "atomic replace defer ok" semantics. So in pursuit of filesystem performance EXT4 has broken a performance-optimizing semantic. If applications are changed to fsync when it's completely unnecessary (only sequence preservation is needed), we will all pay the performance cost.

    So EXT4 may comply with POSIX, but it does so in a way that is harmful to overall system performance, harmful to data integrity and harmful to performance optimization of application file operations.

    As an application developer highly concerned with optimal performance, my response will be to refuse to support EXT4, and to discourage use of EXT4+workaround as it has suboptimal performance. The correct fix is to make EXT4 guarantee to commit the rename after the data write operations, but for performance, it should delay both commits until the next flush interval. If I replace the same file twice within a flush interval, I'd prefer the intermediate version never be written to disk.

    Until an "atomic replace" operation is added to POSIX, I want the filesystem to interpret that common sequence of calls with the sensible and rational interpretation.
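
    The behaviour asked for above can be sketched as a toy in-memory model, purely for illustration (the class DeferredReplaceCache and its disk_log are invented here and say nothing about how ext4 or any other filesystem is implemented): replacements are held until the next flush, a second replacement of the same file overwrites the pending one, and at flush time each file's data is recorded before its rename.

```python
class DeferredReplaceCache:
    """Toy model of 'atomic replace, defer ok' semantics."""

    def __init__(self):
        self.pending = {}   # path -> latest replacement data not yet on disk
        self.disk_log = []  # ordered record of what reached the "disk"

    def replace(self, path, data):
        # A newer replacement simply supersedes an older pending one.
        self.pending[path] = data

    def flush(self):
        for path, data in self.pending.items():
            self.disk_log.append(("data", path, data))  # data blocks first
            self.disk_log.append(("rename", path))      # then the rename
        self.pending.clear()
```

    Replacing the same path twice before a flush leaves only the second version in disk_log, and the rename entry always follows the data entry for that path.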

    1. Re:How to express "atomic replace, defer ok" by QuoteMstr · · Score: 1

      Thank you! I'm glad someone else finally gets it! I disagree slightly, however: I think open-write-close-rename is a perfectly good way of asking for atomic replacement without an immediate commit. I think that should be the default behavior, since it's the most useful. The ext4/XFS behavior, in terms of safety, is just equivalent to open-write-close on the original file, and I think it's okay to make that operation the unsafe one.

      In short, you're right, but in addition, I think the POSIX vocabulary is just fine as it is. Also, the delay behavior you're talking about is essentially what soft-updates does. It seems to work fine for the BSD people.

  86. If file is CLOSED queue fsync by topham · · Score: 1

    If a file is closed queue an fsync within a few seconds. If many files are closed by the same application before the first fsync bump the SHORT delay each time. This means the last fsync will only be 1 delay period away from write.

    An fsync caused by a close should be less than 2 seconds away.
    DONE.
    Problem stays OUTSIDE of the application space where it sure as fuck does not belong in the first place.
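
    One way to read this proposal is as a debounced deadline, sketched below as plain bookkeeping with no real I/O. The class name CloseSyncScheduler, the 0.5 second delay and the 2 second cap measured from the first close are all assumptions made for the sketch; the post only says the delay is short and that the fsync should be less than 2 seconds away.

```python
class CloseSyncScheduler:
    """Tracks when a deferred sync should fire after one or more close() calls."""

    def __init__(self, delay=0.5, max_wait=2.0):
        self.delay = delay        # short delay, bumped by every close
        self.max_wait = max_wait  # upper bound measured from the first close
        self.first_close = None
        self.deadline = None

    def on_close(self, now):
        if self.first_close is None:
            self.first_close = now
        # Bump the deadline, but never past max_wait after the first close.
        self.deadline = min(now + self.delay, self.first_close + self.max_wait)
        return self.deadline

    def due(self, now):
        return self.deadline is not None and now >= self.deadline

    def synced(self):
        # Call after the sync has been issued to start a fresh window.
        self.first_close = None
        self.deadline = None
```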

    1. Re:If file is CLOSED queue fsync by QuoteMstr · · Score: 1

      You're right that the problem is below the application layer, but fsync-on-close (even after two seconds) is a phenomenally stupid idea. Plenty of applications open a file each time they write something to a log, then close it again. While this is slightly inefficient, it's not something that should cause a disk-grinding full sync.

      No! Atomic rename is how atomic updates work under unixish systems. The filesystem needs to properly implement the existing vocabulary. open-write-close-rename (with no goddamn fsync) is a perfectly reasonable thing to do.

      All that's required is for the system to guarantee that when A is renamed on top of B, all data blocks for A are committed before the rename is. That way, an application looking at B will either see the contents of the old B or A, and never an empty file or a mixture of the two, even if the system dies in the seconds between that close and rename.

    2. Re:If file is CLOSED queue fsync by Eunuchswear · · Score: 1

      You're right that the problem is below the application layer, but fsync-on-close (even after two seconds) is a phenomenally stupid idea.

      Ah? I've been using it for over 10 years now. Works for me.

      $ mount
      / on /dev/root read/write/setuid/mincache=closesync on Wed Jan 7 22:31:18 2009
      /proc on /proc read/write on Wed Jan 7 22:31:59 2009
      /stand on /dev/vx/dsk/standvol read/write on Wed Jan 7 22:32:00 2009
      /home on /dev/vx/dsk/homevol read/write/log/setuid/mincache=closesync on Wed Jan 7 22:32:24 2009

      --
      Watch this Heartland Institute video
  87. Sounds like a KDE problem by fast+turtle · · Score: 1

    From what I can see, this was a case of kde writing to the primary config files in /usr/kde. If this is the case, then it's definitely a screwup by the KDE Devs as there is absolutely no reason for KDE to be writing to those files after it's installed simply because the base installation has no way of knowing if that directory is mounted -ro (read-only) as I do for /usr. If the data loss occurred in the /home/~.kde folder, then that indicates a problem with such a long write delay. I'm sorry but although I use Ext3 with longer write delay (15 seconds) I also have a UPS connected to the system to reduce probability of data loss but I will never extend the cache delay to more than 30 seconds unless it's a laptop.

    --
    Mod me up/Mod me down: I wont frown as I've no crown
  88. Re:Not a bug s by Anonymous Coward · · Score: 0

    fflush() just flushes the user mode buffer in FILE to the kernel buffer via _write (WriteFile).

    If you want fsync then you need to call _commit (FlushFileBuffers), or call fopen with a mode string that includes "c" (MS-specific) and then call fflush. See the documentation for fopen:

    const char *mode:

    c
    Enable the commit flag for the associated filename so that the contents of the file buffer are written directly to disk if either fflush or _flushall is called.
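
    The same two-step distinction (user-space buffer to kernel, then kernel to disk) can be sketched in Python. This is only an illustration, not anything from the bug report: write_and_sync is a made-up helper name, and it assumes CPython, where os.fsync() calls fsync() on Unix and _commit() on Windows.

```python
import os


def write_and_sync(path, data):
    """Write bytes to path, then ask the OS to push them to the device.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        # flush() only moves data from the user-space buffer into the
        # kernel, like fflush() in C. It says nothing about the disk.
        f.flush()
        # os.fsync() asks the kernel to write the file's data out to the
        # storage device before returning.
        os.fsync(f.fileno())
```

    Leaving out the os.fsync() line still gives you a file that reads back correctly while the system is up; the difference only shows after a crash or power loss.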

  89. If the disk isn't busy, write? by Anonymous Coward · · Score: 0

    If the disk isn't otherwise busy, why not flush dirty buffers, regardless of their age? This one simple heuristic would minimize a lot of these data-loss issues from forgotten fsyncs in typical desktop usage patterns.

    In a typical desktop, there is a lot of activity for a short time, and then the disk sits idle. There is no reason to hold on to the dirty data in the hopes of combining more writes, because the disk isn't doing anything right now anyway. So flush things out. Do them one at a time so that if a sudden burst of traffic comes in it doesn't sit behind a long queue.

    If the disk *is* busy, then the usual delayed algorithm applies. And the apps still should use their fsyncs. But for the ones that don't, in the typical case of the computer doing a flurry of work and then sitting idle, it'll minimize the exposure should a power failure hit.

    The only exception to this heuristic might be for flash drives, where you wouldn't want to re-write the same block soon afterwards.

  90. Transactional NTFS by jpmorgan · · Score: 1

    In Vista and Win7, NTFS supports atomic transactions. With TxNTFS KDE could do all those config file updates as a transaction and have guaranteed atomicity. No need for an extra registry-like database.

    How ironic. :-)

  91. Re:fsync() is no substitute for ordered disk write by QuoteMstr · · Score: 1

    The problem here is no call exist to force writes to disk to be "ordered"

    Right. Atomic rename is a special case of ordered write, really. Atomic-but-asynchronous-rename is great, but something more powerful would be nice too.

    What we really need is a user-level fbarrier. I'm not the first person to think of this syscall.

    Also,

    ZFS guarantees the write(2) are ordered by the fact that either they show up in the order supplied or they don't at all.

    When you think about it, that's a very powerful guarantee. (Personally, though, I'd rather have fbarrier.)

  92. Hiding behind POSIX by antientropic · · Score: 2, Informative

    All the Idiots who scream here that the OS is doing something wrong: no, it's not.

    This is called "hiding behind the standard" (a disease very common among kernel developers). Just because the standard doesn't specify behaviour in a certain situation doesn't mean that any behaviour is equally okay. In this case, ext4's behaviour very much hurts the robustness of the system, which is rather important in unreliable environments like laptops.

    In this case, what KDE does is certainly not unreasonable (and its developers are certainly not "idiots"). It doesn't overwrite configuration files in place, which would be bad even in the absence of system crashes, as doing it that way is not atomic. Instead it creates a new temporary file, writes the new contents, then renames the temporary file to the old one. This is an atomic operation on Unix: you either see the old contents or the new contents, but nothing in between. Now, the problem is that in case of a crash, ext4 gives you the worst possible outcome by reordering the operations: it will "recover" the rename for you, but not the actual write of the new data. So you end up with a 0-byte file - far from atomic. POSIX of course allows this, but POSIX allows just about anything: that doesn't mean it's reasonable. The only guaranteed solution - use an fsync/fdatasync - is something that almost nobody does because the performance is horrible (ext3 in fact will write the entire journal, IIRC, when doing an fsync() on a single file - this really hurt Firefox 3 performance). So the KDE developers can be excused for not doing that.

    It's the job of a modern filesystem to ensure robustness and performance. If you don't use an fsync, you should expect that there is a time window during which transactions might become undone (not the end of the world for configuration files), but they should never be reordered. For instance, this is how Berkeley DB works if you disable fsync: it guarantees ACI but not ACID. For many desktop applications, that's good enough. Destroying every file that has been updated since the last fsync isn't. And your users aren't going to be impressed by the argument that POSIX allows it.
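
    For anyone who hasn't seen the pattern, here is a minimal Python sketch of the temp-file-then-rename update described above, deliberately without any fsync (atomicity, no durability). It is not KDE's actual code; replace_atomically and the ".tmp-" prefix are names invented for the example.

```python
import os
import tempfile


def replace_atomically(path, data):
    """Replace the contents of path with data via temp file + rename.

    Hypothetical helper; no fsync is issued, so this asks for atomicity
    only, not durability.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so that the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # os.replace() is rename() on POSIX: readers see either the old
        # file or the new one, as long as the system stays up.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the rename completed: remove the temp
        # file and leave the original untouched.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

    Note that mkstemp() creates the temp file with owner-only permissions, so a real implementation would probably want to copy the old file's mode as well. Whether the new contents survive a crash right after the rename is exactly the filesystem question being argued here.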

    1. Re:Hiding behind POSIX by drolli · · Score: 1

      I was told to rely on the documentation of a function when using it, not on my feeling of what is reasonable. If the manpage says that you have to fsync it to get it reliably on the disk, then do it. The idea behind this is that it gives you the choice of when to fsync. For example, when you assume the file has been written in a way that lets you do your "atomic" operation of moving it, it would be a good idea to fsync it before closing it. This does not hit the performance, because then the filesystem does exactly what you want it to do.

      As i mentioned before, i also had the wrong idea. And i am not ashamed of it, because it is non-obvious. I even may have spread this impression in a programming course. However: I was wrong. I read the documentation of the function i am using more carefully and i see that i clearly ignored a sentence there. As sorry as i am, i would be an extreme asshole if i now blamed the person who wrote a correct function that sticks to the documentation for my mistake, by saying that he is unreasonable to follow this documentation. Because i can imagine a *lot* of situations where i would strongly want to have this control over what is going on (without the fs guessing things). The solution is simple: if you are concerned about your closed files not being written, it should not take long to refactor the code.

    2. Re:Hiding behind POSIX by Anonymous Coward · · Score: 0

      Actually, older KDE versions used to do fdatasync, too, and that was removed after user complaints because it was doing things like spinning up drives.

    3. Re:Hiding behind POSIX by grumbel · · Score: 1

      There are two problems here. The first one is that fsync() is a POSIX function; some people, however, try to write ANSI-C or ISO-C++ code, and neither of those languages has fsync(). So this means that portable C/C++ code will behave incorrectly on Ext4, and the same is likely true for many other languages. That is something completely unacceptable. The other problem here is that hiding behind the standard is just stupid: just because the standard doesn't forbid the worst possible outcome doesn't mean it's ok to just allow it. From a good implementation I expect that it handles those edge cases of a standard sanely and doesn't just start destroying data.

      Now I haven't tried Ext4, so no idea how bad the problem is there, but I tried XFS and that was completely unusable: after almost every crash I lost files; gconf, rhythmbox, Wine config and plenty more ended up with a size of 0. That is not sane behavior, that's completely ridiculous insane shit.

  93. that's silly by Nicolas+MONNET · · Score: 1

    High end disks and modern FS (such as ext{3,4}) support write barriers to enforce ordering in critical sections.
    I suggest using high end disks on critical systems.

  94. semantic of fsync, an example by Nicolas+MONNET · · Score: 1

    In databases such as PostgreSQL, nothing is guaranteed to be recorded until a transaction has been committed *and* the DB has replied positively to the commit request. The application should not assume that the operation is successful until then.
    There is a nice tuning option in PG (and other high end DBs I suppose) where you can tell it to wait a number of milliseconds on every query so that it has a chance to do just one fsync for several transactions. It slows down sequential operations with no load, so you might want to disable this when doing certain maintenance operations (or you can arrange for those operations to be part of one large transaction instead of several small ones, such as with auto commit). In production and under load, however, this improves overall throughput dramatically.

  95. workarounds for hardware limitations by Anonymous Coward · · Score: 0

    I hate filesystems with a passion, as far as I'm concerned they all workaround a hardware limitation.

    Every hd should come with a tiny little battery, so if you write something to the hd (that hits the buffer) you can still be 100% sure it'll hit the platter.

    Would increase performance a lot, especially when dealing with software like databases that syncs a lot. No need for hardware syncs and barriers.

    Hell if a new sata standard came out, with that in the spec, and maybe even allowing the hd to use a configurable amount of system memory as a buffer... would be brilliant (yes, I realize that last part mixed with the first part would be considerably more complicated, and require a battery on the system memory).

  96. Read the articles again by Anonymous Coward · · Score: 0

    The old data was presumably synced before. Perhaps it was even written using an excellent editor that fsynced nicely, or perhaps half an hour had passed. But that old data is gone, because of failing to fsync on replacement. If you think that's okay, you shouldn't be programming filesystems. Consider the implications. Totally unnecessary data loss risk. Forced to do an fsync on changing config files, files that may change often and don't really need to be flushed to disk right away as long as the old data does not disappear. Now the thing is, I've read the POSIX spec and it doesn't clearly foresee this particular situation. Which is peculiar, considering the UNIX habit of keeping config data like this. Reading it one way, one could come to the conclusion that ext4 doesn't conform to the spec. Reading it another way, one could come to the conclusion that we need a new spec, let's say POSIX 2. I'm already looking forward to POSIX 3.1. :-)

  97. Most people have a UPS these days anyway by Jessta · · Score: 1

    Most people have a UPS these days anyway (laptops), so data loss due to power failure is very rare.

    If you're writing important stuff to disk, using fsync() has been the rule for decades.

    The reason it's not the default is because most applications write large amounts of useless junk to disk (caches of network data, scratch space etc.) which makes disk access very slow.

    The KDE devs have no excuse and should know better.

    --
    ...and that is all I have to say about that.
    http://jessta.id.au
  98. AHHH LOL by Chutulu · · Score: 1

    JUST USE FAT 32 OR NTFS

  99. Checked sync IS necessary in power loss scenario by Sits · · Score: 1

    If you just write the file and rename it without syncing and CHECKING(!) whether the sync worked, you can end up with a file that does not contain what you thought it would if a crash occurs before the file is completely written. If a crash occurs, you may learn that the rename did not act as a barrier to the write. If no crash occurs, things will be as you expect even while the OS is still writing the data to disk (you won't see an empty or half-written file, because the OS can present a "finished" view by showing you its buffers).

    When you don't check whether your data was synced to disk all bets are off as to what the files you are writing will contain (different filesystems will show different behaviour - e.g. XFS is good at showing applications with this problem much to the chagrin of unwary users). Apps need to either arrange it so that they don't care or do an explicit fsync and check the result before going further (or use O_DIRECT I guess). As a user you can arrange for your filesystem to be mounted in strict "sync" mode (which ensures everything is being written out all the time) but you'll pay a heavy speed price for doing so. I guess users could also force a sync and wait for it to finish before doing any crashes/abrupt losses of power (but this requires future seeing abilities to work every time)...
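
    A rough Python sketch of "sync, check the result, and only then rename". write_checked is a hypothetical name, and the sketch assumes the caller picks a temp path on the same filesystem as the target. In Python the check is implicit: os.fsync() raises OSError when the kernel reports a failure.

```python
import os


def write_checked(tmp_path, final_path, data):
    """Write data to tmp_path, sync it, then rename it over final_path.

    Hypothetical helper. Returns False if writing or syncing failed, in
    which case final_path is left untouched. An error from the rename
    itself is not caught and propagates to the caller.
    """
    try:
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            # Raises OSError if the sync did not succeed.
            os.fsync(f.fileno())
    except OSError:
        # Do not rename a file we could not sync; clean up if possible.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        return False
    os.replace(tmp_path, final_path)
    return True
```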

  100. Problem is Delayed Allocation NOT delayed writes by grahamm · · Score: 1

    I think that most of the comments so far have missed the point. The problem is not caused by delayed flushing of buffers, but by delaying the allocation of space for the new file (or for the data in a truncated one). ext4 still flushes with a (default) 5s delay, the same as ext3, but it only does so for blocks which have been allocated disk space.

    There are a number of reasons for having delayed allocation. First it (together with ext4's use of extents) helps to lower file fragmentation where data is being written 'slowly' (eg when downloading from the internet). Secondly, in situations where files are being used as temporary scratchpads, it can remove the requirement to write the data to disk at all in cases where the file is unlinked before it is committed to disk.

  101. This is definitely an FS problem by Uzik2 · · Score: 2, Insightful

    If the guys writing the FS can't figure out how to properly write a cache that's not the problem of the application writers.
    If I save a file via an OS call and the OS tells me it didn't fail then if I can't immediately reread it then the OS is broken.

    Data loss from write caching is not a new problem either. Guess this year's crop of programmers can't figure out how to use google to find out about past problems or they just figure they're smarter than everyone else that came before them.

    --
    -- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
  102. Re:Exactly by Random+Walk · · Score: 1

    Some high-level languages (e.g. PHP) have no built-in fsync. Also fsync() is not part of the C standard, it's a POSIX extension. What you have in C is fflush(), but that will not fsync(). So books about programming in C usually don't cover fsync(), as it's not part of the language. I know that sounds like nitpicking, but truth is, if you've learned programming from books about some language, chances are you've never heard about fsync().

    Basically, what happens is that you need to understand OS design in order to program in a high-level language, and nobody (at least none of the books) tells you so. This is a WTF on more than one level... either make it part of the language, or make sure it isn't needed.

  103. Re:Works as expected...No! by Anonymous Coward · · Score: 0

    Data and metadata are never flushed at the same time. To do so would require that both go through the journal, and are committed at the same time. In ext3, data was written *before* metadata, which avoids the problem of empty files. However, you will just get the opposite problem - data being written, but the file size not being adjusted, so part of the new data is lost because the file size wasn't updated.

    File content being inconsistent after a crash is not something the filesystem can solve. It requires some kind of transaction support (fsync being kinda half a transaction - commit without begin). Which again means that the application needs to make use of the transaction feature for the data to be safe.

    The problem here is applications NOT using fsync. Even if there was better transaction support, they would just not use that in the same way as they are not using what we have now.

  104. Not a POSIX, Application or Crash Related Issue by sonpal · · Score: 1

    The bug is an out-of-order sequencing issue. The application sequence is CreateFile, WriteData, RenameFile. What is actually happening on disk is CreateFile, RenameFile, WriteData. If the crash happens between RenameFile and WriteData, you lose the data written to disk and have a zero length file. This is a filesystem / kernel issue.

    The length of time between disk writes exacerbates the problem. sync() forces a write and reduces the window when the filesystem is susceptible, but the bug is still there.

    This is a common bug when designing caches, because the sequence of writes of interdependent data must be in-order to maintain integrity.

    -- Hiten

    1. Re:Not a POSIX, Application or Crash Related Issue by swilver · · Score: 1

      Totally agree. When the filesystem syncs, it must sync everything up to a certain point in time, and nothing after that point. If it doesn't, then you can end up with situations where steps that occurred earlier are not committed while later steps are. I wrote a journaled filesystem myself long ago, and even using a high commit interval it at least would not mess with the order that events occurred.

  105. ZFS - copy-on-write & checksums - today by toby · · Score: 2, Informative

    Re: "backup old file and write a new one" - A transactional copy-on-write filesystem such as Sun's ZFS is doing almost the same job, transparently.

    I have little doubt that copy-on-write will eventually supersede overwrite-and-pray filesystems. The wins are numerous, including cheap snapshotting, etc, etc. Install OpenSolaris and give ZFS a try today!

    --
    you had me at #!
  106. Dumb question, but.... by Joey+Vegetables · · Score: 1

    I do mostly app development, not system, but, as I understand it, many apps including KDE and Gnome are doing a bunch of small truncate-and-writes in style (a) or (b), presumably because style (c) would be too expensive due to the overhead of fsync().

    Am I missing something, or couldn't they just do the writes in style (c), except not do the fsync() each time, but rather call fsync() every five seconds or so in a separate thread? Wouldn't that allow for the reasonably fast writes without the risk of corruption?
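
    One way to picture the idea, as a Python sketch under stated assumptions: updates are written to temp files right away, and the fsync-plus-rename step is done later for the whole batch. BatchedWriter and the ".new" suffix are invented for the example, and the separate thread or five-second timer is left out; the caller decides when to call commit(). Whether batching is actually cheaper depends on the filesystem.

```python
import os


class BatchedWriter:
    """Stage several file updates, then sync and rename them together.

    Hypothetical class for illustration only.
    """

    def __init__(self):
        self._staged = []  # list of (tmp_path, final_path) pairs

    def stage(self, final_path, data):
        """Write data to a temp file next to final_path; no fsync yet."""
        tmp_path = final_path + ".new"  # naming scheme is an assumption
        with open(tmp_path, "wb") as f:
            f.write(data)
        self._staged.append((tmp_path, final_path))

    def commit(self):
        """Sync every staged temp file, then rename each into place."""
        # First pass: push the data of all temp files to the device.
        for tmp_path, _ in self._staged:
            fd = os.open(tmp_path, os.O_WRONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        # Second pass: the renames happen only after the syncs succeeded.
        for tmp_path, final_path in self._staged:
            os.replace(tmp_path, final_path)
        self._staged.clear()
```

    Until commit() runs, the old files are still in place, so a crash in between costs you the recent updates but not the previous contents.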

  107. This is great by FunkyELF · · Score: 1

    A filesystem that takes something to the extreme will hunt down and kill bugs in programs that make assumptions.
    This is why porting a program to different OS's, trying it out on different architectures is great.
    I've run into this kind of bug before when you write to a file and expect the file to be there right away. It worked on one setup where the fileserver was the same as the application server, but when we moved the appserver it started failing. NFS didn't report the file there right away, it took a little while.
    The fewer assumptions the better....better software.

  108. XFS binary NULLs problem fixed by clawsoon · · Score: 1

    In fairness to XFS, they finally accepted that binary NULLs were a problem and fixed it in the spring of 2007.

  109. $PROGRAMNAME does not care about my files by Looce · · Score: 1

    OptiPNG apparently doesn't care about my PNG files either, then. Firefox doesn't care enough about my downloads to write them fully to disk before saying the download is done; 'tar z' doesn't care enough about my backups to write them fully to disk before I can use the backup tarball, etc.

    And this is where I state that the programs a user uses do not know the intent of said user in all cases. Imagine if the 'tar' utility called fsync on each file when I restored a .tar.gz file containing 1500 small files. The disk would thrash, unless there was some sort of read-ahead done on the .tar.gz before... but then the filesystem metadata for the extracted files would need to be written too, which means that the disk would thrash on writes alone, never mind interwoven with reads.

    However, fsyncing a zip file which I'm only creating to send over my LAN and then deleting places unnecessary strain on my hard drive.

    Azureus, the well-known Java BitTorrent client, does fsync calls (actually via Java's FileChannel.force(), but that's another story), and I hate that. My connection is liable to fill up the hard drive's seek queue with metadata updates while downloading, thereby giving less I/O time to other applications and starving them. I would rather see it fsync once at the end of the download, before the hash check, or do data-only fsyncs that need to seek less. I don't care that the file's last-modification time is wrong while I'm downloading.

    If all programs will now start to fsync files because of this POSIX rule and the ext4 filesystem, then I will use laptop_mode even on my desktop, because it drops fsyncs to delay writes up to its configured interval. The last thing I need as a desktop user is GNOME or KDE starting slower, which it will if it takes Tso's advice to heart... No more grouping writes across these hundreds of files!
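
    To make the "sync once at the end, data only" preference concrete, a small Python sketch. save_download is a made-up name and has nothing to do with Azureus' real code; os.fdatasync() is not available on every platform, so the sketch falls back to os.fsync().

```python
import os


def save_download(path, chunks):
    """Write all chunks to path, then issue one data sync at the end.

    Hypothetical helper for illustration only.
    """
    # fdatasync() skips metadata that is not needed to read the data back
    # (such as timestamps); fall back to fsync() where it is missing.
    sync = getattr(os, "fdatasync", os.fsync)
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)  # no sync per chunk
        f.flush()
        sync(f.fileno())  # one sync for the whole file
```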

  110. It's fixed in XFS now by clawsoon · · Score: 1

    The infamous XFS binary NULLs problem was fixed in 2007.

    It *was* a problem, despite the XFS developers saying before 2007 exactly what the ext4 developers are saying now: "We're following spec, so it's your problem if you lose data."

    Sooner or later, ext4 will be fixed, just like XFS was, once the developers realize that "omg my data is gone" is filesystem publicity death, no matter how on-spec they are.

    1. Re:It's fixed in XFS now by grumbel · · Score: 1

      Did that fix ever make it into Ubuntu? Since XFS on my Ubuntu 8.10 box trashed files on each and every single crash, completely unusable for desktop use.

  111. Delayed alloc helps with Windows client stupidity by clawsoon · · Score: 1

    You might be interested in this whitepaper from Intel. What they find is that the Windows CIFS client write pattern creates serious fragmentation problems for ext3. The problems are mitigated (though probably not completely solved) in XFS precisely by what you mention - delayed allocation.

  112. Mod up parent by clawsoon · · Score: 1

    Exactly. If I had mod points, I'd give them to you.

  113. Re:Exactly by gweihir · · Score: 1

    Basically, what happens is that you need to understand OS design in order to program in a high-level language, and nobody (at least none of the books) tells you so. This is a WTF on more than one level... either make it part of the language, or make sure it isn't needed.

    Maybe this is why some computer scientists write software: They typically have a mandatory OS course and were taught these things. Still a lot out there that do not remember.

    Face it: Writing good code is hard. Most code-writers cannot do it. Blaming the language for it is the wrong approach, as the language cannot fix all problems. Simply not possible. Well-written language books will also tell you that you need to understand the system you are programming for, not only the language.

    Probably the line between a programmer and a software engineer runs somewhere here, between those that do understand the environment they are creating software in and those that do not.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  114. fsync IS NOT THE SOLUTION!!! by spitzak · · Score: 1

    There are some very intelligent posters above who have pointed out that the desired result is not achievable with EXT4. fsync fixes the symptom but is NOT the solution! fbarrier is also not right. And POSIX was not written by omnipotent oracles and can be ignored when it is obviously wrong. EXT4 is broken and needs to be fixed!

    I'm going to try my own explanation:

    If a program starts the call rename(A,B) and then the system crashes, and you reboot and you look at B (not A!!!!), you should either see the contents of A and A does not exist, or the contents of B and YOU DO NOT CARE WHAT IS IN A!!!!!

    However as EXT4 is implemented currently, you can get another result: B will contain a partially-written version of A (typically an empty file) and A does not exist. This is VERY BAD, as most likely there was interesting information in B that was copied to A but now both are lost.

    fsync as suggested will ensure that the contents of A are correct before the rename(). This is a stronger requirement. It is also dreadfully slow because it has to flush the disk.

    fbarrier would be similar. Not as strong as fsync, but it does guarantee that the contents of A are correct before the rename() is committed to disk.

    As a side effect of A being correct, fsync and fbarrier reduce the possible results in B of the rename on EXT4 to the desired set, so it does fix the symptoms.

    However there are problems with this. By far the most important one is that there are probably ten million programs (including ones written in scripting languages) that assume rename() works as stated above and that it is IMPOSSIBLE to add the fsync/fbarrier call to every one of them. And fsync is dreadfully slow as it really forces the disk to be written right then. fbarrier is better but it is enforcing a slightly stricter limit and thus can still be slower for no actual useful gain.

    A pile of locking options on files is not what is needed, despite Windows and VMS doing this. The Unix designers got it RIGHT, 30 years ago, with a correct and very small set of file operations that do what is really wanted (though I would add some sort of atomic create where the file does not appear until closed). Modern designers in OSS and at Microsoft should start showing a little humility and stop throwing all their highschool-level ideas into the designs without some research.

  115. Reminds me of reiserfs by Anonymous Coward · · Score: 0

    Hey, this situation reminds me of reiserfs. That filesystem also used to just randomly zero out some data when it crashed while writing.

    I find it pretty convenient that ext3 leaves files in some "useful" state even when it crashes in the middle of a write.

    I guess the Posix lawyers haven't defined exactly what "useful" is yet, and thus this feature is about to get lost in the next release of the filesystem. Too bad, actually...

  116. faulty operating system by speedtux · · Score: 1

    "It's a consequence of not writing software properly."

    When I write a file, I expect it to be stored quickly and reliably. Any operating system that doesn't do that is faulty. It's nice if an operating system manages to get a bit of extra performance through some clever caching, but that is secondary.

  117. Re:Exactly by spitzak · · Score: 1

    POSIX implies that rename() (well actually link()) is atomic. This breaks that assumption, as far as most programs are concerned.

    Yes, you can redefine what "atomic" means in order to somehow imply that EXT4 is obeying it. I mean, we could say that each letter is individually changed, and thus that it is ok if the crash leaves only the first letter of the file name changed.

    This violates POSIX for all practical understandings of the text. It has nothing to do with write(), it is rename()/link() that is at fault.

  118. Again i learned something by drolli · · Score: 1

    I cite from Ansi C:
    ----
    The fclose function causes the stream pointed to by stream to be
    flushed and the associated file to be closed. Any unwritten buffered
    data for the stream are delivered to the host environment to be
    written to the file; any unread buffered data are discarded. The
    stream is disassociated from the file. If the associated buffer was
    automatically allocated, it is deallocated.
    ----

    As you see, ANSI C says only 'delivered to the host environment to be written to the file' and not 'return on the host environment having completed the write'.
    Well. Again something i did not realize before. I was always under the impression that an fflush also does the underlying synchronizations (however i usually don't use streams because i am aware of the fact that the additional buffering is unnecessary, and using unneeded libs is always a source of error.) But the documentation says, as sad as i am, that it is impossible to write a program in ANSI C which can determine at any point during its runtime whether a specific file was written to the disk. While this makes me strongly doubt what exactly the people writing the standard have been thinking, it is hardly the fault of the file system if a standard library based on system calls, implemented *correctly* according to another standard, has an undefined behaviour in a certain situation....

    And as hard as your mental trauma with XFS may be, i dont believe that it contributes to this discussion, besides that new FS should be taken with care if you dont need them. I started to use ext3 1.5 years after it entered the stable kernel and i plan to use ext4 not before 2010.

    1. Re:Again i learned something by grumbel · · Score: 1

      But the documentation says, as sad as i am, that it is impossible to write a program in ANSI C which can determine at any point during its runtime whether a specific file was written to the disk.

      But that's not the fault of the ANSI-C standard, it is the fault of the implementation, since it would be perfectly free to do an fsync() on a fclose() and thus work as expected.

      besides that new FS should be taken with care if you dont need them.

      The troublesome part with this whole mess is that, so far, it is not considered a bug.

    2. Re:Again i learned something by drolli · · Score: 1

      But the documentation says, as sad as i am, that it is impossible to write a program in ANSI C which can determine at any point during its runtime whether a specific file was written to the disk.

      But that's not the fault of the ANSI-C standard, it is the fault of the implementation, since it would be perfectly free to do an fsync() on a fclose() and thus work as expected.

      No. If the specification does not specify it, then don't rely on it. Not to implement something not specified is *not* a bug. And forgive my insistence: "as expected" is a term which is highly undefined if it goes beyond the documented behavior. A programmer who writes a data acquisition daemon running on a reliable machine, but using a significant part of the io bandwidth (e.g. network cable), may "expect" something different on something not mentioned than the desktop programmer who is confronted with users sometimes turning their laptops off to shut them down faster (i do that sometimes, e.g. when network fs are hanging, and never had a data loss using ext3). The first programmer may curse the stupid idiot who decided that hundreds of files which he could write out without problems before now always cause the head of the hard drive to hop like mad, because someone decided that it is "reasonable" to fsync them at funny places. And maybe if you are so great in expecting things: should it be an fsync or an fdatasync?

    3. Re:Again i learned something by grumbel · · Score: 1

      But the documentation says, as sad as i am, that it is impossible to write a program in ANSI C which can determine at any point during its runtime whether a specific file was written to the disk.

      So we should all just happily accept it that the filesystem shreds our files? I don't think so, I'd choose a filesystem that doesn't produce this behaviour.

      because someone decided that it is "reasonable" to fsync them at funny places.

      Then implement an "I am ok with file shredding" flag as a mount option or whatever. A filesystem should be safe by default, not tuned for max performance at the cost of file safety.

      Not to implement something not specified is *not* a bug.

      It's not a bug, it's a feature... yeah, we used that to ridicule Microsoft. There is nothing in POSIX that says you *have to* shred the user's files; older filesystems like ext3 or reiserfs did not have those problems. If ext4 now has them, it's ext4's fault and nobody else's. Pointing to the spec isn't an excuse for creating a broken filesystem, since in the real world it's important that it works, not that it's 'ok' by some funky interpretation of the spec.

      should it be an fsync or an fdatasync?

      I don't care, give me an implementation that works properly with ANSI-C and I am happy, everything else I consider broken.

    4. Re:Again i learned something by drolli · · Score: 1

      I think i don't want to continue the discussion. At no point did i state that there is no problem, but i explained in pretty much detail *where* i see the problem and backed it by documentation. I am really not willing to discuss with some man-page illiterate on the basis of what he feels in his gut how the world should be. To clarify that: Yes, i personally would not choose 100s as a commit interval - this is a "shred my data option". I do not consider it a wise choice, because i know the usual quality of software. But insisting on *not* using fsync where needed (because it is a posix function) and claiming that some filesystem should guess what was omitted from the ANSI C standard and what the programmer could have meant, despite no indication in the manual where this AI should sit, is a little strange....

  119. Filesystem transactions by DragonHawk · · Score: 1

    "Telling application developers to use a database is bullshit."

    I'm not telling application developers to use a database; I'm explaining what's driving a remark others have made. Application developers can use whatever suits their need. If a database is what they want, then sure, use one. If something else is better, use that.

    "A open-write-close-rename sequence merely asks for atomicity without durability, something that's perfectly reasonable."

    You may think it's perfectly reasonable, but you're asking for atomicity across multiple operations. So really what you want is transactions. To the best of my knowledge, neither Linux, nor Windows, nor Mac OS X, nor any of the BSDs, offer transactions in the filesystem layer. I've always thought such would be a good idea, but I don't think it exists.

    Further, even if filesystem transactions did exist, the application would have to request it. There's no way for the OS to magically divine what an application considers a filesystem transaction to be; the application has to tell it. So the order of operations would need to be begin-open-write-close-rename-commit.

    "all the application wants is for either the old version of a file or the entire new version to appear on a reboot"

    Then the application should call fsync on the new file before removing the old one. That's the only mechanism the POSIX specification provides to guarantee something has been committed to disk. It may be more than the application really wants or needs, but it's all POSIX provides. One can argue POSIX should do more, of course. More on that below.
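
    For illustration, here is a minimal sketch of the open-write-fsync-close-rename sequence in Python. The helper name replace_file_durably and the ".tmp-" prefix are made up for this example; os.fsync and os.replace are the standard library wrappers around fsync() and rename(). It assumes the temp file lives in the same directory as the target, so the rename never crosses filesystems.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Write data to a temp file, fsync it, then rename it over path.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file is created next to the target so rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's userspace buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to put the data on disk
        # Only now is the old file replaced; os.replace maps to rename() on POSIX.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave no stray temp file behind if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

    Dropping the os.fsync line gives the plain open-write-close-rename variant this thread is arguing about.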

    "He doesn't care at the instant of the rename whether that replacement has been recorded on disk ..."

    Actually, yes he does, because the operation he's requesting is to destroy the only known-good file. It's not the OS's fault that the programmer didn't actually make sure his new copy was good before he destroyed the old one.

    The programmer may have intended for the OS to make sure the new copy was good, but he never asked it to do so (i.e., with fsync).

    "asking for that same durability in a multi-file configuration setup is just stupidly degrading performance."

    So, barring new system calls for filesystem transactions, what should the filesystem do, then? Serialize all I/O operations? Now you're destroying the I/O scheduler and killing multitasking performance.

    Maybe there's another option here that I'm not seeing.

    "open-write-close-rename is saying something fundamentally different from open-write-fsync-close-rename"

    Yes, one is safe, the other is unsafe.

    I think the problem here is you're implying semantics which don't actually exist in the OS or its interface specification. Programming by "gee I really wish things worked this way" is a bad way to do things.

    Now, maybe you want to make the argument that the OS should provide transactions. I'd even agree with you. But one doesn't write code based on a feature request; you write code based on what the system actually does.

    Or am I missing something else?

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
    1. Re:Filesystem transactions by QuoteMstr · · Score: 1

      Read some of my other comments in this article. If you have read them, you haven't understood them.

      I'm not asking for transactions, though they'd be nice too. I'm saying that filesystems should respect the implicit write barrier that rename has always represented. UFS with soft-updates maintains a dependency graph and ensures data blocks are written before meta-data ones. Old non-journaled filesystems managed to get rename right anyway. ext3 manages to do the right thing as well. ZFS goes far beyond what I'm proposing, actually, and guarantees the relative order of every write. The problem filesystems are XFS, JFS, and ext4.

      I'm not telling application developers to use a database; I'm explaining what's driving a remark others have made. Application developers can use whatever suits their need. If a database is what they want, then sure, use one. If something else is better, use that.

      It's absolutely ridiculous to require the full power of a database in order to atomically update the content of a file with decent performance.

      Actually, yes he does, because the operation he's requesting is to destroy the only known-good file. It's not the OS's fault that the programmer didn't actually make sure his new copy was good before he destroyed the old one.

      No: taken as a whole, the operation the programmer is requesting is to atomically replace the content of the file. He doesn't need that to happen now. The programmer only requires that the file being manipulated is not left in an invalid state if and when the operation is committed to disk; i.e., atomicity. fsync expresses an entirely different requirement, that the data be written now; i.e., durability.

      Data-before-rename isn't just an automatic fsync when rename is called. That's one way of implementing a barrier, but far from the best. Far better would be to keep track of all outstanding rename requests, and flush the data blocks for the renamed file before the rename record is written out. The actual write can happen far in the future, and these writes can be coalesced.
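
      The ordering described above can be sketched with a toy in-memory model. This is not how ext3, ext4, or any real filesystem is implemented; the class ToyJournal and all of its members are invented for this example. It only shows the rule "data blocks first, rename records second" being applied at commit time rather than at rename time.

```python
class ToyJournal:
    """Toy model of a data-before-rename barrier (hypothetical, in memory only)."""

    def __init__(self):
        self.dirty_data = {}       # file name -> data not yet on "disk"
        self.pending_renames = []  # (src, dst) pairs not yet on "disk"
        self.disk_log = []         # order in which things reach "disk"

    def write(self, name, data):
        # Writes are only buffered; nothing reaches the disk log yet.
        self.dirty_data[name] = data

    def rename(self, src, dst):
        # Renames are only queued; no flush is forced at this point.
        self.pending_renames.append((src, dst))

    def commit(self):
        # Phase 1: flush the data of every file that has a pending rename.
        for src, _dst in self.pending_renames:
            if src in self.dirty_data:
                self.disk_log.append(("data", src))
                del self.dirty_data[src]
        # Phase 2: only now write out the rename records, all together.
        for src, dst in self.pending_renames:
            self.disk_log.append(("rename", src, dst))
        self.pending_renames.clear()
```

      In this model a crash before commit leaves every old file untouched, and a crash between the two phases leaves the data written but the old names still pointing at the old contents.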

      In fact, you can't even implement what I'm talking about in terms of filesystem transactions. Barriers are separate beasts.

      Say you're updating a few hundred small files. (And before you tell me that's bad design: I disagree. A file system is meant to manage files.) If you were to fsync before renaming each one, the whole operation would proceed slowly. You'd need to wait for the disk to finish writing each file before moving on to the next, creating a very stop-and-go dynamic and slowing everything down.

      On the other hand, if you write and rename all these files without an fsync, when the commit interval expires, the filesystem can pick up all these pending renames and flush all their data blocks at once. Then it can write all the rename records, at once, much improving the overall running time of the operation.

      The whole thing is still safe because if the system dies at any point, each of the 200 configuration files will either refer to the complete old file or the complete new file, never some NULL-filled or zero-length strangelet.

    2. Re:Filesystem transactions by GryMor · · Score: 1

      And the point is that in the presence of write reordering and the absence of fsync, there is no guarantee, express or implied, about the state of the tmp file's contents before you clobber the old file by renaming the new file.

      Rename doesn't say ANYTHING about the status of writes to a file's contents. The two are entirely uncoupled and any expectation to the contrary was anecdotal and wrong.

      --
      Realities just a bunch of bits.
    3. Re:Filesystem transactions by QuoteMstr · · Score: 1

      *sigh*

      Can you people not grasp basic logic?

      "If P, Q" does not imply "if not P, not Q".

      "If the documentation guarantees something, the implementation should guarantee it" does not imply "if the documentation does not guarantee something, the implementation should not guarantee it."

      Nor does "there is no stated guarantee now" imply "there should be no stated guarantee in the future."

      The simple truth is that any self-respecting journaling filesystem should guarantee atomicity on rename.

    4. Re:Filesystem transactions by GryMor · · Score: 1

      The rename IS atomic, that is guaranteed. What isn't guaranteed is any other transaction that you haven't asked for a guarantee on. In a sane world we would have file system transactions so you could make a collection of filesystem operations atomic without requiring that they happen at any particular time.

      The application code in question has a bug, even if the filesystem was a bit more gentle in letting you know that, the application would STILL have a bug. It's doing a BAD thing.

      --
      Realities just a bunch of bits.
  120. Filesystem order of operations by DragonHawk · · Score: 1

    "POSIX may allow it, but I was under the impression that filesystems should try and remain in a sane state."

    You're asking for all I/O operations to be done serially. Linux doesn't do this today, and I don't think it has for more than a decade. Most OSes don't do this. The reason is performance. If you've got a bunch of writes to do in one part of the disk, you do them all there, and then do all the other writes for another part of the disk. Thus writes can be done out-of-order. This is called "I/O scheduling" or "elevator algorithm". If you've got multiple tasks doing serious I/O to the disk, you really want it.
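
    A rough sketch of the elevator idea, assuming each pending write is identified by nothing more than a block number. The function elevator_order is made up for this example and is a simplified single sweep, not the scheduler Linux actually ships.

```python
def elevator_order(requests, head):
    """Return block numbers in the order a simple elevator sweep would service them.

    requests: block numbers in the order they were submitted.
    head: current position of the disk head.
    """
    # Sweep upward from the head position first...
    upward = sorted(block for block in requests if block >= head)
    # ...then come back down through whatever lies behind the head.
    downward = sorted((block for block in requests if block < head), reverse=True)
    return upward + downward
```

    Submitted as 95, 10, 60, 12, 58 with the head at block 50, the requests are serviced as 58, 60, 95, 12, 10, which is why the on-disk order need not match the order the application issued them in.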

    If you want a way for an application to request a group of operations to be done atomically, that's called a transaction. I wrote about that in my cousin post.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
    1. Re:Filesystem order of operations by swilver · · Score: 1

      No, that is not what I'm asking.

      I'm asking that when the filesystem does a sync (or commit or whatever you want to call it), that it commits EVERYTHING up to a point, not half. There's still plenty of room to reorder writes within those limits. For example, if I have 500 operations, and the filesystem is required to sync, then it can reorder those 500 operations all it wants. As long as when it's done, all of them are committed (or none). Not all of them except #399 cause it didn't like that one.

  121. That's not what that means--any of it by Rozzin · · Score: 1

    This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth.

    First off, PC hardware is not robust against sudden power-losses: it's literally possible for a write to be `half-done' inside the HDD, and no amount of higher-level `protection' can do anything about that.

    Secondly, the atomicity of rename() (or any other operation) isn't contagious: rename() does an unlink() and a link() in `one operation' (and that's the only amount of atomicity rename() claims--it's even specifically not atomic in that it's possible to see the original link and the new link at the same time), but that doesn't have any impact on--or even any relation to--other `nearby' operations.

    write() and rename() aren't even operating on the same object--write() is operating on the file, and rename() is operating on the directory (or directories that do/will contain links to the file). I don't quite get the `metadata' arguments, because `filenames' aren't `file metadata'--they're directory data, which is what allows you to have any number of links to the same file from any number of directories. File-metadata are things like timestamps, ownership, permissions....
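
    A small sketch of the point that names live in directories rather than in the file. The function link_demo and the file names are invented for this example, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def link_demo():
    """Give one file two names, then rename one of them (illustrative only)."""
    directory = tempfile.mkdtemp()
    first = os.path.join(directory, "first-name")
    second = os.path.join(directory, "second-name")
    with open(first, "w") as f:
        f.write("contents")

    # A second directory entry for the same file; no data is copied.
    os.link(first, second)
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink

    # Renaming touches only the directory entry; the inode stays the same.
    inode_before = os.stat(second).st_ino
    renamed = os.path.join(directory, "renamed")
    os.rename(second, renamed)
    inode_after = os.stat(renamed).st_ino

    return same_inode, link_count, inode_before == inode_after
```

    Both names resolve to the same inode, and the rename changes which name the directory holds without touching the file the name points at.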

    Lastly, having said all that: the reason that we go through the write-close-rename sequence is to prevent a race-condition while the system is running, and (to a lesser extent) to guard against failure of the acting process itself, not failure of the system as a whole.

    --
    -rozzin.
  122. But the app is doing something insane! by Grendel+Drago · · Score: 1

    The apps are doing this:

    1. Open the file.
    2. Delete the file (O_TRUNC).
    3. Write data to the file.

    Writing steps two and three to the disk is not generally bunched. It's certainly not an atomic operation. The fact that this ever worked is nigh miraculous.
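
    The three steps above can be sketched as follows. The function truncate_then_write is made up for this example; it only makes the window between the truncate and the write visible and says nothing about when either reaches the disk.

```python
import os
import tempfile


def truncate_then_write(path, data):
    """Open an existing file with O_TRUNC, then write new data to it.

    Returns the file size observed after the open and before the write.
    """
    # Steps 1 and 2: opening with O_TRUNC discards the old contents at once.
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        # Here the old data is gone and the new data has not been written.
        size_in_window = os.fstat(fd).st_size
        # Step 3: the replacement contents arrive as a separate operation.
        os.write(fd, data)
    finally:
        os.close(fd)
    return size_in_window
```

    The size seen in the window is zero, so anything that interrupts the process between the two calls leaves an empty file even before disk write ordering comes into it.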

    I'm reminded of the transition from bash to dash for the default /bin/sh in Ubuntu; people relied on nonstandard behavior for convenience, and when that was taken away, dash was blamed, people were going to Leave! Linux! Forever!, and so on.

    (This example was extraordinarily poorly handled; it should have been done like Debian Lenny did it, with a lot of lead time and making sure that everything worked as it should.)

    Of course ext4 shouldn't be released without the workaround, but applications need to actually handle their I/O, not chuck a bunch of stuff at the disk and act surprised when it's not guaranteed to be properly transaction-y. If this is "fixed" in the filesystem (as the current patches do), it is fixed by making the entire filesystem be careful about what gets written to disk immediately. The filesystem can't know what's vital to write atomically; the app must tell it.

    --
    Laws do not persuade just because they threaten. --Seneca
  123. RAID? by Slashdot+Parent · · Score: 1

    Umm.. Isn't the entire idea of RAID that if a disk fails in your array, it does not cause catastrophic failure?

    Every time I've ever had a disk failure, I find out about it in an email, and think to myself, "Hmmm... really should get to the store soon to buy a new HD, eh? ..."

    Certainly never had to wipe an array as a result of one measly disk failure. A single disk failure should never be an emergency.

    --
    They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock