We had some issues with not adding enough randomness in embedded devices, but that problem was largely fixed a year ago. At this point, I think urandom should be fine for session keys. It's not the best choice for long-lived keys in those embedded devices, but those devices (a) don't have RDRAND, since they tend to use MIPS or ARM CPUs, and (b) since they don't have any peripherals other than the flash drive and the networking cards, there isn't that much entropy they can draw upon. There are things you can do to improve things in userspace, such as holding off on generating the host keys and generating the RSA keys for the certificates as long as possible, instead of right after the boot. But that's much more of a specialized problem for a specific class of system.
How would they detect any shared properties? The point is that they are providing a random number generator (not a stream of random numbers) which is supposedly "secure". Secure means that no one, including the person providing the RNG, can predict the stream of numbers coming from the RNG. If the RNG coming from the US source is not honest, that means that presumably the NSA can predict the stream of numbers coming out of the RNG. But the NSA (assuming that it distrusts the KGB and the MSS) wouldn't want the KGB and the MSS to be able to carry out the same feat. The same is true for each of the other devices. So there's no way that any one of the actors should be able to detect any shared properties --- that's the point of the proposal.
Now, if the NSA is able to gimmick the RNG coming from China, then that's a different story. And to the extent that many electronics are designed in the US and then manufactured in China, that's certainly a concern. In order for a scheme like this to work, the parts would have to be designed and built in such a way that an outsider would believe that the NSA couldn't have possibly gimmicked an RNG, even if it could have been gimmicked by another spy agency. Then combine this with a device that you're sure couldn't have been gimmicked by the MSS, but may have been subject to pressure from the NSA, and so on.
The random driver has changed significantly since July 2012, which is when we were given a heads up about the paper described at http://factorable.net/ and which is also when I took back maintainership of the /dev/random driver. We gather entropy at every single interrupt, and mix it into the entropy pool. This is done unconditionally; you can't disable it, which is what happened with the SA_SAMPLE_RANDOM flag.
The thing about entropy pools is that when you combine entropy sources, the result gets better, not worse. So the best thing would be if we had hardware random number generators sourced from China, Russia, and the USA. Since presumably the MSS, KGB, and the NSA mutually distrust each other, if we combine the entropy from those three sources, the result will be stronger than any one alone.
This is why I don't recommend using RDRAND directly. Sure, an honest (emphasis on honest) hardware random number generator will always be able to source higher quality entropy than anything we can do by sampling OS events, such as interrupts. But the problem is it's hard to guarantee that a HWRNG is really honest, especially given the Snowden revelations, which seem to indicate the NSA has successfully leaned on at least one chip manufacturer. If you must use RDRAND, I'd recommend generating a random key via some other means, and then encrypting the output of RDRAND with that random key before using the resulting randomness for session keys, etc. Or better yet, do what we do in /dev/random, which is to mix RDRAND with other sources of entropy.
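A minimal sketch of the "mix several sources" idea, in Python. This is not the kernel's mixing function; it simply hashes the inputs together with SHA-256, on the assumption that the hash behaves well and the sources are independent, so the result should be unpredictable as long as at least one input is. The names `mix_sources`, `hwrng_bytes`, and `os_event_bytes` are hypothetical, and `os.urandom` only stands in for real hardware and OS-event samples.

```python
import hashlib
import os


def mix_sources(*sources: bytes) -> bytes:
    """Hash several entropy inputs into one 32-byte seed.

    Each input is length-prefixed so that (b"ab", b"c") and (b"a", b"bc")
    do not produce the same hash input by simple concatenation.
    """
    h = hashlib.sha256()
    for chunk in sources:
        h.update(len(chunk).to_bytes(8, "big"))
        h.update(chunk)
    return h.digest()


# Hypothetical stand-ins: in a real system these would be bytes read from
# a hardware RNG and bytes gathered from OS events such as interrupts.
hwrng_bytes = os.urandom(32)
os_event_bytes = os.urandom(32)

seed = mix_sources(hwrng_bytes, os_event_bytes)
```

Even if one input were fully known to an attacker (say, a gimmicked hardware source), the seed would still depend on the other input, which is the property the text is after.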
What I said is that /dev/urandom is much more important to get right than /dev/random. Realistically, far more programs use /dev/urandom than use /dev/random. GPG uses /dev/random for long-term key generation, but in terms of generating certs, creating session keys, etc., /dev/urandom is far more important.
If you trust Intel not to have gimmicked RDRAND, by all means, feel free to use it. Please do it in open source, though, so I can fix said program not to.....
Also, I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.
So before I tried agitating for programmers to fix their buggy applications, I had already implemented both the heuristic that XFS uses (if you truncate a file descriptor, add an implicit fsync on the close of that fd), and in addition I had implemented another heuristic (if you rename on top of an existing file, fsync the source file of the rename). This was to work around buggy applications, and as you can see, ext4 does even more than XFS does.
At the end of the day, though, the heuristic can sometimes get things wrong, and sometimes the heuristic will be too aggressive in forcing fsync()'s when it's not really necessary, which is why it's good to at least try to educate application programmers about something which even you agree shouldn't be a new thing.
(For example, if you don't fsync, and you want to run your application on another OS, like say, Solaris, you will be very sad.)
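For reference, a minimal sketch of what a well-behaved application does when it replaces a file, so that it does not depend on any filesystem heuristic: write to a temporary file, fsync it, and only then rename it over the old one. The helper name `atomic_write` is hypothetical; the calls used (`os.fsync`, `os.replace`, `tempfile.mkstemp`) are standard library, and the directory fsync at the end assumes a POSIX system.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves either old or new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to stable storage
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync before the rename is the step the buggy applications skip; without it, nothing in POSIX says the new data reaches the disk before the rename does.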
But it wasn't backside covering. Although most people don't seem to realize it, FIRST I added the heuristics to work around the buggy code, and THEN I agitated for people to fix their d*mn code. But application programmers don't like being told that they are wrong, so this seems to be a case of "blame/shoot the messenger" --- with me having been cast into the role of the messenger.
I'm aware that ext4 can run without a journal, but isn't that functionally equivalent to leaving it as ext2?
With ext4 you get the benefits of extents, delayed allocation, and other new-to-ext4 features. You also get directory hash trees, which were introduced in ext3 and are therefore not in ext2. Running without the journal means you have to run a full fsck after an unclean shutdown, but you still get all of the new features and performance improvements of ext4.
Read the answer to the FAQ very carefully. In fact, they agree with me:
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.
In certain cases it might make sense to turn off barriers and disable write caches, if you are writing huge amounts of bulk data and very little metadata in a RAID array --- and that is what XFS is optimized for. But they didn't say anything which contradicted what I said, although the conclusions might have been a little confusing and not necessarily applicable in workloads other than XFS's original design point of really big RAID arrays to support writing really big data sets.
You may be correct in saying that if you compare the guts of Soft Updates with that of (say) the JBD/JBD2 layer in Linux, which is what is responsible for handling the physical block journalling for ext3/ext4, the complexities involved might not be that different.
However, the difference comes when someone adds ACL support, or some other fs feature. When you are using physical block journalling, all you need to know is how many blocks a particular fs operation needs to dirty. That's it! With Soft Updates, you need to understand dependency diagrams and write code to implement rollbacks, etc. The person who is implementing the file system feature has to do many more things.
Now there are certainly downsides to doing physical block journalling. If you have workloads which are very high in metadata operations, physical block journalling will hurt. On the other hand, it's not clear how common such workloads are (although you can certainly find benchmarks that will stress that particular usage pattern). And in the face of hard drive errors, physical block journals can sometimes be better at recovering from certain failures than logical journalling or soft updates.
Like many things, there are always tradeoffs around, and if the goal is to play the "my file system has a longer d*ck" game, it's almost always possible to find some benchmark which "proves" that one file system is better than another. Yawn...
So Canonical has never reported this bug to LKML or to the linux-ext4 list as far as I am aware. No other distribution has complained about this > 512MB bug, either. The first I heard about it was when I scanned the Slashdot comments.
Now that I know about it, I'll try to reproduce it with an upstream kernel. I'll note that in 9.04, Ubuntu had a bug which, as far as I know, must have been caused by their screwing up some patch backports. Only Ubuntu's kernel had a bug where rm'ing a large directory hierarchy would have a tendency to cause a hang. No one was able to reproduce it on an upstream kernel.
I will say that I don't ever push patches to Linus without running them through the XFS QA test suite. (Which is now generalized enough so it can be used on a number of file systems other than just XFS). If it doesn't have a "write a 640 MB file and make sure it isn't corrupted" test, we can add it and then all of the file systems which use the XFS QA test suite can benefit from it.
(I was recently proselytizing the use of the XFS QA suite to some Reiserfs and BTRFS developers. The "competition" between file systems is really more of a fanboy/fangirl thing than at the developer level. In fact, Chris Mason, the head btrfs developer, has helped me with some tricky ext3/ext4 bugs, and in the past couple of years I've been encouraging various companies to donate engineering time to help work on btrfs. With the exception of Hans Reiser, who has in the past accused me of trying to actively sabotage his project --- not true as far as I'm concerned --- we all are a pretty friendly bunch and work together and help each other out as we can.)
So I'm an engineer, and not an academic. I'm not trying to get a Ph.D. The whole Keep it Simple, Stupid principle is an important one, especially as you say, "Journalling and Soft Updates have similar performance characteristics."
If sometimes Journalling posts better benchmarks, and sometimes Soft Updates produces better results, but Soft Updates is hideously more complex, thus inhibiting new features such as ACL's and Extended Attributes (which appeared in BSD much later than Linux, and I think Soft Updates made it much harder to find people capable of extending the file system) --- then the choice of the simpler technology seems to be obvious. The performance gains are a toss up, and using a hideously complex algorithm for its own sake is only good if you are an academic gunning for a Ph.D. thesis or a paper publication, or if you are trying to ensure job security by implementing something so hard to maintain that only you and a few other people can hack it.
What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption!
Journaling, and every other filesystem, has exactly the same problem. If consistency is required, YOU MUST DISABLE THE CACHE, unless it is battery-backed, or you are willing to depend on your UPS. This is the penalty we take for devices which lie to the OS about flush operations and the like.
Yes, there were, in the bad old days, devices which lied when the OS sent a flush cache command, and in order to get a better Winbench score, they would cheat and not actually flush the cache. But that hasn't been true for quite a while, even for commodity desktop/laptop drives. It's quite easy to test; you just time how many single-sector writes, each followed by a cache flush command, you can send per second. In practice, it won't be more than, oh, 50-60 write barriers per second. In general, if you use a reputable disk drive, it supports real cache flush commands. My personal favorites are Seagate Momentus drives for laptops, and I can testify to the fact that they all handle cache flush commands correctly; I have quite a collection and it's really not hard to test.
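A rough userspace approximation of that test, as a sketch: rewrite one 512-byte sector and call fsync() in a loop, then report how many write-plus-flush pairs completed per second. The function name `flushes_per_second` is hypothetical, and fsync() on a file is only a proxy for sending cache flush commands to the drive; whether the flush actually reaches the device depends on the filesystem, its barrier settings, and any virtualization layer in between.

```python
import os
import tempfile
import time


def flushes_per_second(path: str, duration: float = 1.0) -> float:
    """Count how many 512-byte write + fsync pairs complete per second."""
    block = b"\0" * 512
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    count = 0
    start = time.monotonic()
    try:
        while time.monotonic() - start < duration:
            os.pwrite(fd, block, 0)   # rewrite the same single sector
            os.fsync(fd)              # request a flush to stable storage
            count += 1
    finally:
        os.close(fd)
    elapsed = time.monotonic() - start
    return count / elapsed
```

Run it against a file on the disk under test (not on a tmpfs), for example `flushes_per_second("/mnt/test/probe", 5.0)` with a path of your choosing. On a rotating drive, a result far above the 50-60 per second mentioned above would suggest the flush is not reaching the platters.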
The big difference between journalling and soft updates is we can batch potentially hundreds of metadata updates into a single journal transaction, and send down a single write barrier every few seconds. The journal commit is an all-or-nothing sort of thing, but that gives us reliability _and_ performance.
The problem with soft updates is that the relative ordering of most (if not all) metadata writes is important. And putting a write barrier between each metadata write is Slow And Painful. Yes, you can disable the write cache, but then you give up a huge amount of performance as a result. With journaling we can get the performance benefits of the write cache, but we only have to pay the cost of enforcing write ordering through the barrier once every few seconds.
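To put rough numbers on that difference, here is a back-of-the-envelope sketch. The figures are assumptions for illustration only: 300 metadata updates in one commit interval, and a drive that completes about 50 barriers per second. It counts only time spent waiting on barriers and ignores everything else, including the cost of writing the journal itself.

```python
def barrier_seconds(num_barriers: int, barriers_per_second: float) -> float:
    """Time spent waiting on cache-flush barriers."""
    return num_barriers / barriers_per_second


# Assumed, illustrative numbers (not measurements).
updates = 300
rate = 50.0

per_update = barrier_seconds(updates, rate)  # one barrier after every update
batched = barrier_seconds(1, rate)           # one barrier per journal commit

print(f"barrier per update: {per_update:.2f} s")  # 6.00 s
print(f"one journal commit: {batched:.2f} s")     # 0.02 s
```

Under these assumptions, ordering every update with its own barrier costs seconds of waiting, while the batched commit costs a few hundredths of a second.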
Of course, there are workloads where soft updates plus a disabled write cache might be superior. If you have a very metadata-intensive workload that also happens to call fsync() between nearly every metadata operation, then it would probably do better than a physical block journalling solution that used barrier writes but ran with an enabled write cache. But in the general case, if you take a more normal workload where fsync()'s aren't happening _that_ often, and compare physical block journalling with a write cache and barrier ops against a Soft Updates approach with the write cache disabled, I'm pretty sure the physical block journalling approach will end up benchmarking better.
>I mount these read-only in the interests of security, but that means, of course, that I can't have journalling on them, which precludes the use of ext3 or 4.
#1. You can mount ext3 file systems read-only. The journal doesn't preclude a ro mount.
#2. ext4 supports running without a journal. Google engineers contributed that code to ext4 last year.
So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications that don't use fsync() will also lose information after a buggy proprietary Nvidia video driver crashes your machine, regardless of whether you are using XFS or ext4.
If you are talking about the change to _ext3_ to use data=writeback, that was a change that Linus made, not me, and ext4 has always defaulted to data=ordered. Linus thought that since the vast majority of Linux machines are single-user desktop machines, the performance hit of data=ordered, which is designed to prevent exposure of uninitialized data blocks after a crash, wasn't worth it. I and other file system engineers disagreed, but Linus's kernel, Linus's rules. I pushed a patch to ext3 which makes the default a config option, and as far as I know the enterprise distros plan to use this config option to keep the defaults the same as before for ext3.
Since it was my choice, I actually changed the defaults for ext4 to use barrier=1, which Andrew Morton vetoed for ext3 because, again, he didn't think it was worth the performance hit. But with ext4, the benefits of delayed allocation and extents are so vast that they completely dominate the performance hit of turning on write barriers. That is where most of the performance benefits for ext4 come from, and it is very much a huge step forward compared to ext3.
So with respect, you don't know what you are talking about.
So there's a major problem with Soft Updates, which is that you can't be sure that data has hit the disk platter and is on stable store unless you issue a barrier operation, which is very slow. What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption! The disk drive, especially modern ones with large caches, can reorder writes which are sent to the disk, sometimes (with the right pathological workloads) for minutes at a time. You won't notice this problem if you just crash the kernel, or even if you hit the reset button. But if you pull the plug or otherwise cause the system to drop power, data in the disk's write cache won't necessarily be written to disk. The problem that we saw with journal checksums and ext4 only showed up on a power drop, because there was a missing barrier operation, so this is not a hypothetical consideration.
In addition, if you have a very heavy write workload, the Soft Updates code will need to burn a fairly large amount of memory tracking the dependencies and burn quite a bit of CPU figuring out which dependencies need to be rolled back. I'm a bit suspicious of how well they perform and how much CPU they steal from applications --- which granted, may not show up in benchmarks which are disk bound. But if the applications or the large number of jobs running on a shared machine are trying to use lots of CPU as well as disk bandwidth, this could very much be an issue.
BTW, while I was doing some quick research for this reply, it seems that NetBSD is about to drop Soft Updates in favor of a physical block journaling technology (WAPBL), according to Wikipedia. They didn't give a reference for this, nor did they say why NetBSD was planning on dropping Soft Updates, but there is a description of the replacement technology here: http://www.wasabisystems.com/technology/wjfs. But if Soft Updates is so great, why is NetBSD replacing it and why did FreeBSD add a file system journaling alternative to UFS?
Actually FFS with Soft Updates is only about preserving file system metadata so they don't require fsck's. BSD with FFS and Soft Updates still pushes out meta-data after 5 seconds, and data blocks after 30 seconds. Soft Updates only worries about metadata blocks, and not data blocks.
In fact, after a crash with FFS you can sometimes access uninitialized data blocks that contain data from someone else's mail file, or p0rn stash. This was the problem which ext3's data=ordered was trying to solve; unfortunately it does so by making fsync==sync, which also had the unfortunate side effect of making people think that fsync()'s always had to be slow. It doesn't have to be, if it's properly implemented --- but I'll be the first to admit that ext3 didn't do a proper job.
It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.
Patches to work around buggy applications which don't call fsync() had been around long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.
And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc ---- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.
If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.
Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.
It also depends on what type of filesystem you use. A journaling filesystem like ext3 can wear down a disk a lot faster than a non-journaling filesystem.
Not true. If you have a decent SSD that doesn't have Write Amplification problems (such as the X25-M), the extra overhead of journalling really isn't that bad. I wrote about this quite recently on my blog.
So interested people want to know --- how do you get the "insider" information from an X25-M (i.e., total amount of writes written, and number of cycles for each block of NAND)?
I've added this capability to ext4, and on my brand-spanking new X25-M (paid for out of my own pocket because Intel was too cheap to give one to the ext4 developer :-), I have:
<tytso@closure> {/usr/projects/e2fsprogs/e2fsprogs} [maint] 568% cat /sys/fs/ext4/dm-0/lifetime_write_kbytes
51960208
Or just about 50GB written to the disk (I also have a /boot partition which has about half a GB of writes to it).
But it would be nice to be able to get the real information straight from the horse's mouth.
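The conversion from that counter to gigabytes can be sketched as follows. The sysfs path is the one shown above; the helper names are hypothetical, and the code assumes "kbytes" means units of 1024 bytes.

```python
def kbytes_to_gib(kbytes: int) -> float:
    """Convert a count of 1024-byte units to GiB (assumes 1 kbyte = 1024 bytes)."""
    return kbytes / (1024 * 1024)


def read_lifetime_writes(device: str) -> int:
    """Read the ext4 lifetime write counter for a device such as 'dm-0'."""
    with open(f"/sys/fs/ext4/{device}/lifetime_write_kbytes") as f:
        return int(f.read().strip())


print(f"{kbytes_to_gib(51960208):.1f} GiB")  # 49.6 GiB
```

Note that this counts what the filesystem wrote, not what the SSD did internally, which is why the drive's own numbers would still be nice to have.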
Anyways, writing zeros, or writing something else sequentially should essentially be the same.
No, writing sequentially is not the same as an ATA TRIM command, since the X25-M can't reuse the blocks for real data. It might (or might not) help the internal fragmentation of the X25-M's internal LBA redirection table --- but given that the PC Perspective article pointed out that when things got bad, even a complete write pass across the entire disk was not sufficient to restore performance, I doubt it.
This makes sense, actually; without an ATA TRIM command, if you write the entire disk, the X25-M won't have much in the way of spare room in order for it to do its garbage collection/defragmentation operation. All it will have is the difference between 80 (real) GB (or GiB's for people who like that notation) and 80 (hd marketing) GB's. And apparently that is not enough.
I've had some people suggest that reserving a partition with a few gigs and never using it helps, since that provides some extra room for the X25-M to recover; but I don't have anything authoritative.
But back to the original point, what we really need is a way to tell the disk, "we don't care about the contents of the blocks any more". It *might* be that writing some magic pattern, whether all zeros or all ones, does the trick --- and in fact, all ones makes more sense since an erased flash memory cell returns '1', not '0'. But the key question is whether or not the SSD's firmware treats this as "ok to reuse" or not. And for that we need a definitive answer from Intel.
Can you give me a URL or citation from someone official at Intel who has said this? As near as I can tell, Intel has been very tight-lipped about what the X25-M does internally.
I use 1GB for /boot because I'm a kernel developer and I end up experimenting with a large number of kernels (yes, on my laptop --- I travel way too much, and a lot of my development time happens while I'm on an airplane). In addition, SystemTap requires compiling kernels with debuginfo enabled, which makes the resulting kernels gargantuan --- it's actually not that uncommon for me to fill my /boot partition and need to garbage collect old kernels. So yes, I really do need 1GB for /boot.
As far as LVM, of course I use more than a single volume; separate LV's get used for test filesystems (I'm a filesystem developer, remember), but more importantly, the most important reason to use LVM is that it allows you to take snapshots of your live filesystem and then run e2fsck on the snapshot volume --- if the e2fsck is clean you can then drop the snapshot volume, and run "tune2fs -C 0 -T now /dev/XXX" on the file system. This eliminates boot-time fsck's, while still allowing me to make sure the file system is consistent. And because I'm running e2fsck on the snapshot, I can be reading e-mail or browsing the web while the e2fsck is running in the background. LVM is definitely worth the overhead (which isn't that much, in any case).
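The snapshot-and-check workflow can be sketched as a short script. This is an illustration, not a tested tool: the snapshot name, the default snapshot size, and the function names are all hypothetical, it must run as root, and unlike the description above it drops the snapshot whether or not the check came back clean.

```python
import subprocess


def snapshot_fsck_commands(vg: str, lv: str, snap_size: str = "1G") -> list[list[str]]:
    """Build the commands for checking a live filesystem via an LVM snapshot."""
    origin = f"/dev/{vg}/{lv}"
    snap_name = f"{lv}-fsck-snap"           # hypothetical snapshot name
    snapshot = f"/dev/{vg}/{snap_name}"
    return [
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        ["e2fsck", "-f", "-n", snapshot],    # forced check, no modifications
        ["lvremove", "-f", snapshot],
        ["tune2fs", "-C", "0", "-T", "now", origin],
    ]


def run_snapshot_fsck(vg: str, lv: str) -> bool:
    """Run the plan; reset the fsck counters only if the check is clean."""
    create_cmd, fsck_cmd, remove_cmd, reset_cmd = snapshot_fsck_commands(vg, lv)
    subprocess.run(create_cmd, check=True)
    try:
        clean = subprocess.run(fsck_cmd).returncode == 0
    finally:
        subprocess.run(remove_cmd, check=True)
    if clean:
        subprocess.run(reset_cmd, check=True)
    return clean
```

Because the check runs against the snapshot, the live filesystem stays mounted and usable the whole time; only the final tune2fs touches the real volume.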
It's not obvious to me that X25-M treats a block that has been zero'ed out as an "unallocated block". It could do this, but it's not at all guaranteed that it does this. Do you know for certain (via an Intel specification sheet) that writing all ZERO's is the equivalent of an ATA TRIM?
Flash using MLC cells is good for 10,000 write cycles; flash using SLC cells is good for 100,000 write cycles, and is much faster from a write perspective. The key is write amplification; if you have a flash device with a 128k erase block size, in the worst case, assuming the dumbest possible SSD controller, each 4k singleton write might require erasing and rewriting a 128k erase block. In that case, you would have a write amplification factor of 32. Intel claims that with their advanced LBA redirection table technology, they have a write amplification of 1.1, with a wear-leveling overhead of 1.4. So if these numbers are to be believed, on average, over time, a 4k write might actually cost a little over 6k of flash write. That is astonishingly good.
The X25-M uses MLC technology, and is rated for a life of 5 years writing 100GB a day. In fact, if you have 80GB worth of flash, and you write 100GB a day, with a write amplification and wear-leveling overhead of 1.1 and 1.4, respectively, then over 5 years you will have used approximately 3200 write cycles. Given that MLC technology is good for 10,000 write cycles, that means Intel's specification has a factor of 3 safety margin built into it. (Or put another way, the claimed write amplification factors could be three times worse and they would still meet their 100GB/day, 5 year specification.)
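The arithmetic above can be checked with a few lines. The inputs are the figures quoted in the text; the function names are hypothetical. One caveat: multiplying both overhead factors (1.1 x 1.4) gives roughly 3,500 cycles, while the roughly 3,200 figure corresponds to applying the 1.4 factor alone. Either way the margin against 10,000 cycles comes out at about a factor of 3.

```python
def worst_case_amplification(erase_block_kb: int, write_kb: int) -> float:
    """Amplification if every small write rewrites a whole erase block."""
    return erase_block_kb / write_kb


def cycles_used(daily_gb: float, years: float, capacity_gb: float,
                overhead: float) -> float:
    """Average erase cycles per cell after writing daily_gb every day."""
    total_flash_writes_gb = daily_gb * 365 * years * overhead
    return total_flash_writes_gb / capacity_gb


naive = worst_case_amplification(128, 4)           # 32.0
combined = 1.1 * 1.4                               # about 1.54
flash_kb_per_4k_write = 4 * combined               # about 6.16

cycles_both = cycles_used(100, 5, 80, combined)    # about 3513
cycles_wear_only = cycles_used(100, 5, 80, 1.4)    # about 3194

print(f"worst case amplification: {naive:.0f}x")
print(f"4k write costs about {flash_kb_per_4k_write:.2f}k of flash writes")
print(f"cycles after 5 years: {cycles_both:.0f} (both factors), "
      f"{cycles_wear_only:.0f} (1.4 only)")
print(f"margin vs 10,000 cycles: {10000 / cycles_both:.1f}x to "
      f"{10000 / cycles_wear_only:.1f}x")
```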
And 100GB a day is a lot. Based on my personal usage of web browsing, e-mail and kernel development (multiple kernel compiles a day), I tend to average between 6 and 10GB a day. When Intel surveyed system integrators (i.e., the likes of Dell, HP, et al.), the number they came up with as the maximum amount a "reasonable" user would tend to write in a day was 20GB. 100GB is 10 times my maximum observed write, and 5 times the maximum estimated amount that a typical user might write in a day.
For those of you who are Linux users, you can measure this number yourselves. Just use the iostat command, which will return the number of 512 byte sectors written since the system was booted. Take that number, and divide it by 2097152 (2*1024*1024) to get gigabytes. Then take that number and divide it by the number of days since your system was booted to get your GB/day figure.
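The same calculation can be done without iostat by reading the kernel's counters directly, as in this sketch. It relies on /proc/diskstats, where the tenth field of each line is the number of 512-byte sectors written since boot, and on /proc/uptime. The function names and the default device name "sda" are assumptions for illustration.

```python
def sectors_written(diskstats_text: str, device: str) -> int:
    """Return the sectors-written counter for one device.

    In /proc/diskstats the tenth whitespace-separated field on each line
    is the number of 512-byte sectors written since boot.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)


def gb_per_day(sectors: int, uptime_seconds: float) -> float:
    """Apply the recipe: sectors / (2*1024*1024) gives GB, then divide by days."""
    gigabytes = sectors / 2097152
    days = uptime_seconds / 86400
    return gigabytes / days


def measure(device: str = "sda") -> float:
    """Read the live counters on a Linux system (device name is an assumption)."""
    with open("/proc/diskstats") as f:
        stats = f.read()
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    return gb_per_day(sectors_written(stats, device), uptime)
```

Calling `measure("sda")` (or whatever your disk is called) shortly after boot will give a noisy figure; the number is more meaningful once the machine has been up for a few days of normal use.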
I have a Google+ post where I've posted my latest updates to this still-developing story:
https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7
Also, I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.
So before I tried agitating for programmers to fix their buggy applications, I had already implemented both the heuristic that XFS uses (if you truncate a file descriptor, add an implicit fsync on the close of that fd), and in addition I had implemented another heuristic (if you rename on top of an existing file, fsync the source file of the rename). This was to work around buggy applications, and as you can see, ext4 does even more than XFS does.
At the end of the day, though, the heuristic can sometimes get things wrong, and sometimes the heuristic will be too aggressive in forcing fsync()'s when it's not really necessary, which is why it's good to at least try to educate application programmers about something which even you agree shouldn't be a new thing.
(For example, if you don't fsync, and you want to run your application on another OS, like say, Solaris, you will be very sad.)
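For anyone wondering what "doing it right" looks like on the application side, here is a minimal sketch of the usual pattern: write the new contents to a temporary file, fsync it, and only then rename it over the old file. The function name atomic_replace is made up for illustration, and the directory fsync at the end is an extra step that assumes a POSIX system.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves the old or the new contents."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer down to the kernel
            os.fsync(f.fileno())   # force the data blocks to stable storage
        os.replace(tmp_path, path)  # rename(2) on POSIX: atomic swap
    except BaseException:
        # Remove the temporary file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is durable (POSIX only)
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The key point is the ordering: the fsync happens before the rename, so the application is not relying on any file system heuristic to get the data on disk first.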
But it wasn't backside covering. Although most people don't seem to realize it, FIRST I added the heuristics to work around the buggy code, and THEN I agitated for people to fix their d*mn code. But application programmers don't like being told that they are wrong, so this seems to be a case of "blame/shoot the messenger" --- with me having been cast into the role of the messenger.
I'm aware that ext4 can run without a journal, but isn't that functionally equivalent to leaving it as ext2?
With ext4 you get the benefits of extents, delayed allocation, and other new-to-ext4 features. You also get directory hash trees, which were introduced in ext3 and are therefore not in ext2. Running without the journal means you have to run a full fsck after an unclean shutdown, but you still get all of the new features and performance improvements of ext4.
Stop blaming the applications for a filesystem problem Ted. The excuse doesn't wash no matter how many times you use it, and no, XFS does not have it.
http://en.wikipedia.org/wiki/XFS#Delayed_allocation
Any other questions? At the very least the applications are non-portable in the sense that they were depending on behavior not guaranteed by POSIX. XFS, btrfs, ZFS, and many if not most modern file systems do delayed allocation. It's one of the basic file system tricks to improve performance.
Read the answer to the FAQ very carefully. In fact, they agree with me:
In certain cases it might make sense to turn off barriers and disable write caches, if you are writing huge amounts of bulk data and very little metadata in a RAID array --- and that is what XFS is optimized for. But they didn't say anything which contradicted what I said, although the conclusions might have been a little confusing and not necessarily applicable in workloads other than XFS's original design point of really big RAID arrays to support writing really big data sets.
Jeff,
You may be correct in saying that if you compare the guts of Soft Updates with that of (say) the JBD/JBD2 layer in Linux, which is what is responsible for handling the physical block journalling for ext3/ext4, the complexities involved might not be that different.
However, the difference comes when someone adds ACL support, or some other fs feature. When you are using physical block journalling, all you need to know is how many blocks a particular fs operation needs to dirty. That's it! With Soft Updates, you need to understand dependency diagrams and write code to implement rollbacks, etc. The person who is implementing the file system feature has to do many more things.
Now there are certainly downsides to doing physical block journalling. If you have workloads which are very high in metadata operations, physical block journalling will hurt. On the other hand, it's not clear how common such workloads are (although you can certainly find benchmarks that will stress that particular usage pattern). And in the face of hard drive errors, physical block journals can sometimes be better at recovering from certain failures than logical journalling or soft updates.
Like many things, there are always tradeoffs around, and if the goal is to play the "my file system has a longer d*ck" game, it's almost always possible to find some benchmark which "proves" that one file system is better than another. Yawn...
So Canonical has never reported this bug to LKML or to the linux-ext4 list as far as I am aware. No other distribution has complained about this > 512MB bug, either. The first I heard about it is when I scanned the Slashdot comments.
Now that I know about it, I'll try to reproduce it with an upstream kernel. I'll note that in 9.04, Ubuntu had a bug which, as far as I know, must have been caused by their screwing up some patch backports. Only Ubuntu's kernel had a bug where rm'ing a large directory hierarchy would have a tendency to cause a hang. No one was able to reproduce it on an upstream kernel.
I will say that I don't ever push patches to Linus without running them through the XFS QA test suite. (Which is now generalized enough so it can be used on a number of file systems other than just XFS). If it doesn't have a "write a 640 MB file and make sure it isn't corrupted" test, we can add it and then all of the file systems which use the XFS QA test suite can benefit from it.
(I was recently proselytizing the use of the XFS QA suite to some Reiserfs and BTRFS developers. The "competition" between file systems is really more of a fanboy/fangirl thing than at the developer level. In fact, Chris Mason, the head btrfs developer, has helped me with some tricky ext3/ext4 bugs, and in the past couple of years I've been encouraging various companies to donate engineering time to help work on btrfs. With the exception of Hans Reiser, who has in the past accused me of trying to actively sabotage his project --- not true as far as I'm concerned --- we all are a pretty friendly bunch and work together and help each other out as we can.)
So I'm an engineer, and not an academic. I'm not trying to get a Ph.D. The whole Keep it Simple, Stupid principle is an important one, especially as you say, "Journalling and Soft Updates have similar performance characteristics."
If sometimes Journalling posts better benchmarks, and sometimes Soft Updates produces better results, but Soft Updates is hideously more complex, thus inhibiting new features such as ACL's and Extended Attributes (which appeared in BSD much later than Linux, and I think Soft Updates made it much harder to find people capable of extending the file system) --- then the choice of the simpler technology seems to be obvious. The performance gains are a toss up, and using a hideously complex algorithm for its own sake is only good if you are an academic gunning for a Ph.D. thesis or a paper publication, or if you are trying to ensure job security by implementing something so hard to maintain that only you and a few other people can hack it.
Journaling, and every other filesystem, has exactly the same problem. If consistency is required, YOU MUST DISABLE THE CACHE, unless it is battery-backed, or you are willing to depend on your UPS. This is the penalty we take for devices which lie to the OS about flush operations and the like.
Yes, there were, in the bad old days, devices which lied when the OS sent a flush cache command, and in order to get a better Winbench score, they would cheat and not actually flush the cache. But that hasn't been true for quite a while, even for commodity desktop/laptop drives. It's quite easy to test; you just time how many single-sector writes, each followed by a cache flush command, you can send per second. In practice, it won't be more than, oh, 50-60 write barriers per second. In general, if you use a reputable disk drive, it supports real cache flush commands. My personal favorites are Seagate Momentus drives for laptops, and I can testify to the fact that they all handle cache flush commands correctly; I have quite a collection and it's really not hard to test.
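A rough userspace version of that test can be sketched as follows. It assumes that fsync() on your file system actually results in a cache flush command being sent to the drive (which depends on the file system and its barrier settings), and the function name flushes_per_second is made up for illustration. Numbers from tmpfs, SSDs, virtual disks, or controllers with battery-backed caches will look very different from a rotating laptop drive.

```python
import os
import tempfile
import time


def flushes_per_second(directory: str, duration: float = 1.0, block: int = 4096) -> float:
    """Rewrite one block and fsync() it in a loop; return completed flushes per second."""
    buf = b"\xa5" * block
    fd, path = tempfile.mkstemp(dir=directory)
    count = 0
    start = time.monotonic()
    try:
        while time.monotonic() - start < duration:
            os.pwrite(fd, buf, 0)   # overwrite the same block each time
            os.fsync(fd)            # ask the kernel to flush it through to the device
            count += 1
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed
```

If a rotating drive reports thousands of flushes per second here, either the flush is not reaching the drive or the drive is not honoring it.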
The big difference between journalling and soft updates is we can batch potentially hundreds of metadata updates into a single journal transaction, and send down a single write barrier every few seconds. The journal commit is an all-or-nothing sort of thing, but that gives us reliability _and_ performance.
The problem with soft updates is that the relative ordering of most (if not all) metadata writes is important. And putting a write barrier between each metadata write is Slow And Painful. Yes, you can disable the write cache, but then you give up a huge amount of performance as a result. With journaling we can get the performance benefits of the write cache, but we only have to pay the cost of enforcing write ordering through the barrier once every few seconds.
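To put rough numbers on that, here is a toy cost model. It only counts time spent waiting on barriers, using the 50 per second figure from above as an assumed rate, and it ignores the cost of the writes themselves. The count of 500 metadata updates is an arbitrary example value, and a real journal commit may involve more than one flush.

```python
def barrier_time(n_barriers: int, barriers_per_second: float = 50.0) -> float:
    """Seconds spent waiting on cache flushes at a given sustainable rate."""
    return n_barriers / barriers_per_second


metadata_updates = 500  # arbitrary example workload

# One barrier after every ordered metadata write
per_write = barrier_time(metadata_updates)

# All updates batched into a single journal commit with one barrier
batched = barrier_time(1)

print(f"barrier per write:  {per_write:.2f} s")
print(f"one journal commit: {batched:.2f} s")
```

Under these assumptions the per-write approach spends 10 seconds on barriers where the batched commit spends a fiftieth of a second, which is the whole argument in two lines of arithmetic.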
Of course, there are workloads where soft updates plus a disabled write cache might be superior. If you have a very metadata-intensive workload that also happens to call fsync() between nearly every metadata operation, then it would probably do better than a physical block journalling solution that used barrier writes but ran with an enabled write cache. But in the general case, if you compare a more normal workload where fsync()'s aren't happening _that_ often, and compare physical block journalling with a write cache and barrier ops, with a Soft Updates approach with the write cache disabled, I'm pretty sure the physical block journalling approach will end up benchmarking better.
>I mount these read-only in the interests of security, but that means, of course,
>that I can't have journalling on them, which precludes the use of ext3 or 4.
#1. you can mount ext3 file systems read-only. The journal doesn't preclude a ro mount.
#2. ext4 supports running without a journal. Google engineers contributed that code to ext4 last year.
So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications that don't use fsync() will also lose information after a buggy proprietary Nvidia video driver crashes your machine, regardless of whether you are using XFS or ext4.
If you are talking about the change to _ext3_ to use data=writeback, that was a change that Linus made, not me, and ext4 has always defaulted to data=ordered. Linus thought that since the vast majority of Linux machines are single-user desktop machines, the performance hit of data=ordered, which is designed to prevent exposure of uninitialized data blocks after a crash, wasn't worth it. I and other file system engineers disagreed, but Linus's kernel, Linus's rules. I pushed a patch to ext3 which makes the default a config option, and as far as I know the enterprise distros plan to use this config option to keep the defaults the same as before for ext3.
Since it was my choice, I actually changed the defaults for ext4 to use barrier=1, which Andrew Morton vetoed for ext3 because, again, he didn't think it was worth the performance hit. But with ext4, the benefits of delayed allocation and extents are so vast that they completely dominate the performance hit of turning on write barriers. That is where most of the performance benefits of ext4 come from, and it is very much a huge step forward compared to ext3.
So with respect, you don't know what you are talking about.
-- Ted
So there's a major problem with Soft Updates, which is that you can't be sure that data has hit the disk platter and is on stable store unless you issue a barrier operation, which is very slow. What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption! The disk drive, especially modern ones with large caches, can reorder writes which are sent to the disk, sometimes (with the right pathological workloads) for minutes at a time. You won't notice this problem if you just crash the kernel, or even if you hit the reset button. But if you pull the plug or otherwise cause the system to drop power, data in the disk's write cache won't necessarily be written to disk. The problem that we saw with journal checksums and ext4 only showed up on a power drop, because there was a missing barrier operation, so this is not a hypothetical consideration.
In addition, if you have a very heavy write workload, the Soft Updates code will need to burn a fairly large amount of memory tracking the dependencies and burn quite a bit of CPU figuring out which dependencies need to be rolled back. I'm a bit suspicious of how well they perform and how much CPU they steal from applications --- which granted, may not show up in benchmarks which are disk bound. But if the applications or the large number of jobs running on a shared machine are trying to use lots of CPU as well as disk bandwidth, this could very much be an issue.
BTW, while I was doing some quick research for this reply, it seems that NetBSD is about to drop Soft Updates in favor of a physical block journaling technology (WAPBL), according to Wikipedia. They didn't give a reference for this, nor did they say why NetBSD was planning on dropping Soft Updates, but there is a description of the replacement technology here: http://www.wasabisystems.com/technology/wjfs. But if Soft Updates is so great, why is NetBSD replacing it and why did FreeBSD add a file system journaling alternative to UFS?
Actually FFS with Soft Updates is only about preserving file system metadata so they don't require fsck's. BSD with FFS and Soft Updates still pushes out meta-data after 5 seconds, and data blocks after 30 seconds. Soft Updates only worries about metadata blocks, and not data blocks.
In fact, after a crash with FFS you can sometimes access uninitialized data blocks that contain data from someone else's mail file, or p0rn stash. This was the problem which ext3's data=ordered was trying to solve; unfortunately it does so by making fsync==sync, which also had the unfortunate side effect of making people think that fsync()'s always had to be slow. It doesn't have to be, if it's properly implemented --- but I'll be the first to admit that ext3 didn't do a proper job.
It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.
Patches to work around buggy applications which don't call fsync() were around long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.
And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc ---- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.
If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.
Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.
It also depends on what type of filesystem you use. A journaling filesystem like ext3 can wear down a disk a lot faster than a non-journaling filesystem.
Not true. If you have a decent SSD that doesn't have Write Amplification problems (such as the X25-M), the extra overhead of journalling really isn't that bad. I wrote about this quite recently on my blog.
So interested people want to know --- how do you get the "insider" information from an X25-M (ie., total amount of writes written, and number of cycles for each block of NAND)?
I've added this capability to ext4, and on my brand-spanking new X25-M (paid for out of my own pocket because Intel was too cheap to give one to the ext4 developer :-), I have:
<tytso@closure> {/usr/projects/e2fsprogs/e2fsprogs} [maint]
568% cat /sys/fs/ext4/dm-0/lifetime_write_kbytes
51960208
Or just about 50GB written to the disk (I also have a /boot partition which has about half a GB of writes to it).
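The conversion from the kilobyte counter to gigabytes is simple enough to script. This is a small sketch; the helper names are made up, and the default device name dm-0 is just the one from my machine, so substitute your own.

```python
def kbytes_to_gib(kbytes: int) -> float:
    """Convert a count of kilobytes (1024 bytes each) to GiB."""
    return kbytes / (1024 * 1024)


def read_lifetime_write_kbytes(device: str = "dm-0") -> int:
    """Read ext4's lifetime write counter for a device from sysfs (Linux only)."""
    path = f"/sys/fs/ext4/{device}/lifetime_write_kbytes"
    with open(path) as f:
        return int(f.read().strip())


written = kbytes_to_gib(51960208)
print(f"{written:.1f} GiB written")
```

For the counter value shown above this works out to about 49.6, i.e. "just about 50GB".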
But it would be nice to be able to get the real information straight from the horse's mouth.
Anyways, writing zeros, or writing something else sequentially should essentially be the same.
No, writing sequentially is not the same as an ATA TRIM command, since the X25-M can't reuse the blocks for real data. It might (or might not) help the internal fragmentation of the X25-M's internal LBA redirection table --- but given that the PC Perspectives article pointed out that when things got bad, even a complete write pass across the entire disk was not sufficient to restore performance, I doubt it.
This makes sense, actually; without an ATA trim command, if you write the entire disk, the X25-M won't have much in the way of spare room in order for it to do its garbage collection/defragmentation operation. All it will have is the difference between 80 (real) GB (or GiB's for people who like that notation) and 80 (hd marketing) GB's. And apparently that is not enough.
I've had some people suggest that reserving a partition with a few gigs and never using it helps, since that provides some extra room for the X25-M to recover; but I don't have anything authoritative.
But back to the original point, what we really need is a way to tell the disk, "we don't care about the contents of the blocks any more". It *might* be that writing some magic pattern, whether all zeros or all ones, would do this (and in fact, all ones makes more sense, since an erased flash memory cell returns '1', not '0'). But the key question is whether or not the SSD's firmware treats this as "ok to reuse". And for that we need a definitive answer from Intel.
Can you give me a URL or citation from someone official at Intel who has said this? As near as I can tell, Intel has been very tight-lipped about what the X25-M does internally.
I use 1GB for /boot because I'm a kernel developer and I end up experimenting with a large number of kernels (yes, on my laptop --- I travel way too much, and a lot of my development time happens while I'm on an airplane). In addition, SystemTap requires compiling kernels with debuginfo enabled, which makes the resulting kernels gargantuan --- it's actually not that uncommon for me to fill my /boot partition and need to garbage collect old kernels. So yes, I really do need 1GB for /boot.
As far as LVM, of course I use more than a single volume; separate LV's get used for test filesystems (I'm a filesystem developer, remember), but more importantly, the most important reason to use LVM is because it allows you to take snapshots of your live filesystem and then run e2fsck on the snapshot volume --- if the e2fsck is clean you can then drop the snapshot volume, and run "tune2fs -C 0 -T now /dev/XXX" on the file system. This eliminates boot-time fsck's, while still allowing me to make sure the file system is consistent. And because I'm running e2fsck on the snapshot, I can be reading e-mail or browsing the web while the e2fsck is running in the background. LVM is definitely worth the overhead (which isn't that much, in any case).
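The snapshot-and-check routine can be sketched as a short script. Only the tune2fs invocation is taken from the description above; the lvcreate, e2fsck and lvremove flags, the snapshot name and the 1G snapshot size are my assumptions about a reasonable setup, and the volume group and LV names are placeholders. The runner defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def snapshot_fsck_commands(vg: str, lv: str, snap_size: str = "1G") -> List[List[str]]:
    """Build the four commands: snapshot, check, drop snapshot, reset fsck counters."""
    snap = f"{lv}-fsck-snap"          # hypothetical snapshot name
    origin = f"/dev/{vg}/{lv}"
    snap_dev = f"/dev/{vg}/{snap}"
    return [
        ["lvcreate", "-s", "-L", snap_size, "-n", snap, origin],  # take the snapshot
        ["e2fsck", "-fy", snap_dev],                   # forced check of the snapshot only
        ["lvremove", "-f", snap_dev],                  # drop the snapshot
        ["tune2fs", "-C", "0", "-T", "now", origin],   # reset mount count and last-check time
    ]


def run_snapshot_fsck(vg: str, lv: str, dry_run: bool = True) -> bool:
    """Run the check; reset the fsck counters only if the snapshot was clean."""
    create, fsck, drop, reset = snapshot_fsck_commands(vg, lv)
    if dry_run:
        for cmd in (create, fsck, drop, reset):
            print(" ".join(cmd))
        return True
    subprocess.run(create, check=True)
    try:
        clean = subprocess.run(fsck).returncode == 0   # e2fsck exits 0 when no errors are found
    finally:
        subprocess.run(drop, check=True)               # always remove the snapshot
    if clean:
        subprocess.run(reset, check=True)
    return clean
```

Calling run_snapshot_fsck("vg0", "root") prints the four commands without touching anything; pass dry_run=False (as root) to actually run them.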
It's not obvious to me that X25-M treats a block that has been zero'ed out as an "unallocated block". It could do this, but it's not at all guaranteed that it does this. Do you know for certain (via an Intel specification sheet) that writing all ZERO's is the equivalent of an ATA TRIM?
Flash using MLC cells is good for 10,000 write cycles; flash using SLC cells is good for 100,000 write cycles, and is much faster from a write perspective. The key is write amplification; if you have a flash device with a 128k erase block size, in the worst case, assuming the dumbest possible SSD controller, each 4k singleton write might require erasing and rewriting a 128k erase block. In that case, you would have a write amplification factor of 32. Intel claims that with their advanced LBA redirection table technology, they have a write amplification of 1.1, with a wear-leveling overhead of 1.4. So if these numbers are to be believed, on average, over time, a 4k write might actually cost a little over 6k of flash write. That is astonishingly good.
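The arithmetic behind those two numbers, as a quick sketch (the function names are just for illustration):

```python
def worst_case_amplification(erase_block_kb: float, write_kb: float) -> float:
    """Write amplification if every small write forces a full erase block rewrite."""
    return erase_block_kb / write_kb


def flash_cost_kb(write_kb: float, write_amp: float, wear_level: float) -> float:
    """Flash actually written for one host write, given the two overhead factors."""
    return write_kb * write_amp * wear_level


print(worst_case_amplification(128, 4))           # dumbest possible controller
print(round(flash_cost_kb(4, 1.1, 1.4), 2))       # Intel's claimed factors
```

The first line gives the factor of 32, and the second gives 6.16k of flash written per 4k host write.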
The X25-M uses MLC technology, and is rated for a life of 5 years writing 100GB a day. In fact, if you have 80GB worth of flash, and you write 100GB a day, with a write amplification and wear-leveling overhead of 1.1 and 1.4, respectively, then over 5 years you will have used approximately 3200 write cycles. Given that MLC technology is good for 10,000 write cycles, that means Intel's specification has a factor of 3 safety margin built into it. (Or put another way, the claimed write amplification factors could be three times worse and they would still meet their 100GB/day, 5 year specification.)
And 100GB a day is a lot. Based on my personal usage of web browsing, e-mail and kernel development (multiple kernel compiles a day), I tend to average between 6 and 10GB a day. When Intel surveyed system integrators (i.e., like Dell, HP, et al.), the number they came up with as the maximum amount a "reasonable" user would tend to write in a day was 20GB. 100GB is 10 times my maximum observed write, and 5 times the maximum estimated amount that a typical user might write in a day.
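Here is that lifetime calculation as a sketch you can rerun with your own numbers. The exact result depends on whether you count the flash as 80 decimal GB or 80 GiB, so both are computed; which convention the rounded figure above corresponds to is a guess on my part.

```python
def write_cycles(daily_gb: float, years: float, flash_gb: float,
                 write_amp: float = 1.1, wear_level: float = 1.4) -> float:
    """Average erase/write cycles per flash block over the drive's rated life."""
    total_flash_writes_gb = daily_gb * 365 * years * write_amp * wear_level
    return total_flash_writes_gb / flash_gb


decimal = write_cycles(100, 5, 80)                 # flash counted as 80 GB
binary = write_cycles(100, 5, 80 * 2**30 / 1e9)    # flash counted as 80 GiB

for label, cycles in (("80 GB", decimal), ("80 GiB", binary)):
    print(f"{label}: {cycles:.0f} cycles, margin {10000 / cycles:.1f}x")
```

Either way the result is in the low-to-mid 3000s, which leaves roughly a factor of 3 against the 10,000 cycle rating.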
For those of you who are Linux users, you can measure this number yourselves. Just use the iostat command, which will return the number of 512 byte sectors written since the system was booted. Take that number, and divide it by 2097152 (2*1024*1024) to get gigabytes. Then take that number and divide it by the number of days since your system was booted to get your GB/day figure.
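If you would rather script it than do the division by hand, here is a sketch. Instead of parsing iostat output it reads /proc/diskstats and /proc/uptime directly, which as far as I know are where iostat gets its numbers on Linux; that substitution, the helper names, and the example device name are my own choices.

```python
def gb_per_day(sectors_written: int, uptime_seconds: float) -> float:
    """Convert 512-byte sectors written since boot into GB written per day."""
    gigabytes = sectors_written / 2097152    # 2*1024*1024 sectors per GB
    days = uptime_seconds / 86400
    return gigabytes / days


def read_sectors_written(device: str) -> int:
    """Sectors written since boot for one device, from /proc/diskstats (Linux only)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])        # tenth field: sectors written
    raise ValueError(f"device {device!r} not found in /proc/diskstats")


def read_uptime_seconds() -> float:
    """Seconds since boot, from /proc/uptime (Linux only)."""
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])
```

For example, gb_per_day(read_sectors_written("sda"), read_uptime_seconds()) gives the figure for a disk named sda; substitute your own device name.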