Slashdot Mirror


Denial-of-Service Attack Found In Btrfs File-System

An anonymous reader writes "It's been found that the Btrfs file-system is vulnerable to a Hash-DOS attack, a denial-of-service attack caused by hash collisions within the file-system. Two DOS attack vectors were uncovered by Pascal Junod that he described as causing astonishing and unexpected success. It's hoped that the security vulnerability will be fixed for the next Linux kernel release." The article points out that these exploits require local access.
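
A minimal sketch of why hash collisions turn into a denial of service, assuming a toy chained hash table rather than anything from Btrfs itself. The bucket count, the `evilN` naming scheme, and the use of `zlib.crc32` reduced to 64 buckets as a stand-in hash are all assumptions made for illustration; the real attack targets the file system's own name hash.

```python
import zlib

BUCKETS = 64  # hypothetical table size, chosen only for the demo


def bucket_of(name: str) -> int:
    """Stand-in hash: CRC32 of the name, reduced to a bucket index."""
    return zlib.crc32(name.encode()) % BUCKETS


def insert_cost(names) -> int:
    """Insert distinct names into a chained table; count chain comparisons."""
    buckets = [[] for _ in range(BUCKETS)]
    comparisons = 0
    for name in names:
        chain = buckets[bucket_of(name)]
        comparisons += len(chain)  # a full scan of the chain before appending
        chain.append(name)
    return comparisons


# Ordinary names land in many different buckets.
benign = [f"file{i}" for i in range(200)]

# Attacker-chosen names: keep only candidates that land in bucket 0.
colliding = []
i = 0
while len(colliding) < 200:
    candidate = f"evil{i}"
    if bucket_of(candidate) == 0:
        colliding.append(candidate)
    i += 1

benign_cost = insert_cost(benign)
colliding_cost = insert_cost(colliding)
```

With every name in one chain, the insert cost grows quadratically with the number of names, which is the general shape of a Hash-DoS regardless of the specific hash involved.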

  1. Who ported btrfs to DOS? by Nimey · · Score: 4, Funny

    and should we give him a medal or lynch him?

    --
    Hail Eris, full of mischief...

    E pluribus sanguinem
    1. Re:Who ported btrfs to DOS? by macraig · · Score: 5, Funny

      Do I have to choose? Can I hang a medal on him, and then hang him? I'll make the medal 20 pounds to speed up the lynching.

    2. Re:Who ported btrfs to DOS? by maxwell+demon · · Score: 3, Informative

      DOS = Disk Operating System
      DoS = Denial of Service

      --
      The Tao of math: The numbers you can count are not the real numbers.
    3. Re:Who ported btrfs to DOS? by byornski · · Score: 3, Funny
    4. Re:Who ported btrfs to DOS? by Pf0tzenpfritz · · Score: 1

      DOS++; if (!DOS==TRES) { return ("unit test failed"); }

      --
      Oh, the beautiful gloss of greality!
  2. Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 5, Interesting

    btrfs is a step in the right direction, but even now, Linux does not have production-level deduplication (which even Windows has, for crying out loud), encryption, snapshots, or something even close to supplanting LVM2.

    I just got out of a meeting at my job because we are replacing some old large servers... and because Linux has no stable filesystem with enterprise features, looks like things are either going to Windows, or perhaps Solaris x86 (which is expensive.)

    This doesn't mean to suck Sun's teat for ZFS access... but at least try to come close to what even NTFS or even ReFS offers...

    1. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 1

      What's Sun?

    2. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 5, Informative

      ZFS on FreeBSD or FreeNAS is great. Easily saturates gigE with a simple mirror of recent 7200rpm disks. It scales up from there, and FreeBSD is pretty rock solid.

    3. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 4, Interesting

      btrfs is a step in the right direction, but even now, Linux does not have production-level deduplication (which even Windows has, for crying out loud), encryption, snapshots, or something even close to supplanting LVM2.

      I just got out of a meeting at my job because we are replacing some old large servers... and because Linux has no stable filesystem with enterprise features, looks like things are either going to Windows, or perhaps Solaris x86 (which is expensive.)

      This doesn't mean to suck Sun's teat for ZFS access... but at least try to come close to what even NTFS or even ReFS offers...

      Hear hear! Backup admin here, just want to add before the unwashed masses of armchair Linux admins show up, one example of an enterprise filesystem feature is the NTFS change journal. It makes the file system scan as part of an incremental backup run in constant time.

      It's sad on other systems with large numbers of files to schedule subdirectories for different times of day to deal with scanning overhead.
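
      A rough sketch of the idea behind a change journal, assuming a toy in-memory file store. The class and method names are hypothetical and do not reflect the NTFS API; the point is only that an incremental backup can read the journal entries written since its last checkpoint instead of walking every file.

```python
class JournaledFS:
    """Toy in-memory file store that appends every change to a journal."""

    def __init__(self) -> None:
        self.files = {}
        self.journal = []  # list of (sequence_number, path)

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data
        self.journal.append((len(self.journal) + 1, path))

    def changed_since(self, last_seq: int):
        """Return the paths changed after last_seq and the new checkpoint."""
        entries = self.journal[last_seq:]
        paths = sorted({path for _, path in entries})
        return paths, len(self.journal)


fs = JournaledFS()
for i in range(1000):
    fs.write(f"/data/file{i}", b"v1")

# A full backup records where the journal ended.
_, checkpoint = fs.changed_since(0)

# Two files change afterwards.
fs.write("/data/file7", b"v2")
fs.write("/data/file42", b"v2")

# The incremental run looks only at the two new journal entries.
changed, new_checkpoint = fs.changed_since(checkpoint)
```

      In this sketch the work done by the incremental run depends on how many changes were journaled, not on how many files exist.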

    4. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 5, Informative

      NTFS doesn't have snapshots. Instead it relies on volume shadow copies, with known severe performance artifacts caused by needing to move snapshotted data out of the way when new writes come in. Btrfs, like ZFS and Netapp's WAFL, use a far more efficient copy-on-write strategy that avoids the write penalty. The takeaway: I would not go so far as to claim Microsoft has an enterprise-worthy solution either. If you want something with industrial strength dedup, snapshots and fault tolerance, you won't be getting it from Microsoft.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    5. Re:Can we get a real Linux filesystem, please? by smash · · Score: 2

      Data integrity for one?

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    6. Re:Can we get a real Linux filesystem, please? by grumbel · · Score: 3, Informative

      I have seen the userlevel ZFS crash multiple times, it's also slow as hell. It's still worth it if you are short on storage and want to reduce the size of your backup, but I wouldn't exactly call it ready for production.

    7. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 2, Interesting

      Wouldn't it be cheaper and just as effective to use FreeBSD or FreeNAS for your data? If you're considering either Windows or Solaris then obviously you don't need a specific operating system. I would think FreeBSD (or even ZFS on Linux) would suit your purposes better (and with less expense) than Windows or Solaris.

    8. Re:Can we get a real Linux filesystem, please? by maz2331 · · Score: 5, Informative

      ZFS on Linux does exist as a kernel module that is pretty stable and works well. http://zfsonlinux.org/ -- it was put out by Lawrence Livermore National Lab, but can't be included with the kernel distros due to GPL / CDDL license compatibility issues.

    9. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 1

      That's what you should expect from your storage array. And that is what you get with real storage arrays.

      A filesystem level approach to the problem can only be a bandaid, at best part of a larger solution.

    10. Re:Can we get a real Linux filesystem, please? by dbIII · · Score: 3, Informative

      Kernel level probably is ready, but not on 32bit (big hassles there but probably not a big deal to most) and on 64 bit there are some memory usage problems and performance seems to suck when there's a dozen or so hosts keeping connections to files on ZFS open via NFS at the same time. There's still a way to go before ZFS on linux gets to where it is on FreeBSD but it's still early days, and for many usage patterns it looks like it is ready for production.

    11. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 2, Informative

      Linux has production level encryption, snapshots, and LVM2. What are you talking about?

      Unless you have very specific uses, deduplication should be done at your storage array really. It's not a high priority to implement in the filesystem. (No, your anecdote does not make it a high priority).

    12. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 1

      I have seen the userlevel ZFS crash multiple times, it's also slow as hell. It's still worth it if you are short on storage and want to reduce the size of your backup, but I wouldn't exactly call it ready for production.

      I think parent is talking about this, not the userlevel FUSE-based ZFS:
      http://zfsonlinux.org/

    13. Re:Can we get a real Linux filesystem, please? by WWJohnBrowningDo · · Score: 2

      Did you guys look at FreeBSD?

    14. Re:Can we get a real Linux filesystem, please? by Marxdot · · Score: 1

      Why should deduplication and snapshots (and even encryption, I suppose) be done by filesystems themselves? Why require a repetition of effort in implementing every filesystem? Also, ZFS is an insane thing written by people who don't seem to understand that keeping a good separation of concerns can lead to a rather slick set of general tools that can be used on almost any fs.

      Oh, right, 'enterprise features'. That certainly sets the alarm bells ringing.

    15. Re:Can we get a real Linux filesystem, please? by blade8086 · · Score: 1

      LVM has snapshots and DM has encryption.

      And since when is deduplication a 'critical' enterprise feature?

      e.g. who else has it other than ZFS in the unix world without having an expensive addon product etc?

      (other than DragonFlyBSD's hammer, which unfortunately corporate weenies have testicles too small to deploy)

      maybe critical for your application - but this doesn't mean it's mega-lagging behind.

    16. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 4, Insightful

      NTFS doesn't have snapshots. Instead it relies on volume shadow copies, with known severe performance artifacts caused by needing to move snapshotted data out of the way when new writes come in. Btrfs, like ZFS and Netapp's WAFL, use a far more efficient copy-on-write strategy that avoids the write penalty. The takeaway: I would not go so far as to claim Microsoft has an enterprise-worthy solution either. If you want something with industrial strength dedup, snapshots and fault tolerance, you won't be getting it from Microsoft.

      What nonsense. VSS is the snapshot solution for NTFS, and of course it uses copy-on-write. Microsoft VSS backup architecture is years ahead of Linux... LVM is kind of cool but if you have a single database spread across multiple LV's then you can't snapshot them all as an atomic operation so it becomes useless. MS VSS does this, and always has.

      I'm normally a Linux fanboi but when you spout rubbish like this I have no hesitation in correcting you.

    17. Re:Can we get a real Linux filesystem, please? by Agent+ME · · Score: 2

      If snapshots are handled by the filesystem, then it could be possible to snapshot a specific directory or file rather than a whole partition, for example. Snapshots in the filesystem also prevent changes to space that was free when the snapshot was taken from being unnecessarily remembered.

    18. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 3, Informative

      Tried to find some more information on this. First discovery: VSS stands for "Volume Shadow copy Service", not "Visual SourceSafe", as was my first association. :)

      AFAICT he's saying pretty much what Microsoft is saying:

      When a change to the original volume occurs, but before it is written to disk, the block about to be modified is read and then written to a "differences area", which preserves a copy of the data block before it is overwritten with the change. Using the blocks in the differences area and unchanged blocks in the original volume, a shadow copy can be logically constructed that represents the shadow copy at the point in time in which it was created.

      The disadvantage is that in order to fully restore the data, the original data must still be available. Without the original data, the shadow copy is incomplete and cannot be used. Another disadvantage is that the performance of copy-on-write implementations can affect the performance of the original volume.

      Do you have a newer reference?

    19. Re:Can we get a real Linux filesystem, please? by LordLimecat · · Score: 3, Informative

      FAT32 is going to be faster than a LOT of filesystems precisely because it lacks features like dedup, any notion of real ACLs, and, oh, I don't know, data integrity. That's why if you want a really fast RAMDisk, you don't use NTFS or ReFS, you use FAT16 or FAT32.

    20. Re:Can we get a real Linux filesystem, please? by Anonymous Coward · · Score: 1

      This is a tricky issue. If you keep all old file data in its original sectors and write changes in new places, your files get fragmented to hell. Only your original snapshot is contiguous, while your current data is scattered about your disk. This may work fine if you have dozens of spindles making up your volume, or for an SSD, but it's not going to work for a regular HDD.

      What you'll end up with is fast write performance and horrible read performance. Since most files are read far more often than they're written, it's generally better to make the current data contiguous and the rarely-used snapshots fragmented.

      Of course, it's probably best to write the new data where it's convenient and later on do some defragmentation to put the data where it's fastest to read.
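
      The fragmentation argument above can be sketched with a toy block map. The physical block numbers and the free-space location are made up for the example; it only counts how many contiguous runs (extents) the live file occupies under each strategy.

```python
def count_extents(block_map) -> int:
    """Count runs of physically consecutive blocks in a logical->physical map."""
    extents = 1
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


# A file of 8 logical blocks laid out contiguously at physical blocks 100..107.
original = list(range(100, 108))

# Writing changes in new places: rewritten logical blocks 3 and 4 move to
# free space at physical blocks 500 and 501 (hypothetical location), while
# the snapshot keeps pointing at the old, contiguous blocks.
redirected = list(original)
next_free = 500
for logical in (3, 4):
    redirected[logical] = next_free
    next_free += 1

# Saving the old contents elsewhere and overwriting in place: the live
# file keeps its original layout.
in_place = list(original)

extents_original = count_extents(original)
extents_redirected = count_extents(redirected)
extents_in_place = count_extents(in_place)
```

      One small rewrite in the middle splits the live file into three extents in the first case and leaves it as one extent in the second, which is the read-side trade-off being described.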

      dom

    21. Re:Can we get a real Linux filesystem, please? by guruevi · · Score: 1

      Solaris and its derivatives can be had for free. You don't HAVE to buy it, and its derivatives like OpenIndiana are very stable.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    22. Re:Can we get a real Linux filesystem, please? by smash · · Score: 1

      1. No storage array does it properly. 2. You can BUILD a ZFS storage array with de-dup, compression, self-healing, etc. for cheaper than you can buy a Netapp or EMC. A filesystem approach is the only way to ensure end-to-end data integrity, correcting transmission errors between the host and the storage, etc.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    23. Re:Can we get a real Linux filesystem, please? by belrick · · Score: 1

      Btrfs, like ZFS and Netapp's WAFL, use a far more efficient copy-on-write strategy that avoids the write penalty.

      WAFL doesn't do copy-on-write. Copy-on-write means a write to a block in a file requires the original block to be read, written elsewhere for the snapshot, then the new block written in the original location. That's exactly what WAFL doesn't do. WAFL writes all changed blocks for multiple files in big RAID stripes, updating pointers to current copies and leaving snapshot pointers pointing to old copies of the updated files. Very efficient for writes, but changes almost all reads, random or sequential (within a file) into random reads (within the filesystem) because file blocks get scattered according to write order, not location of the block within the file. That's why they want lots of spindles in an aggregate and they love RAM cache and flash cache.

      But since you say that copy-on-write avoids the write penalty I think you know what it does but simply don't know that it isn't copy-on-write.

    24. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 5, Informative

      VSS is the snapshot solution for NTFS, and of course it uses copy-on-write

      Well. Maybe you better sit down in a comfortable chair and think about this a bit. From Microsoft's site: When a change to the original volume occurs, but before it is written to disk, the block about to be modified is read and then written to a “differences area”, which preserves a copy of the data block before it is overwritten with the change.

      Think about what this means. It is not a "copy-on-write", it is a "copy-before-write". Gross abuse of terminology if anybody tries to call it a "copy-on-write", which has the very specific meaning of "don't modify the destination data". Instead, copy it, then modify the copy. OK, are we clear? VSS does not do copy-on-write, it does copy-before-write.

      Now let's think about the implications of that. First, the write needs to be blocked until the copy-before-write completes, otherwise the copied data is not sure to be on stable storage. The copy-before-write needs to read the data from its original position, write it to some save area, then update some metadata to remember which data was saved where. How many disk seeks is that, if it's a spinning disk? If the save area is on the same spinning disk? If it's flash, how much write multiplication is that? When all of that is finally done, the original write can be unblocked and allowed to proceed. In total, how much slower is that than a simple, linear write? If you said "on the order of an order of magnitude" you would be in the ballpark. In fact, it can get way worse than that if you are unlucky. In the best imaginable case, your write performance is going to take a hit by a factor of three. Usually, much much worse.
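
      A minimal sketch of the two strategies, assuming toy in-memory volumes whose snapshot is taken at construction time. The class names and the I/O counter are hypothetical; the counter tallies logical steps (read, write, metadata update) and is not a measurement of real disk behavior.

```python
class CopyBeforeWriteVolume:
    """Old data is saved to a differences area, then overwritten in place."""

    def __init__(self, blocks) -> None:
        self.blocks = list(blocks)
        self.saved = {}  # differences area: block index -> old contents
        self.io_ops = 0

    def write(self, index: int, data: str) -> None:
        if index not in self.saved:
            old = self.blocks[index]
            self.io_ops += 1  # read the block from its original position
            self.saved[index] = old
            self.io_ops += 1  # write it to the save area
            self.io_ops += 1  # update metadata recording what was saved where
        self.blocks[index] = data
        self.io_ops += 1  # only now does the original write proceed

    def snapshot_view(self):
        return [self.saved.get(i, b) for i, b in enumerate(self.blocks)]


class RedirectOnWriteVolume:
    """New data goes to a fresh block; only the live pointer changes."""

    def __init__(self, blocks) -> None:
        self.store = list(blocks)  # physical blocks, append-only
        self.live = list(range(len(self.store)))
        self.snap = list(self.live)  # snapshot pointers stay frozen
        self.io_ops = 0

    def write(self, index: int, data: str) -> None:
        self.store.append(data)
        self.io_ops += 1  # write the new block elsewhere
        self.live[index] = len(self.store) - 1
        self.io_ops += 1  # update the pointer metadata

    def live_view(self):
        return [self.store[p] for p in self.live]

    def snapshot_view(self):
        return [self.store[p] for p in self.snap]


cbw = CopyBeforeWriteVolume(["a", "b", "c"])
row = RedirectOnWriteVolume(["a", "b", "c"])
cbw.write(1, "X")
row.write(1, "X")
```

      Both volumes end up with the same live data and the same snapshot contents; in this toy count the first overwrite of a block costs four steps with copy-before-write and two with redirect-on-write.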

      OK, did we get this straight? As a final exercise, see if you can figure out who was talking nonsense.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    25. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 2

      If you keep all old file data in its original sectors and write changes in new places, your files get fragmented to hell.

      Microsoft's "shadow copy" doesn't work at the file level, it works at the block level, so it doesn't know anything about files. Btrfs and its ilk try to leave some empty space distributed across the volume, so copy-on-write can leave the copies in fairly reasonable places. After the copy is committed, the original space can be freed, so the next update won't mess things up too badly either. Snapshots mess this up because the original space doesn't get freed. But then, snapshots are always messed up, there is no such thing as a perfect snapshot strategy with respect to disk seeking. Incidentally, with flash you don't care about that any more, there is no seek time.

      Anyway, yes, with a crappy copy-on-write (like Netapp's) you get horrible read fragmentation. With an intelligent implementation, it isn't so bad. Note that Btrfs is turning in good benchmarks, including read performance in mixed read/write loads.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    26. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      LVM is kind of cool but if you have a single database spread across multiple LV's then you can't snapshot them all as an atomic operation so it becomes useless.

      You're also wrong about that. You can concatenate multiple logical volumes as a single logical volume and snapshot that atomically.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    27. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      Btrfs, like ZFS and Netapp's WAFL, use a far more efficient copy-on-write strategy that avoids the write penalty.

      WAFL doesn't do copy-on-write. Copy-on-write means a write to a block in a file requires the original block to be read, written elsewhere for the snapshot, then the new block written in the original location. That's exactly what WAFL doesn't do. WAFL writes all changed blocks for multiple files in big RAID stripes, updating pointers to current copies and leaving snapshot pointers pointing to old copies of the updated files. Very efficient for writes, but changes almost all reads, random or sequential (within a file) into random reads (within the filesystem) because file blocks get scattered according to write order, not location of the block within the file. That's why they want lots of spindles in an aggregate and they love RAM cache and flash cache.

      But since you say that copy-on-write avoids the write penalty I think you know what it does but simply don't know that it isn't copy-on-write.

      We both know what we're talking about, we just disagree on terminology. Properly, a "copy-on-write" doesn't modify the original destination. Nobody should ever use the term "copy-on-write" to describe the algorithm that is properly "copy-before-write". The strategy that leaves the original destination untouched and updates pointers to point at the modified copy is correctly called "copy-on-write", but because the terminology has been so commonly abused by the likes of Microsoft and their followers, it is better to be clear and call that "redirect-on-write".

      Finally, Netapp gets massive read fragmentation because they suck, not because it can't be avoided.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    28. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 4, Insightful

      VSS is the snapshot solution for NTFS, and of course it uses copy-on-write

      Well. Maybe you better sit down in a comfortable chair and think about this a bit. From Microsoft's site: When a change to the original volume occurs, but before it is written to disk, the block about to be modified is read and then written to a “differences area”, which preserves a copy of the data block before it is overwritten with the change.

      Think about what this means. It is not a "copy-on-write", it is a "copy-before-write". Gross abuse of terminology if anybody tries to call it a "copy-on-write", which has the very specific meaning of "don't modify the destination data". Instead, copy it, then modify the copy. OK, are we clear? VSS does not do copy-on-write, it does copy-before-write.

      Now let's think about the implications of that. First, the write needs to be blocked until the copy-before-write completes, otherwise the copied data is not sure to be on stable storage. The copy-before-write needs to read the data from its original position, write it to some save area, then update some metadata to remember which data was saved where. How many disk seeks is that, if it's a spinning disk? If the save area is on the same spinning disk? If it's flash, how much write multiplication is that? When all of that is finally done, the original write can be unblocked and allowed to proceed. In total, how much slower is that than a simple, linear write? If you said "on the order of an order of magnitude" you would be in the ballpark. In fact, it can get way worse than that if you are unlucky. In the best imaginable case, your write performance is going to take a hit by a factor of three. Usually, much much worse.

      OK, did we get this straight? As a final exercise, see if you can figure out who was talking nonsense.

      I concede that the terminology used by the MS article is misused. I don't think you're thinking the performance issues through though. You start with a file nicely laid out linearly on disk, and you take a snapshot so you can make a backup. Now you make a modification to the middle of the file and what happens? Suddenly the middle of the file is elsewhere on disk, and in the case of LVM this is invisible to the filesystem so no amount of defragging is going to fix it. This situation persists long after you have taken your backup and thrown the snapshot away. Of course this doesn't matter for flash but we're not all there yet. If BTRFS does snapshots using copy-on-write (correct definition) then this will be a problem too, although if BTRFS is smart enough it should be able to repair the situation once the snapshot is discarded.

      VSS's way leaves the original data in-order on the storage medium. The difference area is likely on a completely different disk anyway so the copy-on-write (MS definition) could not be performed any other way.

    29. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 1

      LVM is kind of cool but if you have a single database spread across multiple LV's then you can't snapshot them all as an atomic operation so it becomes useless.

      You're also wrong about that. You can concatenate multiple logical volumes as a single logical volume and snapshot that atomically.

      OK this is news to me. When I last asked about that it couldn't be done but that was a few years ago. Google doesn't tell me how I can concatenate (say) my database lv and my logs lv (separate vg's because separate spindles), snapshot them, then un-concatenate them... a link would be appreciated.

    30. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      lvm lets you concatenate any block devices into a virtual block device

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    31. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 2, Informative

      Modifications in the middle of files are extremely rare. It's true, running a database on top of a snapshotted spinning disk is probably going to suck. For normal users, keeping regular files mostly linear, and files in the same directory nearby each other is what matters, and yes, Btrfs does a credible job of that.

      I know why shadow copy works the way it does. 1) It's simple, therefore likely to work. 2) It's an easy answer to the "how do you control fragmentation" question. But the write performance issue is so bad that it's a poor solution no matter how you justify it. It's just an attempt to get away with being lazy for a largely uncritical audience that isn't big into benchmarking, or indeed, isn't used to good disk performance.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    32. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      Sure, Microsoft abuses the CoW terminology and Wikipedia documents that. More politely than necessary, IMHO.

      Copy-on-write leaves the original data unchanged. Copy on write makes a private copy, leaving the original unchanged. Microsoft has a different definition, but then Microsoft has a lot of different definitions. Let's you and me be precise about it, and avoid the terminology that Microsoft has wantonly polluted in its ignorance. Copy-before-write or redirect-on-write.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    33. Re:Can we get a real Linux filesystem, please? by MikeBabcock · · Score: 1

      Funny, even my home box uses LVM over dm-crypt over RAID on Linux just fine. And that's with Ext4 file systems.

      LVM lets me create a snapshot for consistent backups any time I want.

      --
      - Michael T. Babcock (Yes, I blog)
    34. Re:Can we get a real Linux filesystem, please? by MikeBabcock · · Score: 1

      Totally aside from your main point, what does the spindle count have to do with your VG naming?

      pvcreate /dev/sda1
      pvcreate /dev/sdb1
      pvcreate /dev/sdc1

      vgcreate LotsOfDrives /dev/sda1 /dev/sdb1 /dev/sdc1

      Now if you want spindle-specific LVs (lvcreate also needs a size; the 10G here is only a placeholder):
      lvcreate -L 10G -n dbdata LotsOfDrives /dev/sdb1
      lvcreate -L 10G -n logdata LotsOfDrives /dev/sdc1

      --
      - Michael T. Babcock (Yes, I blog)
    35. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 2

      I'm still not getting how you can simultaneously snapshot dbdata (optimised for read and write) and logdata (optimised for write) as an atomic operation. "Tough Love (215404)" said "concatenate them together" but I don't get what that means in this context.

      Last time I checked you would still have to snapshot one, then the other, and the resulting snapshots are almost certainly not going to give you a consistent backup because there would have been writes between the first and the second snapshots.
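
      The consistency problem described above can be sketched with two toy volumes. Everything here is hypothetical: `Volume`, `commit`, and the idea that "atomic" simply means no write lands between the two snapshots. It does not model LVM or VSS.

```python
class Volume:
    """Toy volume holding an ordered list of transaction ids."""

    def __init__(self) -> None:
        self.records = []

    def snapshot(self):
        return list(self.records)


db, log = Volume(), Volume()


def commit(txn_id: int) -> None:
    """A transaction appends to the log first, then to the data volume."""
    log.records.append(txn_id)
    db.records.append(txn_id)


commit(1)

# Snapshot one volume, then the other, with a write landing in between.
db_snap = db.snapshot()
commit(2)
log_snap = log.snapshot()
sequential_consistent = db_snap == log_snap

# Snapshot both with no write in between (standing in for an atomic snapshot).
db_atomic, log_atomic = db.snapshot(), log.snapshot()
atomic_consistent = db_atomic == log_atomic
```

      In the sequential case the log snapshot contains a transaction that the data snapshot has never seen, so restoring the pair would not give a consistent database.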

    36. Re:Can we get a real Linux filesystem, please? by aix+tom · · Score: 1

      Of course you can do that, but then you can't have the database LV and the log LV on different physical disks any more, which is what was asked.

      Can you post an example of how you would concatenate two existing LVs, with existing file systems on them, mounted and being modified at the time, into a "new virtual block device" without even un-mounting them, and then make a consistent snapshot of them?

    37. Re:Can we get a real Linux filesystem, please? by LWATCDR · · Score: 1

      I would say that you should look at BSD then. If you are willing to go open source anyway FreeBSD offers ZFS. Too bad that more hardware and software companies do not support BSD as well as Linux.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    38. Re:Can we get a real Linux filesystem, please? by drinkypoo · · Score: 1

      There's still a way to go before ZFS on linux gets to where it is on FreeBSD but it's still early days, and for many usage patterns it looks like it is ready for production.

      Can I get it as just a module, or do I need to build a custom kernel package? I can do that, but I prefer not to.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    39. Re:Can we get a real Linux filesystem, please? by LWATCDR · · Score: 2

      Wow, just how many clueless people are on Slashdot posting as ACs?
      "No good deduplication software? Don't put duplicate data on the system in the first place!"
      Okay Sparky, you have 5000 users on a server and they all save that email about vacation time or the pictures from the office party. Redundant data. This is a large system with lots of users, it is not for your leet Linux box you have in your mom's basement. Your plays on Microsoft's name are also childish and overdone. Now there is a valid argument that deduplication of data should be done at the array level so that it does not need to be filesystem dependent but that is an argument for knowledgeable adults and not for the likes of you.

      You may go now, you bore me. You may come back when you learn enough to be interesting.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    40. Re:Can we get a real Linux filesystem, please? by smallfries · · Score: 1

      When you have a filesystem that understands hard links, deduplication is still required to find files that have the same content and link them together. You are possibly thinking of a filesystem that hashes contents to decide on storage locations.
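
      A minimal sketch of the "find files that have the same content" step, assuming whole-file hashing with SHA-256 over an in-memory dictionary. The file names and contents are invented for the example, and real deduplication typically works on blocks rather than whole files.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group file names by the SHA-256 of their contents.

    Returns only the groups that contain more than one name.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)


# Hypothetical mailbox contents: two users saved the same message.
files = {
    "alice/vacation.eml": b"Vacation policy update",
    "bob/vacation.eml": b"Vacation policy update",
    "carol/notes.txt": b"something else entirely",
}

duplicates = find_duplicates(files)
```

      What happens to each group afterwards (hard-linking, block sharing, or nothing) is a separate decision from finding it.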

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    41. Re:Can we get a real Linux filesystem, please? by T-Ranger · · Score: 1

      If you ... your employer ... are prepared to spend money, then why not spend money? I mean, and this is a serious question, why not go with something like a EMC VNX or VNXe? Byte for byte of real physical storage SANs are pretty expensive, I grant, but the features can oft make up for that.

    42. Re:Can we get a real Linux filesystem, please? by TCM · · Score: 1

      What a load of BS. What if two files happen to have the same content, but shouldn't really be tied to each other?

      Two hardlinked files are forever stuck together until you unlink them manually, down to their file access times and everything. If I write to one, the other changes.

      Deduplication doesn't have this semantic tie. Two files happen to have the same content? Fine, save space. But write to one file and the other stays as it was. Plus you _still_ have hardlinks if you want to create a semantic connection.

      Not to even mention the fact that deduplication also works if only parts of the files are common.
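
      The semantic difference can be sketched with the standard library, assuming a file system that supports hard links. The independent copy below only stands in for a deduplicated file: it shows the behavior the user sees (writes do not propagate), not the shared storage underneath.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "original.txt")
linked = os.path.join(workdir, "linked.txt")
copied = os.path.join(workdir, "copied.txt")

with open(original, "w") as f:
    f.write("v1")

os.link(original, linked)          # hard link: a second name for the same inode
shutil.copyfile(original, copied)  # separate file that happens to match

# Rewrite the original in place (mode "w" truncates the existing inode).
with open(original, "w") as f:
    f.write("v2")

with open(linked) as f:
    linked_content = f.read()
with open(copied) as f:
    copied_content = f.read()

same_inode = os.stat(original).st_ino == os.stat(linked).st_ino

shutil.rmtree(workdir)
```

      The hard-linked name follows the write because it is the same file, while the separate file keeps its old contents, which is the behavior deduplication preserves.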

      So please, think before you post.

      --
      Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6
    43. Re:Can we get a real Linux filesystem, please? by dna_(c)(tm)(r) · · Score: 2

      [...]I just got out of a meeting at my job [...]and because Linux has no stable filesystem with enterprise features [...]

      Sure, AC has some real complex stuff to handle on an enterprise level. That's why all the big boys like Google, Facebook and Twitter are using Windows to host their data...

      You're either a silly moron, a self deluding enterprisy [a-z]+architect or a very capable troll.

    44. Re:Can we get a real Linux filesystem, please? by TCM · · Score: 1

      So the also non-existent data integrity is the reason they don't have deduplication? Why don't you just say "Yes, we don't have a real filesystem" instead of these laughable arguments?

      --
      Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6
    45. Re:Can we get a real Linux filesystem, please? by iggymanz · · Score: 1

      opensolaris is long dead. OpenIndiana has never put out a stable release and never met their 2011 q1 stable release target. they put out a development release once in a while, but that is NOT production grade nor maintained at a level suitable for production use

    46. Re:Can we get a real Linux filesystem, please? by thrift24 · · Score: 1

      Why would you spread a database over multiple Logical Volumes. That just sounds like a poorly engineered LVM setup. Am I wrong?

    47. Re:Can we get a real Linux filesystem, please? by thrift24 · · Score: 1

      Linux absolutely has production level encryption through the device mapper and support for snapshots with LVM.

      Data deduplication is something I'm not as familiar with, but Microsoft just got support for this in Windows 2012 and Linux has had some dedup support for at least this long. I don't know how production ready either are, but I'm pretty sure I don't trust your accuracy on the matter after your previous claims.

    48. Re:Can we get a real Linux filesystem, please? by Zero__Kelvin · · Score: 2

      "I just got out of a meeting at my job because we are replacing some old large servers... and because Linux has no stable filesystem with enterprise features, looks like things are either going to Windows, or perhaps Solaris x86 (which is expensive.)"

      Somebody notify the millions of Enterprise servers that are Linux based, and serving up a major portion of the internet's content every day! Talk about throwing the baby out with the bathwater. Basically, you don't want to take a chance that established filesystems that have been in use in a corporate environment for well over a decade might fail someday, so you are considering going with an OS that is known to be unstable and requires regular reboots just to keep its security "up to date" (which doesn't mean secure). Bravo! Way to totally foul up a basic system analysis!

      Somebody should invent RAID and regular backups!

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    49. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      Dear Microsoft spinmods: you don't change the fact that your volume snapshots suck by modding down my post.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    50. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      You say: "Copy-on-write leaves the original data unchanged" and VSS leaves the original data unchanged.

      the implementation details doesn't change the logical concept. COW says: "The fundamental idea is that if multiple callers ask for resources which are initially indistinguishable, they can all be given pointers to the same resource" but it looks like you don't get it.

      I did not say that VSS leaves the original data unchanged, I said the opposite. And this is not an "implementation detail", it's a fundamental property of the operation. And could you please read the next sentence after the one you quoted from Wikipedia, it invalidates your argument. And could you please stop chewing on my toes and learn something about computer science.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    51. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      Sounds like Btrfs envy. Question is, can they get it to work reliably?

      Here is an informative post that details why Microsoft's ReFS sucks and you don't need to care about it. Even if it works reliably, which is not at all assured (see many reports on the net of issues), this filesystem is pathetically feature poor. What's the point?

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    52. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 1

      Why would you spread a database over multiple Logical Volumes. That just sounds like a poorly engineered LVM setup. Am I wrong?

      The idea is to spread it over separate underlying disks or RAID sets. MSSQL and Exchange transaction logs are pretty much write only. The databases themselves are read/write, obviously, but still might be read-mostly or write-mostly. By putting them on separate arrays you can optimize the caching, RAID type, and RAID stripe size in each array for its intended purpose. Even spreading different database tables over different arrays can help, depending on the usage patterns.

      Oracle has similar recommendations for their database setups too.

      Even under a basic Linux setup with / in one lv, /var in another, and /home in another, the delay between snapshotting each one isn't desirable, although it is unlikely to have any real-world impact.

    53. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 1

      Dear Microsoft spinmods: you don't change the fact that your volume snapshots suck by modding down my post.

      Troll is a little harsh... I disagree with you but I know you're not trolling and the discussion is still an Interesting one.

    54. Re:Can we get a real Linux filesystem, please? by jamesh · · Score: 1

      When you have a filesystem that understands hard links, deduplication is redundant.

      I would argue that maybe it doesn't belong in the filesystem in the first place. If you have a bunch of VM's all with (say) Debian Wheezy then deduplication in the backend storage would do much more than simple FS deduplication. Some FS knowledge in the storage would be useful (eg files with the same name in each FS are probably a good place to start to look for duplicates) but even that is just an optimisation and isn't required.

    55. Re:Can we get a real Linux filesystem, please? by dbIII · · Score: 1

      By default it builds as a module from source and I don't think anybody is packaging it yet. It seems to use close to 4GB (which seems well over twice what ZFS on FreeBSD appears to be using) so I wouldn't recommend it on anything with less memory than that from what I've seen of it.

    56. Re:Can we get a real Linux filesystem, please? by bzipitidoo · · Score: 1

      Last time I ran a benchmark, FAT was by far the slowest file system. Ext2, 3 and 4, Reiser 3 and 4, btrfs, xfs, jfs, and even ntfs were all much faster. Each varied on different kinds of loads, but the differences between them were insignificant next to the difference in speed between all of them and FAT. Simplicity often doesn't translate to speed. FAT does many things in brain dead ways. Let's rewrite the entire file for every tiny change, and do it right away, no caching! Insert a little something in the middle? Rewrite all the data that comes after it! And don't even try to at least defragment a little bit while doing so.

      --
      Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
    57. Re:Can we get a real Linux filesystem, please? by drinkypoo · · Score: 1

      It seems to use close to 4GB (which seems well over twice what ZFS on FreeBSD appears to be using) so I wouldn't recommend it on anything with less memory than that from what I've seen of it.

      That's an awful lot for a filesystem. What does it use on slowlaris?

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    58. Re:Can we get a real Linux filesystem, please? by Lennie · · Score: 1

      It depends on your needs.

      Take for example the top500, if I'm not mistaken more than 50% of that uses Lustre as the filesystem. Which is obviously Linux based.

      I think both Ceph ("inspired" by Lustre) and btrfs are interesting and I'm sure they'll be more than production ready next year.

      Hopefully with bcache in the mainline kernel too.

      --
      New things are always on the horizon
    59. Re:Can we get a real Linux filesystem, please? by dbIII · · Score: 1

      I'm not sure, my solaris boxes don't have a lot of storage so I haven't touched ZFS on solaris. I've got it running on two FreeBSD machines, one with a total of 2GB memory, and total memory usage rarely goes above 512MB (it went past that when I was moving a 350GB file) so it looks like just a sign that the linux version is still in its early days. I'm moving the 4GB linux machine over to FreeBSD this week since all the memory slots are used.
      I haven't used it a lot, but so far FreeBSD with the ports collection looks to me a lot like what Gentoo linux was intended to become.

    60. Re:Can we get a real Linux filesystem, please? by drinkypoo · · Score: 1

      I haven't used it a lot, but so far FreeBSD with the ports collection looks to me a lot like what Gentoo linux was intended to become.

      Maybe I'll look at it again for my next filer. Last time it seemed to be annoying for the sake of being annoying, which was also my impression of the FreeBSD users I knew personally, but that doesn't mean they're all like that. I've used netbsd and OpenBSD and even 4.3BSD-lite on ROMP but not FreeBSD.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    61. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      Which of course you can do that, but then you can't have the database LV and the log LV on different physical disks any more, which is what was asked. Can you post an example how you would concatenate two existing LVs, with existing file systems on them, mounted and being modified at the time. into a "new virtual block device" without even un-mounting them, and then make a consistent snapshot of them?

      You're delusional, "without even unmounting them" appeared nowhere in the discussion above, nor did the concept of making separate filesystems work together atomically. Your assertion about "different physical disks" doesn't make any sense at all. Of course you can combine different physical disks into a single logical volume. You would then create a single filesystem on the logical volume. Look here for examples.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    62. Re:Can we get a real Linux filesystem, please? by LordLimecat · · Score: 1

      If ext3 is showing as faster than FAT in your benchmarks, your benchmarks are horribly flawed. A non-journaling filesystem with real metadata is going to be oodles faster than any journaling filesystem.

      Heck, I wouldn't be surprised if NTFS were competitive with EXT3; ext3 isn't exactly known as a speed demon.

    63. Re:Can we get a real Linux filesystem, please? by thrift24 · · Score: 1

      You can make a single LV that is striped across separate underlying disks or RAID sets. That's kind of half the point of LVM.

      If you really absolutely wanted a consistent snapshot of the whole filesystem you could just use one LV, although there are of course many good reasons not to do that, but I can't really see a need for /home and /var to be consistent with one another. If you want a DB snapshot, then just snapshot /var; unless your database product is crazy, files shouldn't really be changing anywhere else.

    64. Re:Can we get a real Linux filesystem, please? by donaldm · · Score: 1

      Actually ext4 is way faster than ext3 and has been out a few years now. I make all my file-systems ext4 including my backup disks and have never had any issues. The only thing I have FAT on is some flash drives that I sometimes use to transfer files to MS Windows machines and it is rare for me to go the other way since I normally don't have anything I want on MS Windows machines.

      I did try BtrFS about a year ago but for home use I found it not worth the effort (actually it is really easy) and I am very familiar with AdvFS and ZFS as well as many other types of file-systems. Oh well, I guess I will wait for BtrFS to become more mainstream.

      BTW the original AC post was a very good troll since it was only praising Microsoft file-systems and seemed ignorant about other highly reliable enterprise file-systems such as ext3 which is being superseded by ext4 and JFS to name a few. It must also be noted that JFS is IBM's enterprise file-system that is also run on multimillion disk farms and computer systems and is open source which means it also runs on Linux and is fully supported by IBM.

      --
      There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
    65. Re:Can we get a real Linux filesystem, please? by donaldm · · Score: 1

      I will second a very capable troll. :)

      --
      There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
    66. Re:Can we get a real Linux filesystem, please? by Rich0 · · Score: 1

      btrfs is a step in the right direction, but even now, Linux does not have production-level deduplication (which even Windows has, for crying out loud), encryption, snapshots, or something even close to supplanting LVM2.

      Well, that might be why they're working on btrfs, then. :) I'm not sure about encryption, but everything else on your list is something likely to be in the feature list at some point. It obviously isn't stable yet, but that is a matter of time, and if somebody wanted to make a push to get something stable they'd get there a lot faster with btrfs than reinventing something else.

      btrfs already supports reflink copies (think of a copy that behaves like a hard link on initial copy, but each file tracks its own separate changes, sharing file regions that have not changed). That isn't quite deduplication, but I imagine that somebody will get around to implementing that once things settle down (more important issues like not losing your data to work on right now). If you wanted to do a scan and de-duplicate pass that would be pretty easy, even implemented in userspace (just find files with common regions, delete one, reflink it from the other, and replay the modifications). Obviously directly manipulating the filesystem would be more efficient, and you'd want to build in some kind of index to avoid scans in the filesystem as well.
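
      The "find files with common content" half of such a userspace pass could look roughly like the sketch below. It only catches whole-file duplicates (not shared regions), and find_duplicate_files is a made-up helper name. The reflink step is left out on purpose because it depends on the filesystem and tooling:

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Group regular files under root by SHA-256 of their content.

    Returns a list of groups (sorted path lists) that share identical content.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    return sorted(sorted(paths) for paths in by_digest.values() if len(paths) > 1)
```

      Each returned group is a candidate for keeping one file and reflinking the rest from it; a real tool would also compare the bytes before trusting the hash match.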

      Snapshots are already fully supported on btrfs - and can be done at the level of the root or any folder within the filesystem. A file-level snapshot is just a reflink. Snapshots are first-class citizens and can be mounted, resnapshotted, etc.

      Btrfs can also have quotas at any level of the filesystem, so that basically covers your lvm2 need. It can expand across multiple storage devices, with a few raid-like options supported now, and with more likely to come. You can also tag individual files and tell the system to store just that file with increased redundancy.

      Btrfs is basically the future of linux filesystems - I don't really hear anybody disputing that. It just isn't quite the present, hence ongoing efforts around ext4.

    67. Re:Can we get a real Linux filesystem, please? by kasperd · · Score: 1

      A filesystem approach is the only way to ensure end-to-end data integrity

      Integrity checks in the file system certainly provide much better guarantees than integrity checks on the storage level. And anybody designing file systems today should build integrity checks into their file systems. But the higher a layer you move the integrity checks to, the closer you get to real end-to-end integrity. File system integrity checks don't protect data while it is sitting in memory.

      If you copy a file from one file system to another, it can still be corrupted in transit. Even if both source and destination file system have built-in integrity checks, the copy could get corrupted in the process. But if the source and destination file systems both have integrity checks, they could provide some API to facilitate simpler integrity checks at the higher level. For example if both source and destination file system use a hash-tree for integrity checks, there could be an ioctl to retrieve the root of said hash-tree. Then the cp command could call this after copying and compare hashes of source and destination.
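
      No such ioctl exists as far as this post is concerned, so the sketch below fakes the idea in userspace: it copies a file and then hashes both sides itself. The helper names are made up, and re-reading the destination may be served from the page cache, so this is weaker than asking the file system for its own hash-tree root:

```python
import hashlib
import shutil


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then compare content hashes of both files.

    Returns True when the hashes match, False otherwise.
    """
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

      With a file-system-provided hash the two sha256_of calls would become cheap lookups instead of full re-reads.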

      --

      Do you care about the security of your wireless mouse?
    68. Re:Can we get a real Linux filesystem, please? by petermgreen · · Score: 2

      Also, ZFS is an insane thing written by people who don't seem to understand that keeping a good separation of concerns can lead to a rather slick set of general tools that can be used on almost any fs.

      Separating stuff into layers has benefits but it also has costs. Sometimes merging layers can make things practical that aren't practical with them separate. Afaict this is what drove the creation of zfs and btrfs.

      Let's first look at RAID. Traditional RAID provides protection against reads that fail but not against reads that silently return wrong data. Experience has shown that hard drives cannot be trusted not to silently return wrong data. Worse still, RAID resyncs after power failure may silently overwrite good data with corrupt data. Adding checksums in the RAID layer is difficult because there is nowhere good to put them (you can't just make the blocks slightly smaller because filesystems expect power-of-two block sizes). Putting checksums in the filesystem helps a bit, but even if there is an API to request the "other copy" when the filesystem detects corrupt data, the aforementioned resync may have already overwritten the good version with the bad one. By moving the responsibility for storing data redundantly into the filesystem we can avoid this problem: when doing a consistency check the filesystem can check both copies against the checksum it keeps and ensure it overwrites the bad version with the good one rather than vice versa.
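
      As a rough illustration of that last point, here is a toy scrub of one mirrored block. It assumes the filesystem stored a checksum when the block was written; plain CRC-32 from zlib is used only because it ships with Python, and scrub_mirror is a made-up name, not anything from zfs or btrfs:

```python
import zlib


def scrub_mirror(copy_a, copy_b, stored_crc):
    """Repair one mirrored block using the checksum kept by the filesystem.

    Returns the (copy_a, copy_b) pair as it should look after the scrub.
    Raises ValueError when neither copy matches the stored checksum.
    """
    a_ok = zlib.crc32(copy_a) == stored_crc
    b_ok = zlib.crc32(copy_b) == stored_crc
    if a_ok and b_ok:
        return copy_a, copy_b   # nothing to repair
    if a_ok:
        return copy_a, copy_a   # overwrite the bad B with the good A
    if b_ok:
        return copy_b, copy_b   # overwrite the bad A with the good B
    raise ValueError("both copies fail the checksum; cannot repair")
```

      A checksum-less RAID resync has no way to tell which copy is the good one, which is how it can end up copying in the wrong direction.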

      Also traditional raid requires the whole array to have the same level of redundancy. It's possible to work around this by having multiple arrays but that then means you have to manually allocate space between the arrays. Yes there are ways to grow and shrink arrays but it's extra work and may involve downtime. With redundancy at the filesystem layer you should just be able to tell the filesystem what level of redundancy you want for each directory and let the free space be used for any of them.

      Now let's consider snapshots. Snapshots below the filesystem layer mean that you waste effort snapshotting free space. Worse still, if writing to a snapshotted volume works by remapping blocks then it creates fragmentation in the mapping which is likely to stay around forever. This fragmentation happens even if the blocks were previously free (due to the fact we are snapshotting free space) and may stick around even after the file that caused it to happen is long gone. With snapshots at the filesystem level you don't snapshot free space and while you still get fragmentation you only get it when modifying an existing file (not when creating a new file) and it goes away when the file is deleted. Finally, having snapshots at the filesystem layer means you don't have to snapshot the whole filesystem, you can snapshot individual directories within it.

      Now let's consider dedupe. If you do it below the filesystem layer then to get much benefit from it you have to make your logical devices larger (in aggregate) than your physical devices. That is likely to lead to some very strange errors when you run out of physical blocks but the filesystems still think they have free space. It can also lead to the problem of "garbage" that the dedupe layer thinks needs to be preserved even though it's not actually in use by the filesystem (granted, a trim-like API between the filesystem and the dedupe layer could fix this).

      With encryption, having it as part of the filesystem allows you to choose what you do and don't want encrypted without having to mess around with separate volumes and the administrative overhead thereof (see previous comments about RAID), though how useful this is and whether it is worth the increased risk (too easy to leave clues behind unencrypted) depends hugely on your threat model.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    69. Re:Can we get a real Linux filesystem, please? by aix+tom · · Score: 1

      Sorry, but "LVM is kind of cool but if you have a single database spread across multiple LV's then you can't snapshot them all as an atomic operation" actually IMPLIES the "without even unmounting them". It also implies the "different physical disks".

      If you don't know these basic concepts, then please stop with those "Of course it's possible" posts, because it is not.

    70. Re:Can we get a real Linux filesystem, please? by lsatenstein · · Score: 1

      btrfs is a step in the right direction, but even now, Linux does not have production-level deduplication (which even Windows has, for crying out loud), encryption, snapshots, or something even close to supplanting LVM2.

      I just got out of a meeting at my job because we are replacing some old large servers... and because Linux has no stable filesystem with enterprise features, looks like things are either going to Windows, or perhaps Solaris x86 (which is expensive.)

      This doesn't mean to suck Sun's teat for ZFS access... but at least try to come close to what even NTFS or even ReFS offers...

      ===
      What is the big complaint about btrfs? Is it the egg that it should be hatched at perfection? With the new kernel out this or next week, btrfs will gain major performance improvements. btrfs will surely be a desktop file system, until all security issues are resolved. The DOS attack is done by someone able to use the keyboard on your desktop or server. In my opinion the DOS is really an academic study. ZFS and even EXT4 will have some form of weakness. And NTFS too, if you are a Windows server user.

      I've been using btrfs with Fedora 18 beta since November. I can pull the plug, and it recovers nicely. I have not tested all the wonderful features that come with it, but I will.
      ZFS looks interesting too. Am I stuck on one or the other? Benchmarks measuring speed recommend EXT4 as the best choice. I leave the evaluations and recommendations to you, the reader of my reply.

      --
      Leslie Satenstein Montreal Quebec Canada
    71. Re:Can we get a real Linux filesystem, please? by stoatwblr · · Score: 1

      Try using native zfs instead of zfs-fuse - just make sure you have enough ram if you want to futz around with deduplication.

    72. Re:Can we get a real Linux filesystem, please? by haruchai · · Score: 1

      Not at the enterprise level - we've added 96GB of ECC RAM to each of our chassis for about $1700 each.
      Adding 3TB to our SAN, having the disks validated, new LUNs provisioned, etc, cost over $20k.

      --
      Pain is merely failure leaving the body
    73. Re:Can we get a real Linux filesystem, please? by LordLimecat · · Score: 1

      I wasn't aware that ext3 was considered "highly reliable"; certainly it is a journaling filesystem, but I think I've seen a relatively (compared to number of systems seen) equal number of filesystem disasters on both. I've seen chkdsk recover NTFS from some pretty bad states, and I've seen fsdisk fail spectacularly.

      Ext3 is "reliable" because it has been in service basically forever and because it is journaling, not for any other reason. AFAIK it doesn't have any super-advanced reliability features like checksumming or automatic snapshotting.

    74. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      You're not even talking about LVM, you're talking about making multiple databases operate together atomically. In the context of LVM, that's complete nonsense. It's not an LVM concept, it's a high level application concept. Why are you even wasting bandwidth conflating these issues? If you want to do the job with LVM, you concatenate the volumes and run a single filesystem, or single database on the aggregate volume. Conflating this with the application level consistency somebody dragged into the discussion is just idiotic. By the way, you should keep a lid on the hubris about who knows what.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    75. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      That's not data, it's only a representation, can you get it?.

      Sorry, the person not getting things here is not me. Copy-on-write is a technique of avoiding changing shared data in place. Let's not get all abstract and confused about that, ok? I mean, you can wank on about your data abstractions, but in the process you will also wank away the computer science and miss the point entirely.

      The motivation for not changing the original, shared data in copy-on-write is, we might not know about all the previously existing references to that data, so there may be no practical way to find them all and change them. Microsoft uses a different technique in shadow copy, they do know about all the incoming references from previous snapshots, so they change them to point somewhere else, copy the original data there, then proceed to change the original data. That is not copy-on-write because it does not avoid changing the original, shared data, in place. (See, I spelled out the "in place" thing as a comprehension aid.)

      Two very different techniques, with very different performance characteristics. Copy-on-write is O(1) in number of incoming references, while copy-before-write is O(N) in number of incoming references. Copy-before-write requires stalling a write operation for the duration of the copy and metadata update, while copy-on-write does not. Fundamentally different algorithms, as is apparent because of the different complexity characteristics. Irrespective of the computer science involved, Microsoft ignores the subtleties and calls the second thing copy-on-write anyway. Opening up plenty of opportunity for wanking on Slashdot by the likes of you. Now, if you want to be precise about it, use the term redirect-on-write every time we actually mean the classic computer science concept copy-on-write (which is documented reasonably well on Wikipedia, except in a certain paragraph containing the word "Microsoft").
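
      To make the O(1) versus O(N) point concrete, here is a toy model of a single block with several snapshot references. It follows the description in this post only; the class names and the pointer_updates counter are invented for illustration and say nothing about how shadow copy or any real filesystem is implemented:

```python
class Volume:
    """One logical block with a live pointer and N snapshot pointers."""

    def __init__(self, data, snapshot_count):
        self.store = [data]                    # physical locations
        self.live = 0                          # where the live volume points
        self.snapshots = [0] * snapshot_count  # where each snapshot points
        self.pointer_updates = 0               # metadata writes performed

    def read_live(self):
        return self.store[self.live]

    def read_snapshot(self, index):
        return self.store[self.snapshots[index]]


class RedirectOnWrite(Volume):
    """Classic copy-on-write: new data goes elsewhere, old data is untouched."""

    def write(self, data):
        self.store.append(data)
        self.live = len(self.store) - 1  # only the writer's pointer moves
        self.pointer_updates += 1


class CopyBeforeWrite(Volume):
    """Save the old data elsewhere, repoint every snapshot, overwrite in place."""

    def write(self, data):
        self.store.append(self.store[self.live])  # copy old data first
        saved = len(self.store) - 1
        for i, ref in enumerate(self.snapshots):
            if ref == self.live:
                self.snapshots[i] = saved         # one update per reference
                self.pointer_updates += 1
        self.store[self.live] = data              # then change the original
```

      With five snapshots, one write costs a single pointer update in the first class and five in the second, while both still show the old data through every snapshot.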

      As for your "just a representation" argument, let's extend that. A sorted list is "just a representation" of an unsorted list, therefore sorting costs nothing. Oh wait, that's nonsense. Just as your "that's not data" argument is nonsense.

      Finally, I hope you agree that copy-before-write performance sucks pretty badly compared to classic copy-on-write (redirect-on-write). And by extension, Microsoft's shadow copy feature sucks badly. Confusing the issue by confusing the terminology does help Microsoft avoid criticism over poor performance. But I would not go so far as ascribing to malice what can be more easily explained by incompetence.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    76. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      you said tha VSS changes the original data, don't you? which is false

      It's not false, Microsoft clearly documents that they do exactly that: they change the original data in place after saving a copy of it somewhere else.

      As far as computer science goes, I feel more stupid after reading your post, I'll need to stop doing that now.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    77. Re:Can we get a real Linux filesystem, please? by smash · · Score: 1

      This is why you use ECC ram.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    78. Re:Can we get a real Linux filesystem, please? by MikeBabcock · · Score: 1

      In nearly every case where the files aren't semantically related and hard-linkable, deduplication is silly due to storage costs.
      That's already covered by many others' comments. Do your own thinking.

      --
      - Michael T. Babcock (Yes, I blog)
    79. Re:Can we get a real Linux filesystem, please? by aix+tom · · Score: 1

      I'm NOT talking about multiple databases, I'm talking about the SINGLE database spread across multiple LVs, like Jamesh did.

      It was YOU that came up with the "OH, it's possible with LVMs" nonsense in a reply to him, not me.

    80. Re:Can we get a real Linux filesystem, please? by kasperd · · Score: 1

      This is why you use ECC ram.

      ECC RAM does reduce the rate of such errors, but it does not eliminate them. I have seen undetected single bit errors on systems that were entirely using ECC RAM. I am not saying the errors happened in the RAM, it could have happened on the bus or even inside the CPU. Once you start handling multiple PB of data, such errors show up, even if you are using ECC RAM. Using good hardware helps, but no matter how good hardware you choose, you shouldn't trust it. You need end-to-end integrity at a higher level, that is how we noticed that undetected bit errors had been introduced by the hardware.

      The integrity checks at the lower level only protect data for a small part of the flow. With gaps between the part of the data flow protected by one checksum and the part protected by another checksum, there is a window for errors to be introduced. You can design for such low level integrity checks to overlap, thereby giving you something that is as good at detecting random corruption as an end-to-end integrity check. But it requires lots of understanding of how the hardware operates at the very lowest levels to design for such an overlap of integrity checks. A design with a single end-to-end integrity check has much less risk of introducing windows for corruption through design flaws.

      --

      Do you care about the security of your wireless mouse?
    81. Re:Can we get a real Linux filesystem, please? by TCM · · Score: 1

      In nearly every case where the files aren't semantically related and hard-linkable, deduplication is silly due to storage costs.

      What do you mean, storage costs?

      Cases where hardlinking is wrong but deduplication works: Mail servers. VM storage.

      --
      Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6
    82. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      I'm NOT talking about multiple databases, I'm talking about the SINGLE database spread across multiple LVs, like Jamesh did.

      It was YOU that came up with the "OH, it's possible with LVMs" nonsense in a reply to him, not me.

      Gosh, you're hard to talk to, do people ever tell you that? If it is a single database then LVM concatenation of the volumes will work perfectly well. If it is multiple databases then your assertion immediately above is false. There is no third possibility, what are you going on about?

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    83. Re:Can we get a real Linux filesystem, please? by aix+tom · · Score: 1

      Well, you're not easy to talk to either. ;-) I'm talking about ONE database on MULTIPLE LVMs. (or, in the old days, on multiple RAIDs)

      For example, with Oracle you can put different tables OF THE SAME DATABASE. into different table spaces, on completely different storage devices. For example, put some not-often accessed tables OF THE SAME DATABASE on one LV with a few big big and not so fast SATA disks, and put other tables OF THE SAME DATABASE on a lot of smaller, faster, fibre channel disks on a different LV. And/Or separate data and redo/undo table spaces that way. And definetly put different Online Redo logs on different physical disks / LVs like recommended here.

      I don't blame any volume management that it's not possible to do consistent snapshots for those scenarios (in fact it's not possible with ANY volume management that I now of, though I don't now ZFS at all.), I just wanted to point out that it is indeed impossible do do consistent snapshots in that regard purely on the LVM level.

      With a "small" database, that can be put on one single LV I can just do a snapshot of the LV, copy the content, and start the DB on another machine without a hitch. With a "big" database that is spread over multiple LVs I have to put the Database into some sort of online backup mode, do snapshots of all LVs one after the other, copy the content, and then do a database recovery when I start up the copy to iron out the "discrepancies" between the snapshots.
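
      A minimal sketch of why snapshots taken one after the other disagree, assuming a toy "database" that logs a transaction on one volume and applies it on another. The two dictionaries and the `commit` and `is_consistent` helpers are hypothetical stand-ins, not LVM or Oracle behaviour.

```python
import copy

# Two independent "volumes": one holds the redo log, the other the table data.
redo_volume = {"log": []}
data_volume = {"rows": {}}

def commit(txn_id, key, value):
    """A toy transaction: log it first, then apply it to the data volume."""
    redo_volume["log"].append(txn_id)
    data_volume["rows"][key] = (txn_id, value)

def is_consistent(redo_snap, data_snap):
    """Every row in the data snapshot must come from a logged transaction."""
    logged = set(redo_snap["log"])
    return all(txn_id in logged for txn_id, _ in data_snap["rows"].values())

commit(1, "a", "first")
redo_snapshot = copy.deepcopy(redo_volume)   # snapshot of the first LV
commit(2, "b", "second")                     # the database keeps writing
data_snapshot = copy.deepcopy(data_volume)   # snapshot of the second LV, taken later

snapshots_agree = is_consistent(redo_snapshot, data_snapshot)
print(snapshots_agree)  # False
```

      The data snapshot contains a row from transaction 2 that the older redo snapshot has never heard of, which is the kind of discrepancy the recovery step has to iron out.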

    84. Re:Can we get a real Linux filesystem, please? by Tough+Love · · Score: 1

      OK, I see where you're coming from. The Oracle database you describe depends only on write completion semantics of independent block devices, it does its own recovery, but if random volumes "jump back in time" due to replicating a snapshot, the database can't recover reliably. A well known issue. You can fix this with LVM, though I will not claim that this is elegant. Concatenate all the physical volumes together, then allocate separate logical volumes on top of that, that exactly match the underlying physical volumes. Run the database on those logical volumes. To create a consistent state of all the underlying volumes, pause the database and flush all the logical volumes. You now have a consistent set of physical volumes you can copy somewhere and recover the database from.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
  3. CRC by RedHackTea · · Score: 1

    My knowledge of file-systems is minimal. But since it's a CRC attack, can you just turn off the ability of Btrfs to check errors (if that's possible)? However, I'm sure data corruption would then ensue.

    Anyway, I'm glad I always use ext4/3. I thought about trying ZFS at one point, but decided that using Solaris as a non-server OS is pointless. Does anyone still use Solaris?

    --
    The G
    1. Re:CRC by Mike+Domanski · · Score: 1
      From the linked blog:

      Directories are indexed in two different ways. For filename lookup, there is an index comprised of keys:

      Directory Objectid | BTRFS_DIR_ITEM_KEY | 64 bit filename hash

      The default directory hash used is crc32c, although other hashes may be added later on. A flags field in the super block will indicate which hash is used for a given FS.

      Sounds like btrfs uses a CRC as a hash. I assume it's a performance optimization, but using CRC as a hash is insane.
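
      A small sketch of why a CRC is easy to manipulate: it is affine over GF(2), so checksums of equal-length inputs combine under XOR. This uses `zlib.crc32` (plain CRC-32) as a stand-in, on the assumption that crc32c differs only in its polynomial and shares the same structure. The file names are made up.

```python
import zlib

def xor_bytes(*chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

# Three arbitrary names of the same length (15 bytes each).
a, b, c = b"report-2012.txt", b"backup-0001.tar", b"notes-final.doc"

# The CRC of the XOR equals the XOR of the CRCs: no mixing an attacker can't undo.
lhs = zlib.crc32(xor_bytes(a, b, c))
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
print(lhs == rhs)  # True
```

      With that much algebraic structure, building names that hit a chosen hash value is a matter of solving linear equations rather than guessing.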

    2. Re:CRC by Anonymous Coward · · Score: 1

      For short messages like filenames, MD5 takes 70 times as long to compute as CRC... And since the published attacks on MD5 let you create collisions pretty cheaply, you could still do the same attack.

      If anything you'd use a construct like SipHash, but SipHash requires a secret key and a 64-bit output isn't really collision resistant anyway.

    3. Re:CRC by Tough+Love · · Score: 1

      a 64-bit output isn't really collision resistant anyway

      Plenty good enough for a hashed directory key, which doesn't need to be cryptographically secure, just to have good distribution and random results affected as much as possible by all input bits. The size of the output is not the dominant factor, the quality of the input mixing is.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    4. Re:CRC by maxwell+demon · · Score: 2

      Or just use a RB tree instead of a linear list for hash collisions, then you get only O(log n) instead of O(n) worst case search performance.

      To quote Wikipedia:

      Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if [...] one must guard against many entries hashed to the same slot (e.g.[...] in the case of web sites or other publicly accessible services, which are vulnerable to malicious key distributions in requests).

      While a file system is not generally publicly available (actually it may be, if e.g. used on an FTP server), it is still shared.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    5. Re:CRC by maxwell+demon · · Score: 1

      Does anyone still use Solaris?

      That's a brand of cooking oil, right?

      No, it's a novel by Stanisław Lem.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    6. Re:CRC by MikeBabcock · · Score: 1

      There are much more efficient hashes than MD5 that would work as well for fewer clock cycles. http://cr.yp.to/hash127.html comes to mind.

      --
      - Michael T. Babcock (Yes, I blog)
    7. Re:CRC by cpghost · · Score: 1

      I thought about trying ZFS at one point, but decided that using Solaris as a non-server OS is pointless. Does anyone still use Solaris?

      Have you thought about using ZFS on FreeBSD? Running FreeBSD/amd64 here on a desktop machine with ZFS file systems without any problems.

      --
      cpghost at Cordula's Web.
    8. Re:CRC by Wolfrider · · Score: 1

      --Yah, the movie version with George Clooney had a pretty hot chick in it, too... :D

      --
      .
      == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
    9. Re:CRC by maxwell+demon · · Score: 1

      So it's effectively a hash table where the hash is stored in a B-tree instead of being used as an array index. It still has the base characteristics of a hash table, though:

      Step 1: You calculate a hash from your key.
      Step 2: You map that hash key to a container (in a standard hash table: Use it as index; in btrfs: look it up in a B-tree)
      Step 3: You seek your actual item in that container (in the usual hash table, and apparently also in btrfs: A linear list; to protect against malicious attacks: a balanced tree).

      What you described is how step 2 differs in btrfs from the usual hash table. My comment was about how to do better on step 3. And yes, a B-tree would be a better option than an RB-tree there. When I wrote "RB tree" I actually meant "balanced tree".
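
      The three steps can be sketched roughly as follows. This is a toy model, not btrfs code: a sorted Python list searched with `bisect` stands in for the B-tree, `zlib.crc32` stands in for crc32c, and the `HashIndexedDirectory` class is a hypothetical name.

```python
import bisect
import zlib

class HashIndexedDirectory:
    """Toy directory: a sorted index of hashes stands in for the B-tree."""

    def __init__(self):
        self.hashes = []      # sorted hash keys
        self.containers = []  # containers[i] holds entries whose hash is hashes[i]

    def _find(self, h):
        # Step 2: ordered search for the hash instead of using it as an array index.
        pos = bisect.bisect_left(self.hashes, h)
        found = pos < len(self.hashes) and self.hashes[pos] == h
        return pos, found

    def add(self, name, inode):
        h = zlib.crc32(name)                 # Step 1: hash the key
        pos, found = self._find(h)
        if not found:
            self.hashes.insert(pos, h)
            self.containers.insert(pos, [])
        self.containers[pos].append((name, inode))

    def lookup(self, name):
        h = zlib.crc32(name)                 # Step 1
        pos, found = self._find(h)           # Step 2
        if not found:
            return None
        for entry_name, inode in self.containers[pos]:  # Step 3: linear scan
            if entry_name == name:
                return inode
        return None

directory = HashIndexedDirectory()
directory.add(b"a.txt", 101)
directory.add(b"b.txt", 102)
print(directory.lookup(b"b.txt"))  # 102
```

      Step 3 is the part that degrades when many names share one hash, since every lookup then walks the same container.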

      --
      The Tao of math: The numbers you can count are not the real numbers.
  4. Requires local access by Anonymous Coward · · Score: 5, Funny

    no more dangerous than a fork bomb or filling up /tmp or trying to compile open office.

    1. Re:Requires local access by cryptizard · · Score: 5, Informative

      Sort of, but at least you can recover from those attacks by restarting or booting from an external source to clean up your filesystem. The second attack here leaves you with undeletable files because the file system code responsible for deleting cannot handle the multiple hash collisions. There is no way to recover from that until a patch is pushed out that fixes the problem.

    2. Re:Requires local access by blade8086 · · Score: 2

      Which, without the over sensationalized BS that is this story, will probably be in about a week tops.

      And since BTRFS is not in any 'enterprise' Linux distributions, the fix will pretty much be available
      immediately, since everyone running it in critical production environments will probably be running
      pretty bleeding edge linuxen

    3. Re:Requires local access by Anonymous Coward · · Score: 1

      Requires local access

      Well, it requires the ability to create named files. That could happen through a Wiki upload page, by extraction of an archive to a temporary folder for processing, etc.
      And unlike filling up /tmp, this will not be stopped by setting a quota.

    4. Re:Requires local access by someones · · Score: 1

      this will be easily stopped by adding a filename prefix or suffix. There goes this script kiddie's while about experimental software not being perfect.

    5. Re:Requires local access by ArsenneLupin · · Score: 1

      Well, it requires the ability to create named files. That could happen through a Wiki upload page, by extraction of an archive to a temporary folder for processing, etc.

      ... or worse, web caches which preserve original file names...

    6. Re:Requires local access by drinkypoo · · Score: 1

      The second attack here leaves you with undeletable files because the file system code responsible for deleting cannot handle the multiple hash collisions. There is no way to recover from that until a patch is pushed out that fixes the problem.

      There's no filesystem debugger for btrfs?

      Seems to me like fsck ought to be able to solve this problem, too. Two files with the same hash? Delete the one with the newer timestamp.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    7. Re:Requires local access by GrievousMistake · · Score: 1

      this will be easily stopped by adding a filename prefix or suffix

      No it won't. It is still easy to make collisions with a known prefix or suffix. You would have to include a random component.
      Even if that was a feasible workaround, it's hardly a common best practice, nor should it be.
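
      A sketch of the point about known prefixes and suffixes, assuming `zlib.crc32` as a stand-in for crc32c. The collision is found by a plain birthday search over made-up names, which is not the method from the original write-up, and `find_collision` is a hypothetical helper.

```python
import hashlib
import itertools
import zlib

def find_collision():
    """Birthday search for two distinct 12-byte names with equal CRC-32."""
    seen = {}
    for i in itertools.count():
        # Pseudo-random 12-character hex name derived from the counter.
        name = hashlib.sha256(str(i).encode()).hexdigest()[:12].encode()
        crc = zlib.crc32(name)
        if crc in seen and seen[crc] != name:
            return seen[crc], name
        seen[crc] = name

a, b = find_collision()
prefix, suffix = b"upload_", b".dat"

# For equal-length names, a fixed prefix or suffix leaves the collision intact.
same_plain = zlib.crc32(a) == zlib.crc32(b)
same_wrapped = zlib.crc32(prefix + a + suffix) == zlib.crc32(prefix + b + suffix)
print(same_plain, same_wrapped)  # True True
```

      The search typically needs on the order of a hundred thousand candidates, so it finishes almost immediately.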

      There goes this script kiddie's

      He discovered this vulnerability himself, and wrote the attack code; he is by definition not a script kiddie. Never mind that he's a professor and published cryptographer.

      while about experimental software not being perfect.

      This has nothing to do with being experimental software. This is not a bug, it is a weakness in the design. Furthermore, the bad behaviour will not manifest by accident - you have to deliberately provoke it.
      This is the type of problem that isn't fixed before someone finds and reports it -- like Junod did.

      Please cease your inane babbling.

      --
      In a fair world, refrigerators would make electricity.
    8. Re:Requires local access by cryptizard · · Score: 2

      Two files with the same hash is not a problem, it is allowed. This will happen just by chance many times on your filesystem because the hash is relatively short (64 bits). The problem is when you engineer many files to have the same hash and your data structure (hash table) degrades to an array. There is also some other problem in the code here that makes it so that the hash table can't store, or for some reason can't process, more than a certain number of collisions.

  5. Nice! by gweihir · · Score: 3, Interesting

    "Algorithmic Complexity Attacks" like this one have long been known, but rarely been documented publicly. One good example to point out why hash-randomization is a good idea!

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:Nice! by Anonymous Coward · · Score: 3, Funny

      Words, they mean nothing! Take 'rarely' for example, who gives a shit, I'll read it as 'never' same thing.

    2. Re:Nice! by gweihir · · Score: 1

      Well, nice. An example for somebody completely missing the point! This is not about cryptographic hash collisions at all, they are a completely different problem. This is about hash-tables, a data-structure.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  6. Nice this was found before BTRFS goes stable by Anonymous Coward · · Score: 5, Insightful

    Hopefully more people start fuzzing btrfs so it is that much better when it is declared stable.

    1. Re:Nice this was found before BTRFS goes stable by Rich0 · · Score: 1

      Lots of people have been doing testing on btrfs. Filesystems aren't so much declared as stable as they become used as stable. Unless the fix changes the on-disk format in some non-backwards-compatible way, it doesn't really matter when the fix gets deployed. Most likely the fixes will be in git in a week or two.

      Oh, and anybody who really wants to run btrfs should probably be running the git version anyway. They're doing so many bugfixes per month that this is one of those rare times where the mainline kernel sources are likely to be in much worse shape. Once things settle down that will obviously change.

    2. Re:Nice this was found before BTRFS goes stable by Rich0 · · Score: 1

      Just one of those issues with running an open-source OS published by the vendor of a proprietary OS. OpenSUSE and Fedora tend to be treated like guinea pigs.

      And this isn't necessarily a bad thing. If you're a RHEL shop then you probably want to have some Fedora test systems to get a sense for how your applications will operate in future versions.

      If you want something free and stable, you run something like Debian or CentOS, or whatever.

  7. Who cares? by UltraZelda64 · · Score: 1

    Unstable software that is still under heavy development is actually unstable. Who would've guessed?
    I think that based on this ingenious discovery, we should all switch over to it by next week.

  8. Good god man by tomp · · Score: 2

    "Denial-of-Service Attack Found In Btrfs File-System" didn't happen. A vulnerability was found. That's a big deal, no reason to obscure it.

    1. Re:Good god man by blade8086 · · Score: 1

      No, actually, this is NEITHER a DOS Attack, nor a vulnerability. It is a *bug*

      But oh so much better to douschebag promote yourself by being the super terducken 31337 hax0r sekuritah expert
      by mislabeling it and having it get picked up by the tech press.

  9. Re:Can I install btrfs on windows? by Anonymous Coward · · Score: 1

    Yeah, I'll send you the installer. What's your e-mail address?

  10. Attack? by Decameron81 · · Score: 2

    An attack was found in the filesystem? What's that supposed to mean?

    --
    diegoT
    1. Re:Attack? by dr2chase · · Score: 1

      Carefully chosen file names (a lot of them) can DOS file system performance. Whether this could be escalated to a network vulnerability, hard to say -- if an attacker over the net can figure out a way to induce particular file names on the server, that would be worse.

      It's a little sad that people are still forgetting about this failure mode of hash tables and hash functions; either there's got to be a randomizing secret swizzled in, or a better (more nearly cryptographically strong) hash function, or both.
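
      A minimal sketch of the "randomizing secret" idea, assuming keyed BLAKE2b from Python's `hashlib` as a stand-in for whatever keyed hash a file system might actually use. The `keyed_name_hash` helper and the per-file-system secret are illustrative assumptions, not how btrfs works.

```python
import hashlib
import os

def keyed_name_hash(name: bytes, key: bytes) -> int:
    """64-bit keyed hash of a file name (BLAKE2b used as a stand-in)."""
    digest = hashlib.blake2b(name, key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")

# Assumed to be generated once per file system and kept from unprivileged users.
secret = os.urandom(16)

name = b"innocent-looking-name.txt"
h1 = keyed_name_hash(name, secret)
h2 = keyed_name_hash(name, secret)          # same key: same slot every time
h_other = keyed_name_hash(name, b"k" * 16)  # a different key places it elsewhere
print(h1 == h2)
```

      Without the key an attacker cannot precompute which names land in the same slot, so collisions have to be found by trial against the live system instead of offline.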

    2. Re:Attack? by dr2chase · · Score: 1

      True, but good random numbers (good hashes) have interesting and powerful statistical properties.

    3. Re:Attack? by dr2chase · · Score: 1

      Read about universal hash functions (the writeup on wikipedia is not that bad). They're not a hack.

      You don't necessarily use a small space, either -- a 64-bit hash is not normally regarded as a small space, though it is often smaller than the bit size of what is hashed into it.

      Two problems with trees are that you need to define a comparison (you can often concoct one, but they're not always given to you) and though memory is cheap, *probes* into memory are not. If a hash function can get you there in 1 step with high probability, that's interesting.

    4. Re:Attack? by maxwell+demon · · Score: 1

      The two approaches are not mutually exclusive. A hash is an array of containers. Usually people use linear lists as containers because it's the simplest, and hash collisions are considered rare so the O(n) characteristics shouldn't matter. But when hash collisions may be intentionally caused, it's obvious that you should use a container more suited to your problem. Just think about what container you'd use if you weren't able to use a hash table, and then use that same container for the hash table array entries.

      Or in short, make your hash table an array of balanced trees instead of linked lists. That way you get O(1) typical behaviour (assuming a good hash function) and O(log n) worst case (which includes malicious attacks).
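
      To put rough numbers on that, here is a sketch that counts probes in a single overloaded slot. A sorted list with binary search stands in for the balanced tree (same logarithmic lookup bound, though not the same insertion cost), and the 1024 names are simply assumed to collide.

```python
def linear_probes(bucket, key):
    """Count entries examined by a linear scan of an unsorted bucket."""
    for count, item in enumerate(bucket, start=1):
        if item == key:
            return count
    return len(bucket)

def binary_probes(sorted_bucket, key):
    """Count entries examined by a binary search over a sorted bucket."""
    lo, hi, count = 0, len(sorted_bucket), 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if sorted_bucket[mid] == key:
            return count
        if sorted_bucket[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return count

# Pretend all of these names hash to the same slot (already in sorted order).
names = [f"file{i:05d}" for i in range(1024)]

worst_linear = max(linear_probes(names, n) for n in names)
worst_binary = max(binary_probes(names, n) for n in names)
print(worst_linear, worst_binary)
```

      The linear scan has to touch every entry in the worst case, while the ordered container stays within about log2(n) probes no matter how the collisions were produced.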

      --
      The Tao of math: The numbers you can count are not the real numbers.
    5. Re:Attack? by Noughmad · · Score: 1

      An attack was found in the filesystem? What's that supposed to mean?

      I'm not sure, but it sure sounds like Mr. Reiser had something to do with it.

      --
      PlusFive Slashdot reader for Android. Can post comments.
    6. Re:Attack? by maxwell+demon · · Score: 1

      Worst case for a tree is O(n), not O(log n).

      What exactly did you not understand about balanced tree?

      --
      The Tao of math: The numbers you can count are not the real numbers.
    7. Re:Attack? by petermgreen · · Score: 1

      A simple binary tree has similar problems to a simple hash table. Namely that by controlling the items that are added to the structure it's possible to effectively turn it into a list (in a hash table you do it by putting everything in one bucket, in a binary tree you do it by making sure that each node you add ends up as a child of the previous node you added).

      Of course there are countermeasures against this attack on binary trees just like there are countermeasures against attacks on hash tables.
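
      A quick sketch of that degenerate case, using a deliberately naive, unbalanced binary search tree written for illustration (the `Node`, `insert` and `height_after` names are hypothetical). Inserting keys in sorted order makes every node the child of the previous one.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Plain unbalanced BST insert; returns the root and the new node's depth."""
    if root is None:
        return Node(key), 1
    node, depth = root, 1
    while True:
        depth += 1
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root, depth
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root, depth
            node = node.right

def height_after(keys):
    """Insert all keys and report the height of the resulting tree."""
    root, height = None, 0
    for key in keys:
        root, depth = insert(root, key)
        height = max(height, depth)
    return height

keys = list(range(500))
sorted_height = height_after(keys)        # attacker-chosen insertion order
shuffled = keys[:]
random.Random(0).shuffle(shuffled)
shuffled_height = height_after(shuffled)  # typical insertion order
print(sorted_height, shuffled_height)
```

      The sorted insertions produce a tree as tall as the number of keys, which is exactly the linked list described above; a self-balancing tree is the countermeasure that rules this out.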

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    8. Re:Attack? by Decameron81 · · Score: 1

      So ... a vulnerability was found.
      VULNERABILITY. ATTACK. Different words. Different meanings.

      Exactly! That's what I meant with my question, although I think it went unnoticed for some. You just don't find an attack!

      --
      diegoT
  11. Re:This just in... by FranTaylor · · Score: 1

    Are you saying that google's file systems are corrupt?

  12. No by ArchieBunker · · Score: 2, Interesting

    Instead of picking a filesystem and moving forward people will moan and cry and eventually split into a few different groups with beta level implementations. Sound on Linux is a great example. Two completely different sound drivers that both work half assed. What's the word with XFS these days?

    --
    Only the State obtains its revenue by coercion. - Murray Rothbard
    1. Re:No by drinkypoo · · Score: 1

      What's the word with XFS these days?

      I don't know, but my last word is that I dropped it due to data corruption and now I'm using ext4 while I'm waiting for btrfs.

      I was hoping to be using bcache by now too, but alas, no. I have an 80GB SSD and a 320GB HDD, which I will bump up to 2x1TB stripe and backup to 2TB external... just as soon as I can install with bcache without having to do it all manually.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    2. Re:No by diegocg · · Score: 1

      What's the word with XFS these days?

      http://www.youtube.com/watch?v=FegjLbCnoBw

    3. Re:No by Wolfrider · · Score: 1

      --Have you tried JFS? I'm a heavy Vmware user and it works really well, with minimal CPU usage.

      --
      .
      == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
  13. Attack vector by aNonnyMouseCowered · · Score: 1

    Indeed, the title makes you think that BTRFS was trojaned or worse is malware.

  14. So a script kiddie found by someones · · Score: 1

    So a script kiddie found a vulnerability on an experimental filesystem.

    There are warnings not to use btrfs in a stable environment EVERYWHERE, as it's in development and pretty buggy.
    But it's the amazing feature set that btrfs offers, even if it's still pretty broken and you cannot rely on it without doing daily backups... and all this script kiddie is concerned about are colliding hashes in a shared environment, which is pretty unlikely, as this will not happen naturally?
    A btrfs filesystem becoming corrupt because it fills up would be something to care about at this time.

    1. Re:So a script kiddie found by iggymanz · · Score: 1

      actually most client/server file systems can be DOS'd by too many requests.....local access generally implies the ability to clog things up

  15. Re:This just in... by someones · · Score: 1

    do you have write access to their filesystem?
    Or do you just have write access to some database where you can tag the data with a "filename"?

  16. Re:Dedupe doesn't belong in a filesystem by LWATCDR · · Score: 3, Informative

    You then turn it off.... And go take your meds.
    I do not think you know what DeDup means. You as a user still see two copies of the file. If you make changes to one copy of the file it will only change that copy of the file. It is not like a link. In other words it is totally transparent to the end user but saves drive space. So if you work in a large organization and someone sends out an email to all 4000 people that email will only take up the space of one email, even if everyone saves it on the IMAP server.
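
    A toy sketch of that behaviour, assuming block-level deduplication keyed by a SHA-256 digest. The `DedupStore` class and the mailbox paths are made up for illustration, and the sketch never frees unreferenced blocks.

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store with per-file block lists."""

    def __init__(self):
        self.blocks = {}  # digest -> block contents, stored once
        self.files = {}   # file name -> list of digests

    def write(self, name, blocks):
        digests = []
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep one physical copy
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return [self.blocks[d] for d in self.files[name]]

store = DedupStore()
mail = [b"Subject: all-hands\n", b"See you at 10.\n"]
store.write("alice/inbox/1", mail)
store.write("bob/inbox/1", mail)
physical_after_copy = len(store.blocks)   # two blocks for two identical mails

# Bob edits his copy: only his block list changes, Alice's mail is untouched.
store.write("bob/inbox/1", [mail[0], b"See you at 11.\n"])
physical_after_edit = len(store.blocks)
print(physical_after_copy, physical_after_edit)  # 2 3
```

    Both users always see their own complete file; the sharing only exists underneath, which is what separates this from a hard link.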

    In other words you do not know what you are talking about, you probably do not need these functions because you probably do not run a server or servers for a large organization, you seem to have some anger issues, and maybe just a little nuts.

    --
    See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
  17. Can we get a real editor? by Anonymous Coward · · Score: 2, Insightful

    Editors please! I normally expect even a submitter to know the difference between an attack and a vulnerability. However the editor damn well better know the difference. When I read that an ATTACK had been found in btrfs I went to read about how some malicious code had been placed into the code for btrfs. Maybe this code modifies data, erases stuff, sends data to China, or just renames files. But no, this was a simple vulnerability. They didn't find an attack in btrfs, they found the potential for an attack - which is called a vulnerability. Let's at least make an effort here.

    1. Re:Can we get a real editor? by Nimey · · Score: 2

      ed(1) is the standard text editor.

      --
      Hail Eris, full of mischief...

      E pluribus sanguinem
  18. Enterprise architect here by iggymanz · · Score: 1

    Deduplication typically isn't done by the operating system in production systems, it is a feature of enterprise grade storage, backup and archival systems.

    Snapshots and encryption can be done in GNU/Linux, or done outside the OS.

    What enterprise grade storage/backup/archival systems are you using, the obvious solution will already be evident from that answer in most cases.

  19. Re:Another crazy white guy goes on a rampage by Zero__Kelvin · · Score: 2

    It is stupid to make this racial, but since you did, when was the last time a black guy opened up on a group of innocent school children?

    --
    Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
  20. Epic Fail (The joke's on you) by Zero__Kelvin · · Score: 2

    A good joke requires significant planning.

    --
    Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
  21. Vulnerability if you already have access .... by JasterBobaMereel · · Score: 1

    So if you get local access to a system running a btrfs filesystem then you can destroy it ...but if you have local access you can easily do that anyway with any filesystem ....?

    --
    Puteulanus fenestra mortis