Slashdot Mirror


ZFS Gets Built-In Deduplication

elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
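The block-level scheme the summary describes can be sketched in a few lines. This is a toy illustration only, not ZFS code: the `DedupStore` class, its method names, and the 4-byte block size are all assumptions made for the example. Each block is hashed with SHA-256 and stored once per distinct digest.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical tiny block size so the example is easy to follow


class DedupStore:
    """Toy block store: identical blocks are kept once, keyed by SHA-256."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes
        self.files = {}   # file name -> list of digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])


store = DedupStore()
store.write("a", b"AAAABBBBAAAA")
store.write("b", b"BBBBCCCC")
print(len(store.blocks))  # 3 unique blocks back 5 logical blocks
```

Reads are unaffected: each file keeps its own list of digests, so the sharing is invisible to callers.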

386 comments

  1. Does that mean... by Anonymous Coward · · Score: 4, Funny

    Duplicate slashdot articles will be links back to the original one?

    1. Re:Does that mean... by Shikaku · · Score: 1

      Er, isn't block deduplication really, really bad from a hard drive block failure point of view? You'd have to compress or otherwise change the data to have a copy now, or it'd just be marked redundant; if the block that all those redundant nodes point to goes bad, all of those files are now bad.

    2. Re:Does that mean... by ezzzD55J · · Score: 3, Insightful

      The single block is still stored redundantly, of course. Just not redundantly more than once.

    3. Re:Does that mean... by Anonymous Coward · · Score: 0

      Doesn't matter in ZFS's case - if there's a single unrecoverable bad block anywhere in the filesystem, it becomes unusable. (To be fair, it's really good at recovering from bad blocks.)

    4. Re:Does that mean... by Methlin · · Score: 2, Insightful

      Er, isn't block deduplication really, really bad from a hard drive block failure point of view? You'd have to compress or otherwise change the data to have a copy now, or it'd just be marked redundant; if the block that all those redundant nodes point to goes bad, all of those files are now bad.

      If you were concerned about block level failure or even just drive level failure, you wouldn't be running your ZFS pool without redundancy (mirror or raidz(2)).

    5. Re:Does that mean... by hedwards · · Score: 2, Insightful

      That requires a citation.

      ZFS isn't that much different than traditional file systems. I'm not quite sure how that reconciles with the fact that it reports unrecoverable bits of information when it couldn't self heal to you. If it were that unusable there'd be no point. Additionally there isn't really much likelihood of that happening considering that ZFS isn't really supposed to be used outside of a ZMIRROR or RAIDZ environment. Sure you can do it, but most of the goodness comes from multiple disks.

    6. Re:Does that mean... by mistshadow · · Score: 1

      As the other poster said, you need to cite this. If the unrecoverable block is in user data, ZFS will return EIO when that block is read. If the unrecoverable block is in metadata, ZFS keeps at least 1 other copy of the metadata, so it will just read the other copy.

      The same thing will happen if the checksum doesn't match.

    7. Re:Does that mean... by mysidia · · Score: 1

      That's why you use a 2-way mirror, RaidZ, or RaidZ2, depending on your performance/space/MTTL tradeoffs.

      Or set copies=2

      It's also worth noting, that the dedup allows setting a threshold, such that additional copies of the block are kept each time the threshold number of duplicates has been exceeded.

    8. Re:Does that mean... by CMonk · · Score: 1

      Additionally there isn't really much likelihood of that happening considering that ZFS isn't really supposed to be used outside of a ZMIRROR or RAIDZ environment.

      This requires a citation.

      And why not? ZFS is still great in single disk setups. It will tell you when your data is corrupt way more reliably than older filesystems. It just won't auto-correct it because there are only checksums, not parity, for those blocks. It is still way more robust and resistant to corruption than older file-systems. It's easier to manage ... blah blah blah. Oh yeah, if I set copies=2 or 3 or 4 or 5 I CAN indeed automatically repair corruption due to sector failures, etc. just like in a multi disk setup. I can also tell it to make multiple copies of the metadata on a single disk.

    9. Re:Does that mean... by mistshadow · · Score: 1

      All true, except that copies can only be 1, 2, or 3; there's only room for three locations in a block pointer.

    10. Re:Does that mean... by Anonymous Coward · · Score: 0

      That requires a citation.

      ZFS isn't that much different than traditional file systems. I'm not quite sure how that reconciles with the fact that it reports unrecoverable bits of information when it couldn't self heal to you. If it were that unusable there'd be no point. Additionally there isn't really much likelihood of that happening considering that ZFS isn't really supposed to be used outside of a ZMIRROR or RAIDZ environment. Sure you can do it, but most of the goodness comes from multiple disks.

      Actually in a single drive situation you could still use zfs ditto copies for important filesystems. So it could still be useful for single drive situations.

    11. Re:Does that mean... by Anonymous Coward · · Score: 0

      Heh, I used to worry about this as well as speed - since you have lots and lots of spindles to aggregate lots of I/O, but now you're reducing that workload onto what amounts to fewer spindles (in theory). So often-used blocks would become a bottleneck (for example: a common block of a DLL in your VMs or something). I spoke with a Netapp engineer, and their solution to this was that they cached aggressively, so the hits on those blocks were actually performance *boosters*, not bottlenecks.

    12. Re:Does that mean... by noidentity · · Score: 2, Insightful

      Duplicate slashdot articles will be links back to the original one?

      No, see, this de-duplication is transparent at the interface level. So while dupes won't take extra disk space on Slashdot servers, we'll still see them as normal. Isn't it nice to know that this optimization will be taking place?

    13. Re:Does that mean... by Sillygates · · Score: 1

      but ditto blocks are highly replicated. Even if a file gets corrupted, and ZFS is unable to recover the error, the metadata should not be damaged. This means that whole directories, full of files, and such.

      --
      I fear the Y2038 bug
    14. Re:Does that mean... by mr+crypto · · Score: 2, Interesting

      De-dup also means some unexpected behavior. Want to copy a 5 GB file? Done in less than a second.

      Over-write a section of a dup'ed file with new content? Suddenly you're using more disk space, or could even get a "disk full" message even though you were just replacing data, but not increasing its size in an obvious way.

      Trying to make space on a drive by deleting lots of big files that happen to be dup'ed? No effect.
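The three surprises listed in this comment all fall out of reference counting. Here is a minimal sketch of that idea, assuming a toy in-memory store with hypothetical `write`, `delete` and `blocks_used` helpers and a 4-byte block size; it does not model how ZFS actually tracks references.

```python
import hashlib
from collections import Counter

BLOCK = 4          # hypothetical block size for the example
refs = Counter()   # digest -> number of references from files
files = {}         # name -> list of digests


def delete(name):
    for d in files.pop(name, []):
        refs[d] -= 1
        if refs[d] == 0:
            del refs[d]  # last reference gone: the block is really freed


def write(name, data):
    delete(name)  # replacing a file drops its old references first
    digests = [hashlib.sha256(data[i:i + BLOCK]).digest()
               for i in range(0, len(data), BLOCK)]
    refs.update(digests)
    files[name] = digests


def blocks_used():
    return len(refs)


write("orig", b"AAAABBBBCCCC")
write("copy", b"AAAABBBBCCCC")        # "copying" only adds references
used_after_copy = blocks_used()       # 3
write("copy", b"AAAAXXXXCCCC")        # overwrite in place, same file size
used_after_overwrite = blocks_used()  # 4: the new block needs fresh space
write("copy", b"AAAABBBBCCCC")
delete("copy")                        # deleting a duplicate frees nothing
used_after_delete = blocks_used()     # 3
```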

    15. Re:Does that mean... by ezzzD55J · · Score: 1

      True, to a lesser extent this is already true with files that have large holes in them, or are hardlinks, though. ZFS makes it a little weirder I admit.

  2. This is good news... by The+Ancients · · Score: 1, Offtopic

    ...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.

    File systems are one area where computer technology is lagging, comparatively speaking, so good to see innovation such as this.

    1. Re:This is good news... by Anonymous Coward · · Score: 0

      In same boat. Had a SolarisX86 box around here at one time years ago.

      They must be reading my mind. I can't tell you how many times in the datacenter I wished this existed.

    2. Re:This is good news... by bcmm · · Score: 4, Insightful

      ...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.

      Use open source, get cutting edge things.

      --
      # cat /dev/mem | strings | grep -i llama
      Damn, my RAM is full of llamas.
    3. Re:This is good news... by MBCook · · Score: 1

      It's neat. I can see it being rather useful for our systems at work to de-duplicate our VMs (and perhaps our DB files, since we have replicated slaves). Network storage (where multiple users may have their own copies of static documents that they've never edited) could benefit, perhaps email storage as well.

      Personally though, I don't think there is too much on my hard drive that would benefit from this. I would love for OS X to get the built in checksumming that ZFS has so it can detect silent corruption that may have happened during a bad boot/power loss etc when I try to read the file later.

      It's pretty obvious that HFS+ will have to be replaced soon, and Apple is reportedly working on it (since they ditched ZFS). I'd really like the checksumming, at this point (having so much cheap storage and extra CPU cycles) it should be a gimme.

      --
      Comment forecast: Bits of genius surrounded by a sea of mediocrity.
    4. Re:This is good news... by jeffb+(2.718) · · Score: 3, Funny

      Use open source, get cutting edge things.

      The last time I tried to build an Intel box for Linux work, I lost my grip on the cheap generic case, and sustained a cut that sent me to the emergency room. One of the things I like about my Mac is the lack of cutting edges.

    5. Re:This is good news... by The+Ancients · · Score: 1

      Use open source, get cutting edge things.

      Cutting edge is nice for the functionality; unfortunately it more often than not comes with unintended functionality. I like standing back a bit - not too much mind you, but enough to avoid the bleeding edge.

    6. Re:This is good news... by Anonymous Coward · · Score: 4, Funny

      Shoulda gone with a blade server, then you wouldn't have had to worry about the emergency room.

    7. Re:This is good news... by MrCrassic · · Score: 1

      This is called doin it wrong! :)

    8. Re:This is good news... by Anonymous Coward · · Score: 0

      You surely meant: 'the bloody edge'.

      damn that edge.

    9. Re:This is good news... by Tynin · · Score: 2, Informative

      Not sure when you tried building it, but I build cheap computers for friends / family, at least 2 or 3 computers a year. Almost a decade ago... maybe really only 8 years ago, all cheapo generic cases stopped having razor sharp edges. I used to get cuts all the time, but cheap cases, at least in the realm of having sharp edges, haven't been an issue in a long time. (I purchase all my cheapo cases from newegg these days)

    10. Re:This is good news... by Trepidity · · Score: 4, Informative

      If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in doing the deduplication process, and drive space at consumer-level sizes is dirt-cheap, so it's only really worth doing this if you have a lot of block-level duplicate data. That might be the case if e.g. you have 30 VMs on the same machine each with a separate install of the same OS, but is unlikely to be the case on a normal Mac laptop.

    11. Re:This is good news... by 644bd346996 · · Score: 1

      What about archived snapshots of my Windows VM? Or for that matter, my entire Time Machine back-end? Block-level deduplication is basically a prerequisite for being able to have Time Machine back up a virtual hard drive image.

    12. Re:This is good news... by Trepidity · · Score: 1

      Why not just use a better archiver than Time Machine, which includes backup-level deduplication, instead of mucking with the filesystem? Any of the bazillions of incremental-backup solutions based on rsync will avoid putting a new copy of the whole VM image in every snapshot.

    13. Re:This is good news... by MightyYar · · Score: 1

      Buy a nice case? You're a Mac user, so I know you have the coin :)

      I have Macs, but when I built my wife's PC I used a nice Antec P150 which looks nice and is really quiet (it's in our bedroom).

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    14. Re:This is good news... by evilviper · · Score: 1

      it's only really worth doing this if you have a lot of block-level duplicate data. That might be the case if e.g. you have 30 VMs on the same machine

      ...or if you, I dunno, EVER BACK-UP YOUR DAMN SYSTEM!!!

      Duplicate data is exactly what filesystem snapshots are all about. Try backing up your data every day, for 10 days, keeping all the changed versions of files... Gee, do I want to buy a HDD that is 10X as large as all the rest of my storage space combined, or do I want one that is 1.5X as large? Tough choice.

      I've been doing this for a long time with rsync and hard links. I'm not sure how reliable and well-performing this is going to be built-in to the filesystem, and at the block level, but I'm damn sure happy to try it out... at home, at work, everywhere.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    15. Re:This is good news... by joe_bruin · · Score: 2, Insightful

      Use open source, get cutting edge things.

      I run Linux, where's my ZFS? No, FUSE doesn't count.

    16. Re:This is good news... by 644bd346996 · · Score: 2, Insightful

      Time Machine has by far the easiest to use interface of any backup solution that is at least as powerful. And because it does file-level deduplication using hardlinks, the backups themselves are standard directory trees, browsable in every way that the rest of the filesystem is. If the block-level deduplication is part of the backup software and not the filesystem, the archives will be opaque files that usually can only be manipulated by the backup software itself. This means the user has little or no choice between UIs for restoring from the archive, and it usually prevents the archives from being indexed by something like Spotlight.

      By adding one small filesystem feature (hardlinks to directories), Apple made it possible to trivially implement a good incremental backup system. (The under-the-hood parts of Time Machine could be implemented in a fairly short shell script run as a cron job.) They then proceeded to put the slickest UI ever around their backup system, but still left it open for other programs. If Apple added block-level deduplication to their filesystem, they wouldn't even have to touch the Time Machine code and it would become the best personal backup software in history.

    17. Re:This is good news... by Trepidity · · Score: 1

      Sure, I agree that backups shouldn't duplicate unchanged data in every backup snapshot. But that would only happen if your "backup software" were literally imaging the drive and storing the images. Nearly every decent piece of backup software already does incremental snapshots, and imo it makes a lot more sense for that functionality to be in the backup system than as part of the filesystem.

    18. Re:This is good news... by evilviper · · Score: 1

      Nearly every decent piece of backup software already does incremental snapshots

      Incremental doesn't cut it... Not even close. That's why anybody even knows a stupid term like "dedupe".

      When a full backup is corrupted or deleted, the incrementals become useless. When one of the incrementals is corrupted or deleted, those following it similarly become (almost-)useless. Differentials would improve the situation, but then you're back to duplicating a large amount of data.

      When you have linked snapshots ("deduped" backups), ALL are both full and incremental backups... if you delete the oldest, the data size shrinks, but the next-most-recent works just fine. You NEVER run a "full backup" over again, only ever incrementals, dramatically reducing time, network bandwidth usage, etc.

      imo it makes a lot more sense for that functionality to be in the backup system than as part of the filesystem.

      If you do it at the file-level, you're wasting a lot of space unnecessarily. If you do it at the application level... you're re-creating a "deduping" filesystem in higher-level software, with the overhead of being on top of an existing filesystem.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    19. Re:This is good news... by Trepidity · · Score: 1

      If you do it at the application level... you're re-creating a "deduping" filesystem in higher-level software, with the overhead of being on top of an existing filesystem.

      Yes, but if the backup software is cross-platform (as many are), you get portability, which is less common with filesystem drivers (try mounting that ZFS disk on Windows). You also restrict the deduplication performance hit to the backups, instead of anything else on the volume also suffering the hit.

    20. Re:This is good news... by bennomatic · · Score: 1

      whoosh?

      --
      The CB App. What's your 20?
    21. Re:This is good news... by Anonymous Coward · · Score: 0

      And why doesn't FUSE count?

    22. Re:This is good news... by BigMeanBear · · Score: 1

      FUSE doesn't count because it suffers limitations that are inherent with being executed as user-mode software. FUSE ZFS is also an incomplete implementation.

      --
      += E
    23. Re:This is good news... by Hurricane78 · · Score: 1

      Well, if you had payed the the same buttload of money, you payed for that Mac case, you wouldn't have gotten any cutting edges. And, for that price, the PSU and a whole expensive Freon-based cooling, or passive water cooling system would have been yours too. (Or half of the hardware inside, at Mac "quality".)

      Mac hardware may *seem* like high quality, because on the absolute scale, it is. But on the scale defined by its *price*, it's rather el-cheapo, and sometimes utter crap.

      Now if only absolute scales would exist in this universe! ^^ (If you disagree, learn your physics!)

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    24. Re:This is good news... by Anonymous Coward · · Score: 0

      I wish different forms of payed existed! pay paid and paying sheesh people, and you actually seem intelligent!

    25. Re:This is good news... by ILongForDarkness · · Score: 1

      True, but have you ever tried to open up a Mac laptop or Mac Mini? Every time I see it done (I'm a server guy so don't do it myself :-)) it makes me wince. Always seems like it's going to break, using a putty-knife-like thing to wedge the plastic bit off the corner and pry the thing open. That said, still better than the 40 or so screws needed to disassemble the Toshibas at my last job.

    26. Re:This is good news... by ILongForDarkness · · Score: 1

      A blade server would be a perfect format for a desktop system if it wasn't for that damned 4-8RU chassis :-)

    27. Re:This is good news... by Anonymous Coward · · Score: 0

      ZFS is under "license problems" since the license it's under is incompatible with GPL.
      And it's troublesome :(

    28. Re:This is good news... by TheRaven64 · · Score: 1

      Time Machine does file-based incremental backups, but if you change one block in a file then you get an entire new copy of it. Rsync does incremental backups on byte ranges, but it updates the remote copy so you need to combine it with a snapshot mechanism if you want to be able to restore from any of the incremental backups. If you run Time Machine on a ZFS ZVOL exported via iSCSI with dedup enabled then it will send complete copies of the modified files but the server will only store the modified blocks within these files.

      Lots of backup solutions already include block or byte-range dedup support. The nice thing about putting it in the filesystem is that it means that you can use a fairly naive backup program. Rather than requiring every application that might benefit from deduplication to implement its own version, you let every application just store data and let the OS do the job of managing resources.
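To put rough numbers on the file-level versus block-level difference described above, here is a small sketch. The two cost functions, the 4096-byte block size, and the synthetic 100-block "image" are assumptions for illustration; neither Time Machine nor ZFS is being modelled in any detail.

```python
import hashlib

BLOCK = 4096  # hypothetical block size


def file_level_cost(versions):
    """Bytes stored when only whole identical files are shared."""
    unique = {hashlib.sha256(v).digest(): len(v) for v in versions}
    return sum(unique.values())


def block_level_cost(versions):
    """Bytes stored when identical fixed-size blocks are shared."""
    unique = {}
    for v in versions:
        for i in range(0, len(v), BLOCK):
            block = v[i:i + BLOCK]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# A made-up 100-block "disk image" whose blocks are all distinct, backed up
# twice; the second version differs from the first in a single block.
v1 = b"".join(i.to_bytes(2, "big") * (BLOCK // 2) for i in range(100))
v2 = v1[:BLOCK * 10] + b"\xff" * BLOCK + v1[BLOCK * 11:]

print(file_level_cost([v1, v2]))   # two full copies
print(block_level_cost([v1, v2]))  # one copy plus one changed block
```

With file-level sharing a one-block change costs a whole new image; with block-level sharing it costs one block.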

      --
      I am TheRaven on Soylent News
    29. Re:This is good news... by TheRaven64 · · Score: 1

      That's what iSCSI is for. You can export a ZVOL via iSCSI (trivial from OpenSolaris, a couple of commands on FreeBSD) and mount it on any modern OS. The OS can then run its own FS in the ZVOL, but you still get O(1) snapshots, dedup, on-disk checksums and almost all of the other nice ZFS features.

      --
      I am TheRaven on Soylent News
    30. Re:This is good news... by Anonymous Coward · · Score: 0

      ... That might be the case if e.g. you have 30 VMs on the same machine each with a separate install of the same OS...

      You are doing it wrong!
      The correct approach is to install the OS in one VM and then clone said VM 30 times. This way the original virtual disk becomes read-only and each of the clones stores only the diff. Done! Saves almost the same amount of space (except in the off-chance that the clones all contain similar differences to the original), works on every filesystem and no performance penalty.

    31. Re:This is good news... by Anonymous Coward · · Score: 0

      Rsync does incremental backups on byte ranges, but it updates the remote copy so you need to combine it with a snapshot mechanism if you want to be able to restore from any of the incremental backups

      Use the --link-dest option for file-level deduplication with rsync. rsync -av --link-dest=$PWD/prior_dir host:src_dir/ new_dir/ (the example in the man page) would copy files from host:src_dir/ to new_dir/, except when there's already an identical file in $PWD/prior_dir -- in that case it would hard-link to it.
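The hard-link idea behind that option can be imitated in a short script. This is a simplified sketch, not rsync's algorithm: the `snapshot` function is hypothetical, it only handles a flat directory, and it compares full file contents, whereas rsync has its own rules for deciding that a file is unchanged.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(src_dir, prior_dir, new_dir):
    """Copy src_dir to new_dir, hard-linking files unchanged since prior_dir."""
    os.makedirs(new_dir)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        prior = os.path.join(prior_dir, name) if prior_dir else None
        dst = os.path.join(new_dir, name)
        if prior and os.path.isfile(prior) and filecmp.cmp(src, prior, shallow=False):
            os.link(prior, dst)     # unchanged: share the existing inode
        else:
            shutil.copy2(src, dst)  # new or changed: store a fresh copy


root = tempfile.mkdtemp()
src = os.path.join(root, "src")
os.makedirs(src)
for name, text in [("same.txt", "unchanged"), ("edit.txt", "version 1")]:
    with open(os.path.join(src, name), "w") as f:
        f.write(text)

snap1 = os.path.join(root, "snap1")
snap2 = os.path.join(root, "snap2")
snapshot(src, None, snap1)
with open(os.path.join(src, "edit.txt"), "w") as f:
    f.write("version 2")
snapshot(src, snap1, snap2)

# The unchanged file is one inode with two names; the edited file is not shared.
same_shared = os.path.samefile(os.path.join(snap1, "same.txt"),
                               os.path.join(snap2, "same.txt"))
edit_shared = os.path.samefile(os.path.join(snap1, "edit.txt"),
                               os.path.join(snap2, "edit.txt"))
with open(os.path.join(snap1, "edit.txt")) as f:
    old_text = f.read()
shutil.rmtree(root)
```

Every snapshot directory is a complete tree on its own, which is why the oldest one can be deleted without breaking the newer ones.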

    32. Re:This is good news... by Man+Eating+Duck · · Score: 1

      One of the things I like about my Mac is the lack of cutting edges.

      Well... My computer went bust the other day, and the only replacement computer I had was an ibook G4. It was painful.
      When I replaced the PSU in my desktop I sliced my hand pretty badly on the Zalman cooler, those are sharp as razorblades. I didn't even feel it at first. I still prefer those cutting edges to the MacIntosh :)
      (No, I'm not making this up)

      --
      Are you a grammar Nazi? I'm trying to improve my English; please correct my errors! :)
    33. Re:This is good news... by BrentH · · Score: 1

      I want my Zee Ef Ess.....

      guitar solo

    34. Re:This is good news... by BrentH · · Score: 1

      Thing is, there's no good reason not to do it. CPUs have more cycles to spare than HDs have bandwidth (don't forget that aspect of dedup!) or bytes. The state of filesystems in 2009 is basically the same as it was 15 years ago, while ZFS shows that you can do lots of little things that make life easier, simpler, more efficient, more secure, etc. etc. Why not have that in 2009? The machines can do it, ZFS is the software that can do it, why not?

    35. Re:This is good news... by TheRaven64 · · Score: 1

      And then you are still storing complete copies of the files. With ZFS, you don't actually need deduplication to turn this in to a complete incremental backup solution, because a copy in ZFS can really be a hard link with copy-on-write support, so you just need rsync to copy the file from prior_dir then update the modified byte ranges. If it doesn't do that, then you want to turn on deduplication.

      --
      I am TheRaven on Soylent News
    36. Re:This is good news... by beerbear · · Score: 1

      Intelligent non-native speakers. On a website, where natives regularly use 'there' instead of 'their', don't be too hard on us bloody foreigners.

      --
      Hold my beer and watch this!
    37. Re:This is good news... by puthan · · Score: 1

      I beg to differ. I have scratched myself with the sharp corner of the access panel on my macpro!

    38. Re:This is good news... by jeffb+(2.718) · · Score: 1

      I dunno. The story was true. I still occasionally get the urge to put together a cheap box for something or other, and GP was informative to me, at least.

    39. Re:This is good news... by Ant+P. · · Score: 1

      Here you go.

      I really don't see why ZFS deserves a front page article every time it assimilates another piece of old news. I guess it's a lesson to any software project looking for publicity - build everything into a monolithic ball of mud and win free slashvertisements.

    40. Re:This is good news... by ckaminski · · Score: 1

      HP makes desktop-class blades that have a head unit on your desk, and the computer itself is back in the datacenter. It makes an interesting case for having easy-to-replace parts in an environment where laptops are a no-go.

    41. Re:This is good news... by Anonymous Coward · · Score: 0

      http://en.wikipedia.org/wiki/Nexenta_OS

    42. Re:This is good news... by pankkake · · Score: 1

      Typical Mac user. Just buy a good case by, for example, Antec. Better and cheaper than any Mac.

      --
      Kill all hipsters.
    43. Re:This is good news... by chis101 · · Score: 1

      I bought a Rosewill case 3 years ago that gave me 6 stitches :)

    44. Re:This is good news... by Trepidity · · Score: 1

      It's not implausible that's true, but I'm not convinced. Are there benchmarks showing that, overall, the performance of a typical single-OS-install desktop system actually improves using ZFS?

    45. Re:This is good news... by stefanlasiewski · · Score: 1

      In the case of Deduplication, Open Source has been lagging far behind commercial alternatives. Deduplication has been available from DataDomain, Netapp and other vendors for several years now.

      DataDomains are a great alternative to tape storage. Several tapes were ruined, but I never had a problem retrieving data from a DataDomain.

      With ZFS, maybe I can finally have my cheap Dedup server at home.

      --
      "Can of worms? The can is open... the worms are everywhere."
    46. Re:This is good news... by ILongForDarkness · · Score: 1

      Wouldn't that be what was known as a terminal back in the day?

    47. Re:This is good news... by ckaminski · · Score: 1

      Possibly, except it's got a KVM+USB interface back to the blade as opposed to something like RDP/Citrix or ILO. It's a costly solution for just providing PCs to people, but for certain environments, like trading floors, it's a great solution to keeping spare parts around and someone running. Hardware dies, you just reallocate their KVM head to another spare unit, and boot it.

      Anywho, it's a special purpose blade environment - not very useful in general.

  3. First posts! by Anonymous Coward · · Score: 0

    I wrote two first posts, but I guess /. is on ZFS now.

    1. Re:First posts! by BitZtream · · Score: 1

      Why did you write your first 'first post' to say that you wrote 'two' first posts? You must have, or they wouldn't be duplicate blocks, and wouldn't have been deduplicated.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    2. Re:First posts! by Anonymous Coward · · Score: 0

      Because I then proceeded to communicate to myself back in time, telling myself I must write just like that and that the reason why would become obvious. I then replied to my future self that, given the title of the story, the reason why was already obvious, and my future self said "Oh yeah, now that you mention it, I remember thinking that".

  4. ehem by oldhack · · Score: 0

    Before we get all excited and look all silly, can somebody confirm with Netcraft first?

    --
    Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
  5. Hash Collisions by UltimApe · · Score: 2, Interesting

    Surely with high amounts of data (that zfs is supposed to be able to handle), a hash collision may occur? I'm sure a block is > 256bits. Do they just expect this never to happen?

    Although I suppose they could just be using it as a way to narrow down candidates for deduplication... doing a final bit for bit check before deciding the data is the same.

    --
    "Infecting minds with my own memetic virus, one post at a time." Ultimape
    1. Re:Hash Collisions by CMonk · · Score: 3, Informative

      That is covered very clearly in the blog article referenced from the Register article. http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

    2. Re:Hash Collisions by Score+Whore · · Score: 1

      Yeah. If you are concerned by the fact that a block might be 128 KB and the hashed value is only 256 bits, then an option like:

      zfs set dedup=verify tank

      Might be helpful.

    3. Re:Hash Collisions by Shikaku · · Score: 1

      If blocks that are supposedly from different files have the same block data, does it really matter if it's marked redundant?

      Not only that, do you really think a SHA256 hash collision can occur? And even if it does, for the sake of CPU time, a hash table is made for a quick check rather than checking every piece of data from the to be written and already available data to see if there is a copy in situations as this. If somehow they have the same hash, it SHOULD be checked to see if it is the same data byte by byte, THEN marked redundant.
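The "hash table first, byte-by-byte check second" approach can be sketched as follows. Since nobody can produce a SHA-256 collision on demand, the example plugs in a deliberately terrible 8-bit hash so that the verify path actually runs; `weak_hash` and `VerifyingStore` are hypothetical names, and this is not a description of ZFS internals.

```python
def weak_hash(block):
    """Deliberately terrible 8-bit 'hash' so collisions are easy to trigger."""
    return sum(block) % 256


class VerifyingStore:
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.table = {}  # hash -> list of distinct blocks sharing that hash

    def put(self, block):
        """Store a block and return a (hash, index) reference to it."""
        h = self.hash_fn(block)
        bucket = self.table.setdefault(h, [])
        for i, existing in enumerate(bucket):
            if existing == block:  # byte-for-byte check before sharing
                return h, i
        bucket.append(block)       # same hash, different data: keep both
        return h, len(bucket) - 1

    def get(self, ref):
        h, i = ref
        return self.table[h][i]


store = VerifyingStore(weak_hash)
r1 = store.put(b"ab")
r2 = store.put(b"ba")  # collides with b"ab" under weak_hash
r3 = store.put(b"ab")  # a true duplicate of the first block
```

Here `r1` and `r3` share storage, while `r2` gets its own slot despite the matching hash.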

    4. Re:Hash Collisions by Rising+Ape · · Score: 2, Informative

      The probability of a hash collision for a 256 bit hash (or even a 128 bit one) is negligible.

      How negligible? Well, the probability of a collision is never more than N^2 / 2^h, where N is the number of blocks stored and h is the number of bits in the hash. So, if we have 2^64 blocks stored (a mere billion terabytes or so for 128 byte blocks), the probability of a collision is less than 2^(-128), or 10^(-38). Hardly worth worrying about.

      And that's an upper limit, not the actual value.
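The bound quoted above is easy to check with exact arithmetic. A small sketch, with `collision_bound` as a hypothetical helper name:

```python
from fractions import Fraction


def collision_bound(num_blocks, hash_bits):
    """Upper bound N^2 / 2^h on the chance of any collision among N blocks."""
    return Fraction(num_blocks ** 2, 2 ** hash_bits)


bound = collision_bound(2 ** 64, 256)
print(bound == Fraction(1, 2 ** 128))  # True
print(float(bound))                    # about 2.9e-39, below the quoted 10^(-38)
```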

    5. Re:Hash Collisions by pclminion · · Score: 2, Funny

      Suppose you can tolerate a chance of collision of 10^-18 per-block. Given a 256-bit hash, it would take 4.8e29 blocks to achieve this collision probability. Supposing a block size of 512 bytes, that's 223517417907714843750 terabytes.

      Now, supposing you have a 223517417907714843750 terabyte drive, and you can NOT tolerate a collision probability of 10^-18, then you can just do a bit-for-bit check of the colliding blocks before deciding if they are identical or not.
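Those figures can be reproduced under one assumption: the usual birthday approximation p ≈ N^2 / 2^(h+1), solved for N. The comment does not say which formula was used, so treat this as a plausible reconstruction rather than the poster's own working.

```python
import math

p = 1e-18        # tolerated collision probability for the whole filesystem
hash_bits = 256

# Birthday approximation p ~ N^2 / 2^(h+1), solved for the block count N.
blocks = math.sqrt(2 * p * 2.0 ** hash_bits)
print(f"{blocks:.2e}")  # 4.81e+29, which the comment rounds to 4.8e29

# The terabyte figure follows from the rounded count, 512-byte blocks and
# 2^40 bytes per terabyte; integer arithmetic keeps it exact.
quoted_tb = 48 * 10 ** 28 * 512 // 2 ** 40
print(quoted_tb)        # 223517417907714843750
```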

    6. Re:Hash Collisions by pclminion · · Score: 2, Interesting

      Oops. I didn't mean 10^-18 per-block, I meant 10^-18 for the entire filesystem. (Obviously it doesn't make sense the other way)

    7. Re:Hash Collisions by icebike · · Score: 2, Funny

      If blocks that are supposedly from different files have the same block data, does it really matter if it's marked redundant?

      I think the hash collision people are worrying about is when two blocks/files/byte-ranges are hashed to be identical but in fact differ.

      When that happens your Power Point presentation contains your boss's bedroom-cam shots.

      --
      Sig Battery depleted. Reverting to safe mode.
    8. Re:Hash Collisions by shutdown+-p+now · · Score: 4, Informative

      Before I left Acronis, I was the lead developer and designer for deduplication in Acronis Backup & Recovery 10. We also used SHA256 there, and naturally the possibility of a hash collision was investigated. After we did the math, it turned out that you're about 10^6 times more likely to lose data because of hardware failure (even considering RAID) than you are to lose it because of a hash collision.

    9. Re:Hash Collisions by Just+Some+Guy · · Score: 1

      Surely with high amounts of data (that zfs is supposed to be able to handle), a hash collision may occur?

      The birthday paradox says you'd have to look at 2^(n/2) candidates, on average, to find a collision for a given n-bit hash. In this case, that means you'd have to look at about 2^128 objects to find a collision with a particular one.

      On my home server, the default block size is 128KB. With a terabyte drive, that gives about 8.4 million blocks.

      GmPy says the likelihood of an event with probability of 1/(2^128) not happening 8.4 million times (well, 1024^4/(128*1024) times) in a row is 0.99999999999999999999999999999997534809671184338108088348233. In other words, that's how likely you are to fill a 1TB drive with 128KB blocks without a single hash collision.

      I can live with that.
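
      For anyone without GmPy, the same figure can be approximated with the standard-library decimal module. This sketch follows the comment's own model (each block is treated as an independent 2^-128 event), which later replies in the thread dispute; the 60-digit precision is an arbitrary choice.

```python
from decimal import Decimal, getcontext

getcontext().prec = 60   # enough digits to see past the long run of nines

blocks = 1024 ** 4 // (128 * 1024)          # 1 TiB of 128 KiB blocks
per_block = Decimal(1) / Decimal(2 ** 128)  # assumed per-block collision chance
no_collision = (1 - per_block) ** blocks

print(blocks)
print(no_collision)
```

      The gap between the result and 1 is very nearly blocks * 2^-128, which is 2^-105.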

      --
      Dewey, what part of this looks like authorities should be involved?
    10. Re:Hash Collisions by buchner.johannes · · Score: 1

      I have an idea for an attack vector.

      Say File A is one block big. File A is publicly available on the server, not writable by users. Eve produces a SHA256 hash collision of file A and stores this file B in ~. Someone wants to retrieve file A but gets file B (e.g. like evilize exe for MD5).
      Alternatively, if always the oldest file is kept, Eve has to know the next version of the file.

      Given big blocks and time until cryptanalysis for SHA256 is at the state of where it is with MD5, why not?

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    11. Re:Hash Collisions by Anonymous Coward · · Score: 0

      Apparently you suck at math as much as in life.

    12. Re:Hash Collisions by hedwards · · Score: 1

      If I'm not mistaken, that would be a waste of time. Ultimately, you're looking to get a file executed in most cases, in which case you don't really need that; you just need some other exploit. If you do need to get that file retrieved, there are better ways of doing that as well.

    13. Re:Hash Collisions by dotgain · · Score: 2, Interesting
      Before the instruction you posted, I found this explanation in TFA:

      An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':

      I fail to see how someone can sit down and rationally decide whether their data will be more susceptible to hash collisions or not. While I would be very surprised if any two blocks on my computer hash to the same value in spite of being different, it seems to me that someone's going to get hit by this sooner rather than later. And what a nasty way to find hash collisions! Who would have thought my Aunt's chocolate cake recipe had the same SHA1 as hello.jpg from goatse.cx!.

      On one hand, 2^256 is a damn big keyspace. I've heard people say a collision is about as likely as winning every lottery in the world simultaneously, and then doing it again next week. But give enough computers with enough blocks enough time, and find a SHA1 collision you will. Depending on what kind of data it happens to, you might not even notice it.

    14. Re:Hash Collisions by TheRaven64 · · Score: 1

      Yes, it's a valid attack once you can generate SHA256 hash collisions, in the same way that 'sit between two parties and decrypt their communication' is a valid attack on RSA once you can factorise the product of two primes quickly. Currently, the best known attack on SHA256 is not feasible (and won't be for a very long time if computers only follow Moore's law).

      --
      I am TheRaven on Soylent News
    15. Re:Hash Collisions by sgbett · · Score: 1

      Hey! If no-one will notice then it won't be a problem ;)

      --
      Invaders must die
    16. Re:Hash Collisions by shutdown+-p+now · · Score: 1

      Say File A is one block big. File A is publicly available on the server, not writable by users. Eve produces a SHA256 hash collision of file A

      The whole point of a cryptographic hash function is that you're not supposed to be able to produce input matching a given hash value other than by brute force - that is, 2^N evaluations, where N is the digest size in bits. That's an ideal state - in practice, number of evaluations can be reduced, and this is also the case for SHA256, but for this particular scenario (finding a message corresponding to a known hash, rather than just any two messages that collide with a random hash), it is still way beyond the number that is practical for a successful real-world attack.

    17. Re:Hash Collisions by SLi · · Score: 1

      No. We're talking about such amounts of data needed that there's no conceivable way now or in the near (1000-year) future that such a collision would be found by accident, and even after that only on some supercomputer that is larger than earth and is powered by its own sun. It's not going to happen by accident. The probabilities are just so much against it, given any conceivable amount of data - and there are elementary limits that come from physics that cannot be surpassed. Moore's law will stop working sooner or later, and then the humanity will not be much closer to finding an SHA-256 collision by accident.

      The only realistic way you're going to have a hash collision is malice (or perhaps fate or divine intervention, if you believe in such). That's not anywhere near realistic actually now, but if a significant weakness were found in SHA-256, it could become a possibility one day (and judging from history I'd say it's probable it will be broken sooner or later). An attacker that can store a file on your filesystem can then replace your precious data with crafted data with the same hash.

      Some other smaller attack vectors come to mind though, depending on how it's implemented. If the deduplication shows on filesystem usage, an attacker could use it to check if you have a certain block of data on the filesystem (in a file inaccessible to him). For example.

    18. Re:Hash Collisions by SLi · · Score: 1

      But then you could just use your magic SHA-256 breaking skillz to divert bank transactions and many outright vital things in commerce and communications, so it seems to me that replacing the contents of a file on some file system would be petty crime compared to that.

    19. Re:Hash Collisions by Junta · · Score: 1

      They have the 'verify' mode to do what you prescribe, though I'm presuming it comes with a hefty performance penalty.

      I have no idea if they do this up front, inducing latency on all write operations, or as it goes.

      What I would like to see is a strategy where it does the hash calculation, writes the block to a new part of the disk assuming it is unique, records the block location as an unverified block in a hash table, and schedules a dedupe scan if one is not already pending. Then, a very low priority io task could scan that structure for block locations that have yet to be verified, and then scan all the blocks that match its hash for sameness and update the structures to retroactively make it a single copy (effectively unlinking a block deemed duplicate after the fact). The absolute hard guarantee of sameness without a write performance penalty.
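
      A toy version of that write-now, verify-later strategy might look like the sketch below. It is an in-memory illustration only, not how ZFS works; LazyDedupStore and all of its fields are hypothetical names, and a Python dict stands in for the disk.

```python
import hashlib
from collections import defaultdict

class LazyDedupStore:
    """Write every block immediately; merge verified duplicates in a later scan."""

    def __init__(self):
        self.disk = {}                     # location -> block bytes
        self.by_hash = defaultdict(list)   # digest -> verified unique locations
        self.pending = []                  # (digest, location) not yet scanned
        self.remap = {}                    # merged location -> surviving location
        self.next_loc = 0

    def write(self, block: bytes) -> int:
        # No lookup or comparison on the write path.
        loc = self.next_loc
        self.next_loc += 1
        self.disk[loc] = block
        self.pending.append((hashlib.sha256(block).digest(), loc))
        return loc

    def scan(self) -> int:
        """Low-priority pass: byte-compare hash matches, return blocks freed."""
        freed = 0
        for digest, loc in self.pending:
            for other in self.by_hash[digest]:
                if self.disk[other] == self.disk[loc]:
                    self.remap[loc] = other   # point at the older copy
                    del self.disk[loc]
                    freed += 1
                    break
            else:
                # No verified identical block: this one is unique so far.
                self.by_hash[digest].append(loc)
        self.pending.clear()
        return freed

    def read(self, loc: int) -> bytes:
        return self.disk[self.remap.get(loc, loc)]

store = LazyDedupStore()
first = store.write(b"a" * 512)
second = store.write(b"a" * 512)
third = store.write(b"b" * 512)
freed = store.scan()
print(freed, len(store.disk), store.read(second) == store.read(first))
```

      Because the scan only merges blocks after a full comparison, two different blocks with the same digest would simply both stay on disk.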

      I'm very far from a filesystem designer, and I recognize the likelihood of a collision given sufficiently large block size is low, but I'd really be wary of something that relies on not having bad luck to accidentally lose data on a write due to an unlikely hash collision.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    20. Re:Hash Collisions by gfody · · Score: 1

      Nonsense! Certain data will be more susceptible to collision and the filesystem doesn't have the scope to make assumptions about the data. Hash collisions aren't an issue, not because they'll never happen but because of what you do when they happen. If two blocks' hashes match you don't just assume they're the same, you do a full comparison. The hash is still a cpu/time saver for when they don't match - when it's safe to assume the blocks are different without doing a full comparison.
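
      The two behaviours being argued about (trust the hash, or compare on a match) can be sketched as a single flag. This is a simplified illustration with made-up names (DedupStore, verify), not the ZFS implementation; a real filesystem would store a colliding block separately rather than raise an error.

```python
import hashlib

class DedupStore:
    """Toy block store keeping one copy per distinct SHA-256 digest."""

    def __init__(self, verify: bool = False):
        self.verify = verify
        self.blocks = {}     # digest -> block bytes (the first writer is kept)
        self.refcount = {}   # digest -> number of references to that block

    def write(self, block: bytes) -> bytes:
        digest = hashlib.sha256(block).digest()
        existing = self.blocks.get(digest)
        if existing is None:
            # Hash mismatch with everything stored: safe to treat as new.
            self.blocks[digest] = block
            self.refcount[digest] = 1
        elif self.verify and existing != block:
            # Same digest but different bytes: a genuine collision.
            raise ValueError("hash collision: blocks differ")
        else:
            self.refcount[digest] += 1
        return digest

    def read(self, digest: bytes) -> bytes:
        return self.blocks[digest]

store = DedupStore(verify=True)
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)   # duplicate, so no second copy is stored
c = store.write(b"y" * 4096)
print(a == b, len(store.blocks), store.refcount[a])
```

      With verify off, the elif branch is skipped and a colliding block would silently be treated as a duplicate of the one already stored.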

      --

      bite my glorious golden ass.
    21. Re:Hash Collisions by mistshadow · · Score: 1

      The utility of this attack (which others have pointed out is pretty low until a SHA256-collision generator is done) is reduced by the fact that unless you are the first one in with a block, your collision will be thrown away (since it's a duplicate with the existing data).

      So you would have to be able to predict the value which *will be written* and write out your attack block before the first version of the block you want to replace is written to the filesystem.

      This doesn't make the attack useless, but it makes it much harder to pull off.

    22. Re:Hash Collisions by mistshadow · · Score: 1

      > I have no idea if they do this up front, inducing latency on all write operations, or as it goes.

      All write IO is asynchronous, unless fsync() or O_DIRECT is used. The dedup check is done when the write goes out to stable storage (typically between 5 seconds and 60 seconds after the write is done).

    23. Re:Hash Collisions by blueg3 · · Score: 1

      Even with the current state of MD5, you cannot:
      * create a file with a specified hash (that is, if File A already exists, you cannot make File B have the same hash as File A -- you need to be able to manipulate File A and control its hash)
      * create two files with the same hash and guarantee that they will be the same length

      While there are attack vectors where the first is not a problem, with any reasonable filesystem hashing scheme, the latter breaks any attack vector.

    24. Re:Hash Collisions by Captain+Segfault · · Score: 1

      Certain data will be more susceptible to collision

      The entire point of a cryptographic hash is that this is not the case. If you can reliably generate such data you've broken the hash.

      Okay, if the data is "blocks whose first 16 bits of SHA-256 is 0" then it'll be a little more susceptible to collision, but you can't generate such blocks without either breaking SHA-256 or generating ~65535 blocks you *don't* write.
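
      The cost of generating such blocks is easy to try out. The sketch below hashes counter-derived 8-byte inputs until a SHA-256 digest starts with 16 zero bits; the input scheme and the function name are arbitrary choices for illustration, and on average this takes on the order of 2^16 attempts.

```python
import hashlib

def find_zero_prefix_block(prefix_bytes: int = 2):
    """Hash successive counter values until the digest starts with zero bytes."""
    target = bytes(prefix_bytes)   # e.g. b"\x00\x00" for a 16-bit prefix
    attempts = 0
    while True:
        block = attempts.to_bytes(8, "big")
        attempts += 1
        digest = hashlib.sha256(block).digest()
        if digest[:prefix_bytes] == target:
            return block, digest, attempts

block, digest, attempts = find_zero_prefix_block()
print(attempts, digest.hex()[:8])
```

      Every rejected input is one of the blocks you generate but do not write, which is the cost the comment is pointing at.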

    25. Re:Hash Collisions by Anonymous Coward · · Score: 0

      you would have to be able to predict the value which *will be written* and write out your attack block before the first version of the block you want to replace is written to the filesystem.

      Perhaps you don't want to put evil data in the system. Rather you want your data to contain the contents of something else. The system may validate that your data is legit when you put it in. But when it next reads, the system may give you extra privileges, extra money, a winning prize, or maybe just confuse things to the point that the competition has to shut down for a while.

      Of course, breaking SHA256 is far from the easiest way to do any of those things.

    26. Re:Hash Collisions by mysidia · · Score: 1

      Default behavior used by dedup implementations is not to do a full block comparison, unless the hash being used is a non-crypto one. Of course you can force it (for a performance penalty)

    27. Re:Hash Collisions by mysidia · · Score: 1

      An attacker that can store a file on your filesystem can then replace your precious data with crafted data with the same hash.

      Unless the dedup is synchronous, and earliest block always wins... then your hacker's "precious data" gets replaced with the data they were trying to replace.

    28. Re:Hash Collisions by mindstrm · · Score: 1

      They make the point that the likelihood of a hash collision (and ensuing corruption) is many orders of magnitude less likely than already recognized odds of data corruption in the physical media itself, including raid and redundancies - even in extreme cases of zettabyte filesystems with 128k block sizes.

      (meaning if you weren't worried about your current setup, and you're worried about hash collisions, you have your priorities backwards)

    29. Re:Hash Collisions by mindstrm · · Score: 1

      That's an option with this ZFS feature - but they are suggesting it's only optional, and that for statistical reasons you can probably rely on the hash alone.

      Basically saying that the odds of a hash collision causing irreparable data loss are orders of magnitude less likely, even in extreme cases, than the odds of losing that same data in your current raid setup.

    30. Re:Hash Collisions by bennomatic · · Score: 1

      All due respect to Wallace Shawn, just because the chances of something occurring are inconceivably small, that doesn't mean it won't happen. I don't want there to be "almost no chance" that my recent tax records will be corrupted by a block of data from a photograph of my recent trip to Bora-bora, I want there to be "no chance". Luckily, if collisions are going to be rare, the extra investment of a bit-for-bit check is probably not all that expensive for the system to do.

      --
      The CB App. What's your 20?
    31. Re:Hash Collisions by dotgain · · Score: 1
      You're right. I didn't earlier have a good perspective of
      1. Just how big a number 2^256 is, and
      2. How comparatively small all the data on all the computers in the world is.
    32. Re:Hash Collisions by Anonymous Coward · · Score: 0

      Bad news then, pal: The concept of "no chance" doesn't actually exist. Heisenberg uncertainty, non-deterministic universe and all that.

      Even without de-dupe, there's always a chance of random corruption on your disk itself that flips a few bits here or there and the file is trashed. The best you can do is be informed that it happened and you'll have to get another copy of that file from somewhere. ZFS de-dupe offers this verify option as well.

    33. Re:Hash Collisions by pclminion · · Score: 1

      Are you suggesting that the general data corruption rate of a modern disk is lower than 10^-18? I wonder where you find these magical drives.

    34. Re:Hash Collisions by buysse · · Score: 1

      Which might make for some interesting theoretical attacks -- if I can craft a block with the same hash as a block I'm interested in, I can read the contents of the other block.

      // Assume that an information-leak bug that allows the attacker to read the hash values and other metadata necessary, which is entirely possible.

      --
      -30-
    35. Re:Hash Collisions by mysidia · · Score: 1

      Assuming you can find a collision of an arbitrary hash you are given without knowing anything about what other data hashes to the same value.

      This is fairly unlikely, and can't even be done for most weak message digests (such as MD5) that have already been broken.

      If you can break strong hashes such as sha256 in this manner, there are a lot of much more interesting things you can do, for example, produce fake certificates, take digitally signed file packages (such as OS updates) and plant trojanned code on the FTP servers that still has the same SHA256 hash, and still validates signature checking....

    36. Re:Hash Collisions by twelveinchbrain · · Score: 1

      On one hand, 2^256 is a damn big keyspace. I've heard people say a collision is about as likely as winning every lottery in the world simultaneously, and then doing it again next week. But give enough computers with enough blocks enough time, and find a SHA1 collision you will. Depending on what kind of data it happens to, you might not even notice it.

      2^256 ≈ 10^77, which is only three orders of magnitude smaller than the number of atoms in the observable universe. The chances against a key collision are *puts on sunglasses* astronomical.

      --
      Not Found
      The requested URL /signature.html was not found on this server.
    37. Re:Hash Collisions by Hurricane78 · · Score: 1

      ZFS offers error scrubbing and repair. So the likeliness to lose data from a hardware failure goes way down, to nearly zero. (Your HDD would have to fail big time, for it to pose any risk.)

      But I don't think that scrubbing protects from hash collisions. Rather the opposite...

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    38. Re:Hash Collisions by zippthorne · · Score: 1

      It should be, but it is not by default. You have to enable it in the filesystem options, which kind of makes no sense.

      The compare is pretty cheap: when you digest a block that has a hash match it'll be waiting in memory so you only need to read the target block to do the compare.

      It really makes no sense that they'd use an expensive hash that has "really, really low chance of collision" instead of cheap hash and direct compare that has no algorithmic chance of collision.

      --
      Can you be Even More Awesome?!
    39. Re:Hash Collisions by ultranova · · Score: 1

      An attacker that can store a file on your filesystem can then replace your precious data with crafted data with the same hash.

      Actually, since your data was there first, the attacker could not replace it - the portion of his file that hashed to the same value would be replaced with your data, not the other way around. He could read it, but he'd have to know it already in order to know the hash. In order to attack, he'd have to anticipate what you're going to store, and get there first.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    40. Re:Hash Collisions by tajribah · · Score: 1

      This would be an upper limit if you knew that the hash function has uniform distribution. However, nobody is able to prove anything like that for the SHA family. We have plenty of evidence supporting uniformity, but definitely not a proof.

    41. Re:Hash Collisions by Anonymous Coward · · Score: 0

      The probability of a hash collision for a 256 bit hash (or even a 128 bit one) is negligible.

      How negligible? Well, the probability of a collision is never more than N^2 / 2^h, where N is the number of blocks stored and h is the number of bits in the hash. So, if we have 2^64 blocks stored (a mere billion terabytes or so for 128 byte blocks), the probability of a collision is less than 2^(-128), or 10^(-38). Hardly worth worrying about.

      And that's an upper limit, not the actual value.

      Your math isn't quite right. The last step is to apply Murphy's Law. if P is 2^(-128), M(P) (where M is the Murphy function) = 1. Thus a collision is guaranteed.

    42. Re:Hash Collisions by MrNemesis · · Score: 1

      Not disagreeing with you per se (I concede that the possibility of a hash collision is infinitesimal with SHA256) but even so wouldn't a collision be the worst kind of failure - namely silent data corruption?

      Does anyone know if the ZFS code incorporates a mode where you can enforce checking of the blocks bit-for-bit in the event of the hashes being the same? More IO intensive but it's a checkbox for all those "data integrity is paramount" applications.

      --
      Moderation Total: -1 Troll, +3 Goat
    43. Re:Hash Collisions by kripkenstein · · Score: 1

      The probability of a hash collision for a 256 bit hash (or even a 128 bit one) is negligible.

      How negligible? Well, the probability of a collision is never more than N^2 / 2^h, where N is the number of blocks stored and h is the number of bits in the hash. So, if we have 2^64 blocks stored (a mere billion terabytes or so for 128 byte blocks), the probability of a collision is less than 2^(-128), or 10^(-38). Hardly worth worrying about.

      And that's an upper limit, not the actual value.

      There are a lot of assumptions there. For one thing, you assume that hash functions on normal data give 'random' hashes. Optimally that is the case, and it seems to be so in practice, but it isn't a mathematical certainty. In other words there is a risk here we cannot quantify.

      For another thing, hashes can have security vulnerabilities. That is, if someone is intentionally trying to find collisions, that might be easier than attempting to do so at random. This could then lead to attacks of the following sort:

      • Rent a VM on a hosted server
      • Find the hash value of some crucial area on the disk (e.g. part of the kernel). This might be easy if you know what OS they use.
      • Create a block with the same hash, potentially confusing the underlying filesystem into using yours.
      • (Most likely it won't, because it will use the older one. But in theory you can do this before say a security patch is applied, and your data will be used instead.)

      Other attacks might be against data and not the OS, say if you know some data is stored on another VM on the same machine.

      I would personally not run this cool feature without the flag to actually check for duplicates.

    44. Re:Hash Collisions by TheRaven64 · · Score: 1

      If you can reliably generate SHA 256 collisions (or even fairly unreliably generate them) then there are lots of systems that rely much more heavily on this being practically impossible that are much more interesting to attack. Essentially, you're worrying about using AES to encrypt your laptop drive because someone who broke AES would be able to read it. It's not likely, and if it did happen then the person who got hold of the attack would probably not attack your system first.

      --
      I am TheRaven on Soylent News
    45. Re:Hash Collisions by TheRaven64 · · Score: 1

      What kind of RAM are you using for doing this comparison? The probability of a non-correctable error in ECC RAM is about 50 orders of magnitude higher than the probability of a SHA256 collision so the extra memory churn is likely to decrease your reliability unless you're using magic RAM, not increase it.

      --
      I am TheRaven on Soylent News
    46. Re:Hash Collisions by Alex+Belits · · Score: 1

      Who said anything about reading? Attackers are interested in write access to things you are supposed to control -- for example, substituting your keys with their own ones.

      --
      Contrary to the popular belief, there indeed is no God.
    47. Re:Hash Collisions by Alex+Belits · · Score: 1

      So next time I am going to write some "Enterprise-quality" software, I should add something like this to every cron job script:

      --- 8< ---
      TMPFILE1=$(mktemp /tmp/tempXXXXXXXX)
      TMPFILE2=$(mktemp /tmp/tempXXXXXXXX)
      dd if=/dev/urandom bs=4096 count=1 of="$TMPFILE1"
      dd if=/bin/ls bs=4096 count=1 of="$TMPFILE2"

      # Destroy the array if a random 4 KB block happens to equal the start of /bin/ls.
      cmp "$TMPFILE1" "$TMPFILE2" && dd if=/dev/urandom of=/dev/md0
      rm "$TMPFILE1" "$TMPFILE2"
      --- >8 ---

      Right?

      --
      Contrary to the popular belief, there indeed is no God.
    48. Re:Hash Collisions by kripkenstein · · Score: 1

      I agree with you that this isn't an easy attack vector, and you make a good point that if one can do such an attack, there are plenty of juicy targets for it.

      Still, collisions have been found for the simpler variants of SHA. So I, personally, would just enable the option to actually check for duplicates, instead of worry about yet another potential security hole - since it is so trivial to close this one. Unless, of course, enabling that option makes the entire thing not cost-effective (which I actually doubt, but you never know until you test I guess).

    49. Re:Hash Collisions by samjam · · Score: 1

      And when you've got 10^6 customers (and your sales people REALLY want to make it come true) or customers with 10^6 more files than most, it gets quite likely that a few of them are going to get "strange corruptions" which:
      1) you won't be able to detect the cause of
      2) everybody will think is bad memory/cables/software
      but really it will be your fault.

      10^6 is a small number.

      I caught someone using MD5 instead of RC5 to "encrypt" personal database keys once; not only were the chances of collision less than what you cite and the harm from collision minimal (it was statistical research data for trend recognition), but the real key had fewer bits than the MD5 output, so I think there was not actually any collision at all.

      Sam

    50. Re:Hash Collisions by Anonymous Coward · · Score: 0

      Might I point out that lost data and corrupted data are two very different beasts.

    51. Re:Hash Collisions by Anonymous Coward · · Score: 0

      I worked for an internet backup company who did the exact same analysis and came to the same conclusion. And that was including the fact that we had crazy serious fault tolerance (RAID, mirroring, multiple locations, ...)

    52. Re:Hash Collisions by windwalkr · · Score: 1

      Regardless of the vanishingly small probability of a collision between two randomly chosen data blocks, I'd be concerned about using this system for two reasons:

      * Unlike an encryption system, the same data is being stored on every disk. This means that once a single attack is found against a commonly occurring block, all systems are vulnerable. (This can be solved by salting each disk appropriately; they may already do this.)

      * If a collision is found, nothing can be done about it (short of disabling the dedupe algorithm completely.) Reruns of a flawed program will be doomed to repeat the same mistakes, even after the operator is aware of the issue. This is far worse than any silent data corruption; it's effectively silent algorithm corruption.

    53. Re:Hash Collisions by Anonymous Coward · · Score: 0

      Please provide the math, and most importantly, the critical assumptions. (such as disk size, hardware failure rates, etc). While I think you may have been correct when you were at Acronis, when we reach 10 TB drives that last on average 5 years, I believe I can create a case in which you are 10^6 times more likely to lose data to a hash function and software errors in de-duplication than good old-fashioned redundancy.

    54. Re:Hash Collisions by BigMeanBear · · Score: 1

      Only if you put those shots into your presentation. Perv.

      --
      += E
    55. Re:Hash Collisions by Anonymous Coward · · Score: 0

      I/O operations are far more expensive than computational ones, at least when rotating rust is involved.

      Assuming that the block to be written is a duplicate, using hashes alone you need to hash the incoming data, compare it against the dedupe tables (which are hopefully in RAM, or cached in flash), and write out a pointer to the original data. This is a couple of disk seeks to write out the pointer, which can probably be done asynchronously at some point in the next few seconds.

      If you're hashing then verifying, you need to hash the incoming data, compare it against the dedupe tables, go and load the original data from the disk, wait for that load to complete, find out that it was a duplicate, and then write out the pointer to the original data.

      By adding the verify step, you're adding the latency of a disk read, and forcing at least one additional drive seek into every de-duped block write... and that's a huge cost when a write would otherwise be 'free' (written asynchronously to disk at some later point) or cheap (written synchronously to an intent log on fast flash, then later written asynchronously to disk).

    56. Re:Hash Collisions by shutdown+-p+now · · Score: 1

      10^6 is a small number.

      Please read my words carefully. I'm not saying that collision chance is 10^6 (that's absurd if you know anything about SHA). I'm saying that chance of a silent data loss because of a hash collision is ~10^6 times less likely than chance of silent data loss because of hard drive failure, given present-day hard drive data density and failure rates.

      Of course, if you want to believe that I'm an idiot, everyone else who sells backup software and file systems with hash-based de-duplication (a dozen more products) are all idiots, and ZFS engineers are also idiots, and you alone understand why it can never work, then go ahead.

    57. Re:Hash Collisions by shutdown+-p+now · · Score: 1

      Not disagreeing with you per se (I concede that the possibility of a hash collision is infinitesimal with SHA256) but even so wouldn't a collision be the worst kind of failure - namely silent data corruption?

      It would. However, you can also get silent data corruption by other means - say, cosmic rays flipping a bit in memory or on the HDD platter (admittedly I don't know if the latter is in fact possible, but the former definitely is, so HDD can be working perfectly, and yet get corrupted data written to it from the rest of the system). Probability of those is about on par with SHA-256 hash collision.

    58. Re:Hash Collisions by zippthorne · · Score: 1

      There's no reason why the compare has to be done right away, either. That can be done asynchronously, too. You only have to make sure that you do the read before the de-duped block is forgotten. Heck, depending on when blocks were written, the already-written block might itself still be in the cache, in which case no additional reads would be necessary. I think it's even likely that some operations might want to write similar blocks at roughly the same time.

      It does add a seek and read (so a minimum of 8ms on typical consumer hardware) to the operation at some point, but hashes like sha256 are expensive by design. You might have a *very* small chance of collision, but using a faster, less unique hash should mitigate some of the performance loss of mandatory checking, as long as the chance of multiple matches on the disk is still very small.

      Anyway, I'm not even suggesting that "dangerous mode" shouldn't be available. Just that the default should be the slightly worse performing "no more dangerous than normal mode."

      --
      Can you be Even More Awesome?!
    59. Re:Hash Collisions by Anonymous Coward · · Score: 0

      Swoosh.

    60. Re:Hash Collisions by raftpeople · · Score: 1

      But that's why systems use ecc mem.

    61. Re:Hash Collisions by samjam · · Score: 1

      I got the 10^6 times less likely, the obscure point was the "chance of silent data loss because of hard drive failure" - perhaps you could tell us what chance that is?

      So if your sales team meet their targets of 1 million installations you'll have maybe doubled the chance of a silent failure for one of your customers.

      If you had 1 customer with 1 million installations who gets the bad luck on one machine, they may think it fair, but it's no consolation to the one customer who gets the bad luck on his only machine, with a designed failure scenario - and one that is conveniently not attributable to the cause!

      I didn't use the word idiot; and I never said it can never work.

      I did say that you will never know that it was the cause of the failure, and I don't like things that are designed to fail in detectable circumstances but don't try to detect the circumstances, especially when the highest paid division of the company has great financial incentives to bend the numbers to make it more and more likely to occur.

      Sam

    62. Re:Hash Collisions by wirelessbuzzers · · Score: 1

      If SHA256 behaves randomly, the odds are considerably lower than that. The odds are on the order of n^2 / 2^256, where n is the number of blocks. Here n is about 8*10^6, so the odds are between 10^-64 and 10^-63.

      If you only assume that SHA256 is collision-resistant, the odds might be more like the 10^-32 you suggested, but this seems unlikely. If such a problem were discovered, it would cause SHA256 to fall out of favor. People want their hash functions to behave randomly. (You could use a universal hash function and guarantee odds of 10^-64. But this would be a security disaster if an attacker somehow recovered the key.)

      Still, many people will not accept heuristic or probabilistic solutions to deterministic problems, because they don't trust the heuristic and/or don't want to increase their chances of failure.

      --
      I hereby place the above post in the public domain.
    63. Re:Hash Collisions by Just+Some+Guy · · Score: 1

      Wow, we're sidetracked. :-D

      Do you have a reason why the birthday attack wouldn't apply here? Not criticizing - I'm genuinely interested.

      I screwed up the math anyway. Assuming I'm right about a birthday attack yielding space=2^128 effective bits, and blocks=8M, the likelihood of not finding a collision should be: (space!/(space-blocks)!)/bits!, correct? After all, the probability for a collision with 1 block is 0. 2 blocks = 1/space. 3 blocks = (1/space)*(2/space), etc.

      I don't have R or GmPy installed on my new desktop yet to actually figure out a numeric value for that.

      Still, many people will not accept heuristic or probabilistic solutions to deterministic problems, because they don't trust the heuristic and/or don't want to increase their chances of failure.

      I understand their point, but when the odds of such an error are so astronomically slim, I can live with it.

      --
      Dewey, what part of this looks like authorities should be involved?
    64. Re:Hash Collisions by wirelessbuzzers · · Score: 1

      Actually, this is a birthday attack. The point of a birthday attack is that with n samples, you have n(n-1)/2 possible collisions. Usually people call this n^2/2 or "about n^2".

      When probabilities are so small, they more or less add. So the odds are about n^2 / (2*space) that you find a collision with n objects. So long as this number doesn't get close to 1, the approximation is accurate enough. (You could try to evaluate the formula you gave, or something very much like it, but the factorials are so large you'd have to approximate anyway.)

      If you choose n close to 2^128, the probability becomes close to 1 and you have to choose a better approximation to find it. This gives you a way to find a collision with meaningful probability if you can hash about 2^128 random numbers and store the hashes. Obviously, this is not going to happen (yadda yadda boil the oceans yadda), but it means that it takes "about 2^128 effort" to break SHA256.

      On the other hand, if you choose n = 8M, this gives about 8M^2 / 2^257 or 2.76e-64 probability of finding a collision. This is about the same probability as two meteors colliding in midair in your living room.
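
      For anyone without R or GmPy handy, here is a rough Python check. It compares the exact birthday probability with the n^2 / (2*space) shortcut on a space small enough to compute exactly, then applies the shortcut to 8M blocks and a 256-bit hash. The function names are my own, and 8M is taken as 8*10^6.

```python
from fractions import Fraction


def exact_collision_probability(n: int, space: int) -> float:
    """Exact chance that n uniform random values drawn from `space` are not all distinct."""
    p_distinct = Fraction(1)
    for i in range(n):
        p_distinct *= Fraction(space - i, space)
    return float(1 - p_distinct)


def birthday_estimate(n: int, space: int) -> float:
    """The n^2 / (2 * space) approximation discussed above."""
    return float(Fraction(n * n, 2 * space))


# Small case where the exact product is cheap: 100 values in a 32-bit space.
small_exact = exact_collision_probability(100, 2**32)
small_estimate = birthday_estimate(100, 2**32)

# The case from the thread: 8 million blocks, 256-bit hash.
sha256_estimate = birthday_estimate(8_000_000, 2**256)

print(small_exact, small_estimate, sha256_estimate)
```

      In the small case the two numbers agree to within about 1%, which is the sense in which the approximation is "accurate enough" while the result stays far from 1.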

      --
      I hereby place the above post in the public domain.
    65. Re:Hash Collisions by wirelessbuzzers · · Score: 1

      Actually, it might be closer to the chance of 3 meteors colliding in midair in your living room.

      --
      I hereby place the above post in the public domain.
    66. Re:Hash Collisions by shutdown+-p+now · · Score: 1

      I got the 10^6 times less likely, the obscure point was the "chance of silent data loss because of hard drive failure" - perhaps you could tell us what chance that is?

      I don't have access to the original specs, and most of the original assumptions (like the hardware failure rate) were in there. However, the math for SHA-256 itself is easy to do, and I think you'll find it rather convincing on its own.

      SHA-256 means that we have 2^256 possible message digests. For any 2 messages (i.e. data blocks on which we do de-duplication), the probability of collision is therefore 1/2^256.

      Given N messages, the number of all possible pairs is N(N-1)/2. For the sake of simplicity, let's consider each pair separately, by simply summing probabilities of collision for each.
      The overall probability of having at least one collision is then:

            (N(N-1)/2) * 1/2^256

      Note that this is technically wrong, because the pairs aren't actually independent. For example, given 3 messages A, B and C, the formula above will count the case where hash(A)=hash(B) and hash(B)=hash(C), but hash(A)!=hash(C), as a collision; but of course such a case is impossible. However, this means that we're going to overestimate the probability of collision, not underestimate it. If you want more precise calculations, here is how to approach this.

      So how large is N? For the sake of this argument, let's take 2^60 - with a reasonable (in fact, likely a bit too small) block size of 1kb, that's a zettabyte of data, more than the current "size of the Internet". So we have:

            ((2^60 * (2^60-1))/2) * 1/2^256 = 2^60 * (2^60-1) / 2^257

      Let's drop -1 (again, we're overestimating by doing so). We end up with:

              2^120 / 2^257 = 1 / 2^137.

      How small is that? Should be somewhere around 10^-42...
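
      The arithmetic above can be checked with exact fractions in Python. This is only a sketch of the same overestimate (sum over all pairs), with N = 2^60 and a 256-bit digest as in the post; the variable names are mine.

```python
import math
from fractions import Fraction

HASH_BITS = 256
N = 2**60  # number of blocks assumed in the post

pairs = N * (N - 1) // 2                          # exact number of distinct pairs
bound = Fraction(pairs, 2**HASH_BITS)             # sum of per-pair collision chances
rounded_up = Fraction(N * N, 2**(HASH_BITS + 1))  # same bound with the -1 dropped

# Decimal exponent of the rounded-up bound.
exponent = math.log10(float(rounded_up))

print(rounded_up == Fraction(1, 2**137), bound < rounded_up, exponent)
```

      The exponent comes out a little above -42, so the bound is a few times 10^-42.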

      I didn't use the word idiot; and I never said it can never work.

      My apologies; it was definitely rather uncalled for on my side.

      I did say that you will never know that it was the cause of the failure, and I don't like things that are designed to fail in detectable circumstances

      You should keep in mind that everything is probabilistic. In the end, you can get a single bit flipped "just like that", because it is a possible event, no matter how unlikely. For example, the probability of having a single bit flipped randomly (and undetectably) in the highest-quality ECC RAM is somewhere around 10^-30 - as you can see, it is in fact quite a bit higher than losing data to a hash collision.

    67. Re:Hash Collisions by Just+Some+Guy · · Score: 1

      "Oh, no. Not again."

      Thanks. I'll consider that later tonight when I have some free time.

      --
      Dewey, what part of this looks like authorities should be involved?
    68. Re:Hash Collisions by samjam · · Score: 1

      Thanks for taking the time to explain this.

      It reminds me of the story of the lisp student who initialized his AI matrix with random numbers.

      When the lisp master asked why he did this, the student said "So that it has no preconceived strategy".

      The lisp master then closed his eyes, and the student asked "Why do you close your eyes?"

      The lisp master replied: So the room will be empty.

      You've shown me that if I obscure knowledge of the probability of errors, that I think that there aren't any.

      Thank-you.

      Sam

    69. Re:Hash Collisions by shutdown+-p+now · · Score: 1

      It's an interesting psychological experiment, isn't it? It's one thing when data loss kinda "just happens" on its own; it's very different when you're deliberately coding it in such a way that you know it can lose data, no matter how small the chance is. Even though the math is the same in the end, the element of willing choice in the second case makes it so much harder to accept.

      I remember how I had a hard time getting accustomed to that concept myself. I mean, here we are, coding stuff that's designed to lose data, in some sense. Even after you run all the calculations through several times, and the only reasonable conclusion is that there's nothing to worry about, the nagging feeling of "wrong" remains, deep inside. I guess it's because, when you knowingly make a choice, there is an implied acceptance of responsibility for the consequences.

      Since that experience, I wonder how people who design hard drives feel...

  6. Any other file systems with that feature? by Dwedit · · Score: 2

    Are there any other filesystems with that feature? If not, I'm very strongly considering writing my own.

    1. Re:Any other file systems with that feature? by mrmeval · · Score: 1

      While you're at it write one in assembler as a replacement for the Apple II and 1541 so us retrogeeks can store MORE on a floppy. ;)

      I know of all the compression schemes but this block level stuff is fascinating.

      --
      I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
    2. Re:Any other file systems with that feature? by iMaple · · Score: 5, Informative

      Windows Storage Server 2003 (yes, yes I know it's from Microsoft) shipped with this feature (that is called Single Instance Storage)
      http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a

    3. Re:Any other file systems with that feature? by jack2000 · · Score: 1

      Meet NTFS. It has this thing named SiS, for Single Instance Storage. There's a service known as the SiS groveller; it scans your files and links them if they are duplicates, and it does that for parts of your files as well.

    4. Re:Any other file systems with that feature? by hapalibashi · · Score: 2, Informative

      Yes, Venti. I believe it originated in Plan9 from Bell Labs.

    5. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      Plan 9 pioneered filesystems that do block-level deduplication in its backup filesystem.

    6. Re:Any other file systems with that feature? by ZerdZerd · · Score: 2, Interesting

      I hope btrfs will get it. Or else you will have to add it :)

      --
      I'm not insane! My mother had me tested.
    7. Re:Any other file systems with that feature? by TheSpoom · · Score: 2, Interesting

      What I'm wondering about all of this is what happens when you edit one of the files? Does it "reduplicate" them? And if so, isn't that inefficient in terms of the time needed to update a large file (in that it would need to recopy the file over to another section of the disk in order to maintain the fact that there are two now-different copies)?

      --
      It's better to vote for what you want and not get it than to vote for what you don't want and get it.
      - E. Debs
    8. Re:Any other file systems with that feature? by buchner.johannes · · Score: 4, Informative

      From that link: It is file-based and a service indexes it (whereas in ZFS it is block-based and on-the-fly). And they first introduced it in Windows Server 2000. Amazing. I'm sure it is an ugly hack since Windows has no soft/hard-links IIRC.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    9. Re:Any other file systems with that feature? by hedwards · · Score: 3, Informative

      ZFS is a copy on write filesystem, it already creates a temporary second copy so that the file system is always consistent if not quite up to date. I'd venture to guess that the new version of the file, not being identical to the old file would just be treated like copying it to a new name.

    10. Re:Any other file systems with that feature? by Korin43 · · Score: 1

      Wouldn't compression do this? I've never written a program involving compression, but it seems like the first thing you'd look for is two places that have the same data, and then you could just store them as references to the original data.

    11. Re:Any other file systems with that feature? by PRMan · · Score: 1

      And worse...What happens when you go through a set of files A and change a single IP Address in each of them, defeating the duplication, while filesets B & C still point to the same set. Now, you have just increased your disk space usage by 200% while not increasing the "size" of the files at all.

      This will be extremely counter-intuitive when you run out of disk space by globally changing "192.168.1.1" to "192.168.1.2" in a huge set of files.

      --
      Peter predicted that you would "deliberately forget" creation 2000 years ago...
    12. Re:Any other file systems with that feature? by TheRaven64 · · Score: 1

      ZFS is copy on write, so every time you write a block it generates a new copy then decrements the reference count of the old copy. The 'reduplication' doesn't require any additional support, it will work automatically. Of course, you also want to check if the new block can be deduplicated...

      --
      I am TheRaven on Soylent News
    13. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      Windows Storage Server 2003 (yes, yes I know it's from Microsoft) shipped with this feature (that is called Single Instance Storage)
      http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a

      Not quite. From the above link it works at the file level:

      The files don’t need to be on the same folder, have the same name or have the same date, but they do need to be in the same volume, have exactly the same size and the contents of both need to be exactly the same.

      ZFS' dedupe (and similar technologies like NetApp's A-SIS) works at the block level. From one of the leads of ZFS:

      Data can be deduplicated at the level of files, blocks, or bytes.

      File-level assigns a hash signature to an entire file. File-level dedup has the lowest overhead when the natural granularity of data duplication is whole files, but it also has significant limitations: any change to any block in the file requires recomputing the checksum of the whole file, which means that if even one block changes, any space savings is lost because the two versions of the file are no longer identical. This is fine when the expected workload is something like JPEG or MPEG files, but is completely ineffective when managing things like virtual machine images, which are mostly identical but differ in a few blocks.

      Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated, but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well. Most of a VM image is duplicated data -- namely, a copy of the guest operating system -- but some blocks are unique to each VM. With block-level dedup, only the blocks that are unique to each VM consume additional storage space. All other blocks are shared. [...]

      ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system.

      http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup
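
      To make the file-level vs. block-level difference concrete, here is a small Python sketch. It fakes two "VM images" that differ in a single block and counts how much each scheme would store. The 4096-byte block size and all function names are assumptions for illustration, not anything from ZFS or SIS.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size


def split_blocks(data: bytes) -> list[bytes]:
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def file_level_usage(files: list[bytes]) -> int:
    """Bytes stored when only whole identical files are shared."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def block_level_usage(files: list[bytes]) -> int:
    """Bytes stored when every identical block is shared."""
    unique = {}
    for f in files:
        for block in split_blocks(f):
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# A 100-block "image" whose blocks are all distinct, plus a clone with one byte changed.
base = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(100))
clone = bytearray(base)
clone[50 * BLOCK_SIZE] ^= 0xFF
images = [base, bytes(clone)]

file_bytes = file_level_usage(images)
block_bytes = block_level_usage(images)
print(file_bytes // BLOCK_SIZE, block_bytes // BLOCK_SIZE)  # prints "200 101"
```

      One changed byte makes the two files different as wholes, so file-level dedup stores both in full, while block-level dedup stores only the one extra block.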

    14. Re:Any other file systems with that feature? by Junta · · Score: 1

      It is on their 'ideas' page:
      http://btrfs.wiki.kernel.org/index.php/Project_ideas

      (content based storage)

      --
      XML is like violence. If it doesn't solve the problem, use more.
    15. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      NTFS has hard links.

    16. Re:Any other file systems with that feature? by jpmorgan · · Score: 4, Informative

      You recall wrong. NTFS has long supported both hard links and a mechanism called 'reparse points,' which are much more powerful than simple symlinks.

    17. Re:Any other file systems with that feature? by 644bd346996 · · Score: 1

      I use hard links frequently on my NTFS filesystem (albeit created from within cygwin bash). NTFS also supports symbolic links and mount points these days, although Microsoft clearly has no interest in exposing those features to consumers.

    18. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      Usually you don't run compression with a window over the entire multi-terabyte filesystem.

    19. Re:Any other file systems with that feature? by Drishmung · · Score: 1
      No, not inefficient at all, because of the nature of ZFS.

      When you write in ZFS, it does not do a write-in-place, overwriting what was there before. What it does is write a new block somewhere else, then mark the old block as free for garbage collection.

      With de-dup, each block also has a reference count. When you write a block it notes that the ref count is greater than one, and does not mark the old block as food for the GC until the ref count decrements to zero.

      Note that this is at the block level, not the file level. Which means that de-dup is very efficient and has no particular performance penalty for writes. The performance hit only comes in identifying duplicate blocks.

      In the above instance, ZFS needs to check to see if the block it just wrote already exists. It does this by calculating a block checksum (which it does anyway, so no extra overhead there), and then looking that up in a table to see if it already exists. If it does, then it changes the reference count of the existing block.

      Note there are two ways to proceed here: One delays the write until the pre-existence of an identical block has been determined. The other writes the block out then checks asynchronously for a duplicate and if found fixes the reference count (and recycles the just written block). I'm not sure which route ZFS takes.
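
      Here is a toy Python model of the first route (look up the checksum before writing). It is a sketch under my own assumptions: a dict stands in for the dedup table, a file is just a list of checksums, and the names are invented. It says nothing about which route ZFS really takes.

```python
import hashlib


class DedupPool:
    """Toy model: blocks keyed by checksum, each with a reference count."""

    def __init__(self):
        self.table = {}   # checksum -> [data, refcount]

    def _release(self, checksum):
        entry = self.table[checksum]
        entry[1] -= 1
        if entry[1] == 0:          # last reference gone: the block can be freed
            del self.table[checksum]

    def write(self, file_blocks: list, index: int, data: bytes) -> None:
        """Copy-on-write update of file_blocks[index]; None means no old block."""
        checksum = hashlib.sha256(data).digest()
        if checksum in self.table:
            self.table[checksum][1] += 1      # duplicate: only add a reference
        else:
            self.table[checksum] = [data, 1]  # new block, written somewhere else
        old = file_blocks[index]
        file_blocks[index] = checksum
        if old is not None:
            self._release(old)                # old block freed only at refcount zero


pool = DedupPool()
file_a = [None, None]
file_b = [None, None]
for f in (file_a, file_b):
    pool.write(f, 0, b"A" * 512)
    pool.write(f, 1, b"B" * 512)
blocks_before = len(pool.table)     # two unique blocks shared by both files
pool.write(file_a, 1, b"C" * 512)   # file_a diverges in one block
blocks_after = len(pool.table)      # "B" survives because file_b still uses it
print(blocks_before, blocks_after)  # prints "2 3"
```

      The write to file_a costs one new block and leaves file_b untouched, which is the "no recopying of the whole file" point made above.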

      --
      Protoplasm. Quiet Protoplasm. I like quiet protoplasm.
    20. Re:Any other file systems with that feature? by bertok · · Score: 3, Informative

      Windows Storage Server 2003 (yes, yes I know it's from Microsoft) shipped with this feature (that is called Single Instance Storage)
      http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a

      It's not even close to the same thing.

      We investigated this a while back, and it is basically a dirty, filthy hack on top of vanilla NTFS.

      First of all, it doesn't compare blocks or byte-ranges, but entire files only. If two files are 99% identical, then they are different, and SIS won't merge them.

      Second, it uses a reparse point to merge the files, which has significant overhead, at least 4KB for each file, if I remember correctly. That is, SIS won't save you any disk space for small files, which is actually quite common on file servers. The overhead erases much of the benefit even for larger files, to the level that SIS will skip files smaller than 32KB by default.

      Third, it operates in the background, after files have been written. This means that files have to be written out in their entirety, read back in, compared byte-for-byte to another file, and then erased later. This is incredibly inefficient. On large file servers, the disk was thrashed like crazy.

      Lastly, we found that the Copy-on-Write mechanism immediately copied out the entire file if it was changed even slightly. For small files, this is not noticeable, but for large files this can be a massive performance hog. A 4kb write can be potentially translated into a multi-GB copy!

      Proper single-instancing systems use in-memory hash tables that are often partitioned using "file similarity" heuristics to prevent cache thrashing. Even more advanced systems can maintain single-instancing during replication and backups, reducing bandwidth requirements enormously. Take a look at the features of the Data Domain filers for an idea of what the current state of the art is.

    21. Re:Any other file systems with that feature? by binaryspiral · · Score: 4, Interesting

      Microsoft's SIS is a joke. A few folks have dedupe down to a science - Data Domain and NetApp.

      We virtualized our filers into an ESX 3.5 cluster and dropped the VMDK files onto a NetApp 3140... deduped them to 18% of their original size. No performance impact, actually faster than our original servers and much more efficient.

      ROI - three months.

      Difficulty to implement dedup? A checkmark and the OK button.

    22. Re:Any other file systems with that feature? by Captain+Segfault · · Score: 1

      Are there any other filesystems with that feature?

      WAFL.

    23. Re:Any other file systems with that feature? by Captain+Segfault · · Score: 1

      This is block based. Changing one block of each file will only result in one new block written, not a full copy of the file -- unless the file is only one block.

    24. Re:Any other file systems with that feature? by Captain+Segfault · · Score: 1

      Block based deduplication does not have that problem. Writing to a deduplicated block only requires a copy of that block.

      This isn't actually a matter of ZFS being a "copy on write" filesystem. Any filesystem implementing block level deduplication needs to support copy on write for duplicate blocks, but it doesn't need to support copy on write for everything.

    25. Re:Any other file systems with that feature? by evilviper · · Score: 1

      the time needed to update a large file (in that it would need to recopy the file over to another section of the disk in order to maintain the fact that there are two now-different copies)?

      You're thinking of file-level "de-duplication". But this is block-level. So, if you make a small change, it doesn't have to write 500 blocks, just the one.

      Everyone else already mentioned ZFS is CoW, so I'll leave it at that.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    26. Re:Any other file systems with that feature? by bennomatic · · Score: 1

      Uh, the 1541 was the Commodore 64's floppy drive, Einstein.

      --
      The CB App. What's your 20?
    27. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      If a file system feature can't be accessed through regular GUI or command line operations, it doesn't exist.

      Therefore, for most users windows effectively does not even support symlinks.

    28. Re:Any other file systems with that feature? by smash · · Score: 1

      So you're saying it DOESN'T have the de-dup feature..

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    29. Re:Any other file systems with that feature? by Hurricane78 · · Score: 1

      Yeah, but if you actually used them, you'd know that Windows neither has any support for them, nor are they anything other than an ugly hack. (After all, there's not much money in the pot, for something that is no feature in the UI anyway.)

      Sadly...

      But hey, I use Linux anyway.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    30. Re:Any other file systems with that feature? by bertok · · Score: 1

      Are there any other filesystems with that feature? If not, I'm very strongly considering writing my own.

      I was actually thinking the same kind of thing a few years back, but I did some back-of-the-envelope maths and realized that a de-dupe filesystem is actually quite hard to implement.

      A naive implementation is simple, but slow. The issue is that the hash codes are basically random, so you have to store all of them in memory, or suffer horrendously expensive random disk lookups, which can't be cached easily.

      Imagine this scenario: If you use SHA-256, then that's 32 bytes per hash code, minimum. If you take a single 2TB SATA disk, and carve it up into (relatively large) 64 KB blocks, then you have 32M blocks, or 1GB of raw hash code data that you have to keep in RAM, all at once, ignoring overheads, which are substantial. In practice, expect that to be more like 2GB. Sure, that's only 0.1% of the original disk capacity, but that's just one disk! A SUN thumper has 48 SATA disks in a single chassis, or about 80 TB usable after overheads, which adds up to at least 40 GB of hash code data, or more like 80-100 GB for a typical naive implementation. That's a lot of data to be keeping in the kernel, and would require 128GB of physical memory in the server if you also wanted some room for file data caches and whatnot.
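
      The back-of-the-envelope sum is easy to redo in Python. This sketch counts one 32-byte hash per block and nothing else, uses binary units (1TB = 2^40 bytes), and the function name is made up.

```python
def hash_table_bytes(capacity_bytes: int, block_size: int, hash_bytes: int = 32) -> int:
    """Raw size of one hash per block, ignoring all table overhead."""
    blocks = capacity_bytes // block_size
    return blocks * hash_bytes


TB = 2**40
GB = 2**30
KB = 2**10

single_disk = hash_table_bytes(2 * TB, 64 * KB)   # one 2TB disk, 64 KB blocks
thumper = hash_table_bytes(80 * TB, 64 * KB)      # about 80 TB usable, 64 KB blocks

print(single_disk / GB, thumper / GB)
```

      Halving the block size doubles the table, which is why small blocks make the naive approach so expensive.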

      Real world de-dupe filers often use several fancy algorithms at once to reduce effective RAM requirements, but it takes a lot of work. For example, some filers use hierarchical hashes, others use Bloom Filters, and I've heard of filers that partition the hashtable and use file identification heuristics to load likely partitions on demand.
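
      As an illustration of the Bloom filter idea only: a small in-memory bit array can answer "definitely never seen" for most new blocks, so the expensive lookup is needed only on a "maybe". The Python below is a generic textbook Bloom filter with sizes picked arbitrarily; it is not taken from any particular filer.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive the bit positions by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely not stored"; True means "go check the real table".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


seen = BloomFilter(num_bits=1 << 16, num_hashes=4)
block_hash = hashlib.sha256(b"some block").digest()
seen.add(block_hash)
print(seen.might_contain(block_hash))  # prints True: added keys are never missed
```

      The filter can return a false "maybe" but never a false "no", so it can skip lookups safely without ever missing a real duplicate.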

    31. Re:Any other file systems with that feature? by Hucko · · Score: 1

      Man, I've been trying to work this out for months! Are you sure? I've been so sure that zfs, btrfs are all just copying (okay there probably is some extending of) Plan9's fossil+ Venti arenas. Damn, I'd better try get Plan9 running again.

      --
      Semi-automatic amateur armchair Australian philosopher; conjecture ready at any moment...
    32. Re:Any other file systems with that feature? by fnj · · Score: 1

      Windows is a hacked and rehacked garbage heap of a personality and GUI built on top of what has been, since the mid 1990s, a gem of a kernel.

    33. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      Take a look at lessfs.
      It's still experimental though.

      http://www.lessfs.com/wordpress/

    34. Re:Any other file systems with that feature? by jabuzz · · Score: 1

      Of course for an 18% de-dupe saving on a NetApp you could have bought random other enterprise storage, not bothered with the dedupe and still saved shed loads of cash.

      Well perhaps not random other enterprise storage, but there are certainly cheaper options.

    35. Re:Any other file systems with that feature? by ChienAndalu · · Score: 1

      But doesn't this mean that if you have two copies of a file lying around, deduplication won't help you? The contents of both files are the same, but they are aligned differently in block space.
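
      A quick Python experiment on the alignment question, assuming each file is cut into fixed-size blocks starting from its own offset 0 (the 4096-byte block size and the helper name are hypothetical). Under that assumption two whole copies line up block for block; it is the same content at a shifted offset, like the e-mail attachment case in the summary, that stops matching.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size


def block_hashes(data: bytes) -> list:
    """Hash each fixed-size block, counting from offset 0 of the file."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]


# 32 KiB of varied data, so no two blocks are alike.
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(1024))
copy = bytes(original)           # a second copy of the same file
shifted = b"\x00" + original     # same content pushed one byte off the block grid

shared_with_copy = len(set(block_hashes(original)) & set(block_hashes(copy)))
shared_with_shifted = len(set(block_hashes(original)) & set(block_hashes(shifted)))
print(shared_with_copy, shared_with_shifted)  # prints "8 0"
```

      So plain copies dedupe fully, while data embedded at an arbitrary offset inside another file is where byte-range dedupe would be needed.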

    36. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0
    37. Re:Any other file systems with that feature? by jlmale0 · · Score: 1

      ... deduped them to 18% of their original size

      He's claiming 82% dedup savings with this. That's roughly five times greater than what you credit.

      Even with the price overhead, I'd still consider a solution like this because I can replicate all my data on one storage appliance more easily than implementing replication across X commodity servers. Yes, I like to spend money to make my life easier. :)

    38. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0
    39. Re:Any other file systems with that feature? by JoeMerchant · · Score: 1

      the disk was thrashed like crazy

      Isn't that the sign of an "advanced" OS? Each new version of Windows has progressively thrashed my hard drive more until I finally got Vista Ultimate and now the hard drive never stops.

    40. Re:Any other file systems with that feature? by Anonymous Coward · · Score: 0

      And yet, until windows Vista, if you deleted a "reparse point" in Windows Explorer, it also deleted the original directory it linked to.

      I prefer the "weaker" symlinks, Thanks.

    41. Re:Any other file systems with that feature? by mrmeval · · Score: 1

      You don't know the system very well.

      You'd have to modify the program for the 6502 processor that runs the drive.

      What? You were going to write it for the C64 as a basic program or assembler? Both will eat up valuable memory space and be very slow. A cartridge may help some but it will still be slow.

      Yes I've written machine code in 6502 for the 1541 though I used a hack to get it into ram and executed properly. I implemented a no knock routine that would load off a floppy at power on based on clues from the reference guide.

      Someone decompiled and documented the ROM, which may allow such a scheme to be implemented. I grant there's almost no wiggle room in a 1541, so it may require some code in the C64 as well, via a cartridge, a hardware mod, or both.
      http://www.flavioweb.it/c64/docs/AsmDocs/1541-diss.html and you need an adapter to use burnable roms http://ist.uwaterloo.ca/~schepers/roms.html

      What is this? http://www.h64.de/ could be a useful modification. I must find out!
      It uses an AT29C010A flash chip a static ram chip some gal chips and discrete logic.

      C64 Geeking see the above links and:
      http://www.c64.com/
      http://ist.uwaterloo.ca/~schepers/personal.html
      http://www.old-computers.com/museum/computer.asp?c=98

      For more google can help.

      --
      I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
    42. Re:Any other file systems with that feature? by bennomatic · · Score: 1

      Sorry, my snarky "Einstein" comment was meant to be a joke. Wasn't very funny, though, was it. I promise I'll be better next time.

      --
      The CB App. What's your 20?
    43. Re:Any other file systems with that feature? by mrmeval · · Score: 1

      It gave me an excuse to bring up my first computer and some memories. If you feel the need to neo-retro-geek you could take an Arduino and write a program for it and for Processing that would talk to a 1541 through this interface. ;)

      http://lng.sourceforge.net/lunix/cp/c64trans_eng.html

      Finding a working 1541 is up to the reader. :/

      --
      I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
    44. Re:Any other file systems with that feature? by bennomatic · · Score: 1

      See, the truth is, I never had a 1541. It had so many problems, I chose to get the MSD Super Disk, which was faster, more reliable, but wouldn't load copy protected disks. I ended up pirating EA games after I bought them because they wouldn't load.

      --
      The CB App. What's your 20?
    45. Re:Any other file systems with that feature? by mrmeval · · Score: 1

      Who didn't pirate EA? :) I didn't own an MSD Super Disk but had access to one. In 1983-85 there was a BBS in a computer store in Arlington, VA. A kid wrote BBS software that would work with the MSD, and the store owner bought 4 of them for the BBS. I think there were two C64's and two modems at that time and files were duplicated on each. Subscribers could send messages requesting a disk be inserted and the operator, if there, would get the message and do it.

      --
      I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
  7. More reason to be a ZFS fanboy by BitZtream · · Score: 3, Insightful

    I'm wondering how long it's going to take for them to do something with ZFS that actually makes me slow down my overwhelming ZFS fanboyism.

    I just love these guys.

    My virtual machine NFS server is going to have to get this as soon as FBSD imports it, and I'll no longer have to worry about having backup software (like BackupPC, good stuff btw) that does this.

    I don't use high end SANs but it would seem to me that they are rapidly losing any particular advantage to a Solaris or FBSD file server.

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    1. Re:More reason to be a ZFS fanboy by HockeyPuck · · Score: 3, Informative

      The advantages of SANs are easy to realize, they need not necessarily be FibreChannel vs NAS (NFS/CIFS) as a SAN could be iSCSI, FCOE, FCIP, FICON etc..

      -Storage Consolidation compared with internal disk.
      -Fewer components in your servers that can break.
      -Server admins don't have to focus on Storage except at the VolMgr/Filesystem level
      -Higher Utilization (a WebServer might not need 500GB of internal disk).
      -Offloading storage based functions (RAID in the array vs RAID on your server's CPU, I'd rather the CPU perform application work rather than calculating parity, replacing failed disks etc). This increases when you want to replicate to a DR site.

      This is not a ZFS vs SANs argument. I think ZFS running on SAN based storage is a great idea as ZFS replaces/combines two applications that are already on the host (volmgr & filesystem).

    2. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 5, Informative

      How about this: you can't remove a top-level vdev without destroying your storage pool. That means that if you accidentally use the "zpool add" command instead of "zpool attach" to add a new disk to a mirror, you are in a world of hurt.

      How about this: after years of ZFS being around, you still can't add or remove disks from a RAID-Z.

      How about this: If you have a mirror between two devices of different sizes, and you remove the smaller one, you won't be able to add it back. The vdev will autoexpand to fill the larger disk, even if no data is actually written, and the disk that was just a moment ago part of the mirror is now "too small".

      How about this: the whole system was designed with the implicit assumption that your storage needs would only ever grow, with the result that in nearly all cases it's impossible to ever scale a ZFS pool down.
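
      The add-versus-attach difference above is easier to see in a toy model. This is only a sketch: the `Pool` class and its methods are hypothetical names made up for illustration, not ZFS code or its API, and the "cannot remove a top-level vdev" rule is simply the behaviour described in this comment.

```python
# Toy model of a storage pool: a list of top-level vdevs, each a list of disks.
# All names here are hypothetical; this is not how ZFS is implemented.

class Pool:
    def __init__(self, mirror_disks):
        # Start with a single top-level vdev that mirrors the given disks.
        self.vdevs = [list(mirror_disks)]

    def attach(self, existing_disk, new_disk):
        """Add new_disk as another side of the mirror that holds existing_disk."""
        for vdev in self.vdevs:
            if existing_disk in vdev:
                vdev.append(new_disk)
                return
        raise ValueError(f"{existing_disk} is not in this pool")

    def add(self, new_disk):
        """Add new_disk as a NEW top-level vdev alongside the existing ones."""
        self.vdevs.append([new_disk])

    def detach(self, disk):
        """Remove one side of a mirror; refuse to remove a whole top-level vdev."""
        for vdev in self.vdevs:
            if disk in vdev:
                if len(vdev) == 1:
                    raise RuntimeError("cannot remove a top-level vdev")
                vdev.remove(disk)
                return
        raise ValueError(f"{disk} is not in this pool")


intended = Pool(["disk0", "disk1"])
intended.attach("disk0", "disk2")      # three-way mirror, still one vdev
print(intended.vdevs)

mistake = Pool(["disk0", "disk1"])
mistake.add("disk2")                   # unprotected second top-level vdev
print(mistake.vdevs)
try:
    mistake.detach("disk2")
except RuntimeError as err:
    print("stuck:", err)
```

      In the model, the mistaken `add` leaves a single-disk vdev that can never be taken out again, which is the "world of hurt" case.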

    3. Re:More reason to be a ZFS fanboy by Methlin · · Score: 4, Informative

      Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point. Currently the only solution to these is to build a new storage pool, either on the same system or a different system, and export/import; big PITA and potentially expensive. Off the top of my head I can't think of anyone that lets you do #2 except enterprise storage solutions and Drobo.

    4. Re:More reason to be a ZFS fanboy by phoenix_rizzen · · Score: 1

      Or, use ZFS to create a SAN for your other servers. Just create a ZVol, and share it out via iSCSI. On Solaris, it's as simple as setting shareiscsi for the dataset. On FreeBSD, you have to install an iSCSI target (there are a handful available in the ports tree) and configure it to share out the ZVol.

    5. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      You make some good points about ZFS annoyances.

      I've seen some recent activity around the first limitation you mention (i.e. you can't remove a top-level vdev), so hopefully we'll see a fix soon.

      You may have missed that there's now a ZFS property you can set to control whether pools automatically expand into free space. Note that previously autoexpansion could only happen if you gave ZFS entire disks without partitions.

    6. Re:More reason to be a ZFS fanboy by Just+Some+Guy · · Score: 1

      What do you know - you and I actually agree on something. Yeah, FreeBSD + ZFS is a complete win for pretty much everything involving file transfer. I honestly can't think of a single thing I don't like about it. The instant FreeBSD imports this, I'm swapping in a quad-core CPU to give it as much crunching power as it wants to do its thing.

      --
      Dewey, what part of this looks like authorities should be involved?
    7. Re:More reason to be a ZFS fanboy by afidel · · Score: 1

      Or use a pair of them like the Sun Unified storage cluster using the 7310/7410. Of course Sun charges a fairly hefty fee for what you get (I got 72x450GB 15k drives in my EVA6400 for what they charge for the same storage in SATA, and mine included 5 years of support).

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    8. Re:More reason to be a ZFS fanboy by SLi · · Score: 2, Interesting

      Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point.

      Only if it's costworthy. To take a case I know about, XFS lacks filesystem shrinking too, and it has been asked for many times. It has been estimated that it would take months for a skilled XFS engineer to code. If it's so important that someone is willing to put up that money (or effort), it may happen; otherwise it will not. I'm sure the same applies to ZFS.

    9. Re:More reason to be a ZFS fanboy by Drishmung · · Score: 1

      And a 255 byte filename limit. Not 255 unicode characters, 255 bytes. ReiserFS got this right. Btrfs alas gets it wrong. (Just call me picky)
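
      The bytes-versus-characters point is easy to check. A minimal sketch, assuming names are stored as UTF-8 (the comment does not say which encoding); `fits_limit` and `NAME_LIMIT_BYTES` are names invented for the example.

```python
# Compare character count with encoded byte length for a file name.
# The 255-byte limit is the figure quoted above; UTF-8 is an assumption.

NAME_LIMIT_BYTES = 255

def fits_limit(name: str, encoding: str = "utf-8") -> bool:
    """Return True if the encoded name is within the byte limit."""
    return len(name.encode(encoding)) <= NAME_LIMIT_BYTES

ascii_name = "a" * 255          # 255 characters, 1 byte each
accented = "\u00e9" * 128       # 128 characters, 2 bytes each in UTF-8
cjk = "\u6587" * 86             # 86 characters, 3 bytes each in UTF-8

for name in (ascii_name, accented, cjk):
    print(len(name), len(name.encode("utf-8")), fits_limit(name))
```

      Under that assumption a name of 86 CJK characters is already over the limit, while 255 ASCII characters still fit.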

      --
      Protoplasm. Quiet Protoplasm. I like quiet protoplasm.
    10. Re:More reason to be a ZFS fanboy by KonoWatakushi · · Score: 3, Informative

      How alarmist and uninformed; borderline FUD. The reality is as follows...

      First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now. Same with crypto.

      Second, mistakenly typing add instead of attach will result in a warning that the specified redundancy is different, and refuse to add it.

      Third, yes, you can't expand the width of a RAID-Z. You can still grow it though, by replacing it with larger drives. Once the block pointer rewrite work is merged, removal will be possible, and expansion won't be far off either.

      Fourth, vdevs no longer autoexpand by default. If you want that behavior, you can set the autoexpand property to on.

      Last, there was no such assumption, it is simply a matter of priorities. If it were an easier problem, it would have been done long ago, but I'm happy to be patient, knowing that it will be done right. Most everyone who has seriously used ZFS will understand that the advantages will hugely outweigh these minor nits, which are easily worked around.

    11. Re:More reason to be a ZFS fanboy by greg1104 · · Score: 4, Informative

      How alarmist and uninformed; borderline FUD. The reality is as follows...

      First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.

      The bug report for this problem goes back to at least April of 2003. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.

    12. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 1, Informative

      How alarmist and uninformed; borderline FUD. The reality is as follows...

      First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.

      The bug report for this problem goes back to at least April of 2003. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.

      This is called "block pointer (bp) rewrite" in ZFS parlance. It was talked about at SNIA 2009 (p. 18):

      http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf

      As well as Kernel Conference Australia 2009 (~40:00):

      http://blogs.sun.com/video/entry/kernel_conference_australia_2009_jeff

      Jeff Bonwick and Bill Moore said that it'd be committed by the end of this year. (Along with dedupe (done) and crypto.) They're cutting it a bit close, but I think the issues mentioned by the GP will not be a problem Real Soon Now.

    13. Re:More reason to be a ZFS fanboy by symbolset · · Score: 2, Insightful

      I'm curious about these storage needs that shrink. Is this a hypothetical case, or can you provide a real world citation of an example? In a broad world many strange things are found but I always considered this one mythical.

      --
      Help stamp out iliturcy.
    14. Re:More reason to be a ZFS fanboy by KonoWatakushi · · Score: 1

      Raising awareness is fine, but the parent clearly has an axe to grind, and is also presenting information which is far out of date. Incidentally, the Limitations section of the wiki on ZFS has been shrinking considerably, and resizing is the only notable one left.

      Vdev removal has been a long requested feature, but I don't recall seeing any promises, or evidence of anyone seriously working on it until relatively recently. It may have slipped a year or so, but the developers have been rather open about their continued efforts, and the difficulty of the problem. It is clearly a priority at Sun now.

      I admit that it has been a difficult wait, but I have been more worried about Oracle yanking the plug on recent activities, than the ZFS team not delivering as promised. Seeing the dedup work integrated does put my mind at ease regarding that.

    15. Re:More reason to be a ZFS fanboy by spinkham · · Score: 1

      Linux md softraid lets you add and remove disks, change raid levels, and generally do other awesome stuff.

      --
      Blessed are the pessimists, for they have made backups.
    16. Re:More reason to be a ZFS fanboy by Samah · · Score: 1

      I've been a ZFS fanboy for quite a while now, and last weekend I finally made the shift from Ubuntu to OpenSolaris on my server. At the moment I'm just using a basic mirrored pool and no raid-z (it scares me).

      The only thing that really bugs me at the moment is its poor support for ext2/3. I was getting ridiculously slow transfer speeds from my old drives (sub 100k/s) and/or hard locks where I've had to kill off the copy process and hope that I can unmount it.

      I've been booting into Ubuntu, copying files over the network to my desktop PC, booting OpenSolaris, and copying back. I'm sure there's an easier way to copy 4TB, but I'm not in a hurry.

      What have been your experiences in migration from Linux to OpenSolaris?

      *Disclaimer: I use Solaris 10 at work.

      --
      Homonyms are fun!
      You're driving your car, but they're riding their bikes there.
    17. Re:More reason to be a ZFS fanboy by HockeyPuck · · Score: 1

      Just create a ZVol, and share it out via iSCSI.

      The TCP overhead on iSCSI is too great when you start approaching gigE speeds. So now you're looking at purchasing a TOE (TCP Offload Engine) card to deal with that, or a dedicated iSCSI adapter. Might as well go buy a dedicated iSCSI array at that point. Plus I want my disk arrays to have enough intelligence and software to do their specific job. A general purpose OS has too many places where I have to patch/upgrade etc.

      Now if Sun would enable a FCOE target then we've got something. I don't even need a dedicated adapter for it, since there is no heavy TCP to deal with, just layer 2 (FC being "layer 3", which is a much lighter protocol than TCP).

    18. Re:More reason to be a ZFS fanboy by Hurricane78 · · Score: 1

      And how about this: The Linux FUSE ZFS implementation (the only one on Linux) eats half of your (not the newest generation) processor cores and 600MB RAM for breakfast. Yes, that's right. It uses that many resources.

      Although I must say, for my archive, it's still worth it. Because it's the only thing that can protect my data from the data corruption that happens more and more often with "modern" HDDs.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    19. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      ZFS is open source. If an open bug hurts you that much, you can either fix it yourself, or hire somebody to fix it.
      Then give the fix back to the community. WTP, other than that it costs your time and money?

    20. Re:More reason to be a ZFS fanboy by paulhar · · Score: 2, Informative

      TCP overhead at 1GbE for a modern processor is negligible - you're only talking about processing 120MB/sec or so.
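
      For reference, the "120MB/sec or so" figure can be reproduced with a back-of-the-envelope calculation. This is a sketch under stated assumptions (standard 1500-byte MTU, IPv4, TCP headers without options, full-size frames back to back); the function name is made up for the example.

```python
# Rough payload rate for TCP over gigabit Ethernet.
# Assumes a 1500-byte MTU, IPv4 and TCP headers without options.

LINK_BITS_PER_SEC = 1_000_000_000
MTU = 1500                       # IP packet size in bytes
IP_HEADER = 20
TCP_HEADER = 20
ETH_OVERHEAD = 14 + 4 + 8 + 12   # Ethernet header, FCS, preamble, inter-frame gap

def payload_mb_per_sec(mtu: int = MTU) -> float:
    """Return the TCP payload rate in MB/s (decimal megabytes)."""
    payload = mtu - IP_HEADER - TCP_HEADER
    on_wire = mtu + ETH_OVERHEAD
    bytes_per_sec = LINK_BITS_PER_SEC / 8
    return bytes_per_sec * payload / on_wire / 1_000_000

print(f"raw link rate: {LINK_BITS_PER_SEC / 8 / 1e6:.0f} MB/s")
print(f"TCP payload:   {payload_mb_per_sec():.1f} MB/s")
```

      The result lands a little under the raw 125 MB/s, which is the amount of data the CPU has to push through the TCP stack each second.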

      Here is a document including a pretty graph: http://media.netapp.com/documents/tr-3628.pdf

      "...enabling the TCP Offload Engine (TOE) on the Linux hosts did not noticeably affect performance on the IBM blade side."

    21. Re:More reason to be a ZFS fanboy by bertok · · Score: 1

      The advantages of SANs are easy to realize. They need not necessarily be FibreChannel vs NAS (NFS/CIFS), as a SAN could be iSCSI, FCOE, FCIP, FICON, etc.

      -Storage Consolidation compared with internal disk.
      -Fewer components in your servers that can break.
      -Server admins don't have to focus on Storage except at the VolMgr/Filesystem level
      -Higher Utilization (a WebServer might not need 500GB of internal disk).
      -Offloading storage based functions (RAID in the array vs RAID on your server's CPU, I'd rather the CPU perform application work rather than calculating parity, replacing failed disks etc). This increases when you want to replicate to a DR site.

      This is not a ZFS vs SANs argument. I think ZFS running on SAN based storage is a great idea as ZFS replaces/combines two applications that are already on the host (volmgr & filesystem).

      I swear, I used to believe in this stuff, but I'm starting to see that it's more marketing myth than technical wizardry.

      I love the way that RAID used to stand for Redundant Array of Inexpensive Disks, but according to EMC and their cohorts, it now stands for Redundant Array of Independent Disks. Notice the way they dropped the problematic "inexpensive" part?

      It's one thing to reduce your costs by consolidating local disks onto cheaper networked disks, but my experience is that SAN arrays usually cost more than internal disk, even though they should be cheaper.

      The genius of ZFS is that it's a return to the "inexpensive" part of RAID. An administrator can take a bunch of "low-end" SATA disks, apply some ZFS magic, and end up with performance and reliability numbers that would make your jaw drop. I've seen benchmarks that claim that a SUN Thumper, a single box of a mere 4 rack units, can do 1GB/sec of IO throughput. Not 1 gigabit per second, but 1 gigabyte per second. That's faster than the best 8Gb FC SANs!

      More importantly, any competent admin can manage a Solaris ZFS filer. The ZFS command line utilities are simple to use, and I say this as a self-confessed "Windows Server Administrator". Compare this to most SAN arrays, which are so complex that most enterprises won't allow anyone but a "certified administrator" to even touch them.

      Single instancing is a big step towards ending the dominance of the big players on the storage market. As soon as someone creates a software RAID like ZFS that has integrated controller redundancy instead of just storage redundancy, the era of traditional FC SANs is over. There's essentially nothing that a "hardware" RAID does other than controller redundancy that hasn't been already implemented in software. It's just a matter of time now...

    22. Re:More reason to be a ZFS fanboy by this+great+guy · · Score: 1

      The most well-known and probably most used RAID implementation, Linux software RAID, has been able to do #2 since 2006. It strikes me how few people know this. It is called reshaping (see mdadm --grow).

      In all fairness to ZFS, it took Linux more than 10 years to implement reshaping (raid5 support was added around 1995-1996). ZFS was only released for production use 3 years ago, in Solaris 10 6/06 "U2".

    23. Re:More reason to be a ZFS fanboy by TheRaven64 · · Score: 1

      If the problem is the Solaris ext2 driver, then there are a few things that you could try. You could export the ext2 disk via iSCSI or to a VM and get a Linux or FreeBSD machine to read it. Alternatively, I think there is a FUSE driver for ext2, which should work too. Both of these are likely to be slow, but they should be a lot faster than 100KB/s. I find it hard to believe that the FS driver is really the problem though. It seems more likely that the old drives are in PIO mode for some strange reason.

      --
      I am TheRaven on Soylent News
    24. Re:More reason to be a ZFS fanboy by TheRaven64 · · Score: 1

      Sounds more like a reason not to be a Linux fanboy than a reason not to be a ZFS fanboy. On FreeBSD or OpenSolaris you don't have the extra copying and system call overhead from FUSE, so the CPU load is much lower (it still wants a lot of RAM though).

      --
      I am TheRaven on Soylent News
    25. Re:More reason to be a ZFS fanboy by drsmithy · · Score: 1

      The TCP overhead on iSCSI is too great when you start approaching gigE speeds. So now you're looking at purchasing a TOE (TCP Offload Engine) card to deal with that, or a dedicated iSCSI adapter.

      Firstly, the overhead is insignificant on any remotely modern CPU.

      Secondly, are any NICs but the cheapest, nastiest ones even available without TOE these days?

    26. Re:More reason to be a ZFS fanboy by asaul · · Score: 1

      My word from inside Sun was that BP rewrite was putback a few months ago. This was from the organiser of the Australia conference.

      --
      "If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
    27. Re:More reason to be a ZFS fanboy by asaul · · Score: 1

      I have a case of it. Some time ago an unofficial work server I ran needed some disk, so the SAN admin kindly donated some 1TB of unwanted 5400 rpm PATA Clariion disks, even though I only wanted 300G or so. I put ZFS on it and left it at that.

      Anyway, now the server is important, it performs like a dog because the PATA disks are crap, and the Clariion is on the way out. So now it needs to be moved.

      So my only option is ZFS send/recv - which is reasonably slow, or a backup/restore, again slow. However being on UFS would have made no difference, still a dump/restore operation.

      At a guess the only combination of filesystem/VM that might have done this is VXFS + VXVM, but that is nowhere near free and personally seems to cause as many panics as UFS did.

      So it's not really a ZFS-only issue per se, but occasionally the need is there.

      --
      "If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
    28. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      How about you go back playing with your toy OS of choice

      How about you RTFM

      How about you understand how ZFS works, what's its model of storage, what are devices and vdevs, and that they are not interchangeable.

      How about you don't expect ZFS to behave the way you want? For the fantastic tool that it is, oh, and free, you complain loudly if things aren't done The Way Ignorants Think Ought To Be Done(TM).

      How about you issue the right commands to offline a device that you plan to substitute instead of telling ZFS to forget about it for the time being and then complain that ZFS did what you told it to?

      Downsizing a pool is the only half-sane requirement for ZFS that you manage to write (and that's only because you saw it written somewhere else). I might half-agree to that but I fail to see how is that a priority in our ever-expanding, data guzzling environments. Again, ZFS is meant as an enterprise tool. The only time you want to delete data in a company is because you are closing for good.

      How about you have a nice day, and RTFM.

    29. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      I find it interesting that using NFS was faster than using iSCSI. Was this possibly the result of their NFS drivers being better? It seems like iSCSI should be less overhead, so this seems odd.

    30. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      I usually tie shrinking in with vdev removal. Both of these are a large business requirement, and likely less important to smaller shops or home usage.

      Both fulfill similar roles when dealing with expensive SAN storage. SAN is commonly shared out as LUNs (chunks of storage) and then parceled up on the client system.

      Common large enterprise activities include migrating SAN:
          - as part of the continuous hardware lifecycle,
          - due to problems,
          - for performance,
          - and so on.

      * Assuming the LUNs from the two devices are *exactly* the same size, you can do a replace in ZFS. But this isn't ideal, and often (at least here) LUN sizes are close, but not exact. (BTW Growing during migration is not a good option here).
          + Some SAN has restrictions on LUN sizes, so migrating can involve a change in the overall number of LUNs ( 4 x 50G --> 2 x 100G)

      * Temporary growth (high load), can be *very* expensive to compensate for on hundreds or thousands of systems. Being able to grow for exceptional circumstances, then shrink again later, can save millions.

      And yes, these can be worked around. But the work-arounds are often awkward, and hard to justify, when a paid product like Veritas obviously provides the capability.

      These workarounds often require downtime - any downtime can be expensive in large organizations.

      I prefer ZFS, Veritas licenses are a rip-off, but in a business case, ZFS currently means regular downtimes when storage changes.

    31. Re:More reason to be a ZFS fanboy by greed · · Score: 1

      Well, one thing, iSCSI has to transfer disk blocks, and NFS actually knows about files. So the NFS server can do read-ahead knowing the structure of a directory, or the inode, or the file itself. But iSCSI may not read-ahead in the right pattern, because first you have to read the directory blocks, then the inode blocks, then the file blocks, and they may not be adjacent and sequential.

      You may also need to transfer more than you need using iSCSI. "Open this file for me" on NFS doesn't take a very large packet at all, as the client doesn't need to know anything about the server's underlying filesystem.
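
      The dependent-reads argument can be sketched with a toy model. Everything here is invented for illustration (the block numbers, the layout, the function names), and counting the file-protocol case as a single request is a deliberate simplification: a real NFS client issues several calls such as lookup and read.

```python
# Toy comparison of a block protocol and a file protocol reading one file.
# The on-disk layout and request counts are invented for illustration only.

DISK = {
    10: {"type": "dir", "entries": {"report.txt": 42}},    # directory block
    42: {"type": "inode", "data_blocks": [7, 93, 12]},     # inode, scattered data
    7: {"type": "data", "bytes": b"hello "},
    93: {"type": "data", "bytes": b"block "},
    12: {"type": "data", "bytes": b"world"},
}
ROOT_DIR_BLOCK = 10

def read_via_blocks(name):
    """Client-side filesystem: each read depends on the previous block's contents."""
    requests = 0
    directory = DISK[ROOT_DIR_BLOCK]
    requests += 1
    inode = DISK[directory["entries"][name]]
    requests += 1
    data = b""
    for block_no in inode["data_blocks"]:
        data += DISK[block_no]["bytes"]
        requests += 1
    return data, requests

def read_via_file_server(name):
    """Server-side filesystem: the server walks its own disk locally."""
    data, _local_reads = read_via_blocks(name)
    return data, 1   # simplified: one network request naming the file

print(read_via_blocks("report.txt"))
print(read_via_file_server("report.txt"))
```

      Because each block read in the first function needs the result of the previous one, the target cannot guess a useful read-ahead pattern, whereas the file server sees the whole access.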

      Right tool for the job and all that....

      (This is not to be taken as an endorsement of NFS.)

    32. Re:More reason to be a ZFS fanboy by its · · Score: 1

      I am building a new home NAS and I have been seriously considering zfs. However, I most likely won't go this route because there exist zfs failures that are catastrophic.
      http://www.opensolaris.org/jive/thread.jspa?threadID=108213&tstart=0

      I really don't want to lose all my data in the filesystem because the machine locked up at the wrong time. I may reconsider zfs once automated recovery tools become available.

    33. Re:More reason to be a ZFS fanboy by QuantumRiff · · Score: 1

      ZFS also can export as iSCSI, so really, it is doing 90% of what other SAN solutions do. I just don't know much about its failover and clustering of a group of machines (also known as network RAID in some iSCSI products).

      But yeah.. When I was shopping for a SAN, I was often asking vendors how their stuff compared performance and price/TB wise to a Sun 4500 server.. They would get very, very quiet.

      --

      What are we going to do tonight Brain?
    34. Re:More reason to be a ZFS fanboy by Big+Boss · · Score: 1

      Can't you put some decent SATA drives in there along with the crap PATA units and use "zpool replace"?

    35. Re:More reason to be a ZFS fanboy by Big+Boss · · Score: 1

      I built a new server when I installed OpenSolaris, so I just booted the new server and used NFS to copy the data over. It worked really well. raidz is working really well for me, you should set up a test array and try it out. Even as file devices just for testing.

    36. Re:More reason to be a ZFS fanboy by Samah · · Score: 1

      Yeah unfortunately I'm using the same hardware but with a different boot drive, so I can't really run them both at the same time. :)

      I read up on raid-z a bit more, but I think I'd prefer to have a filesystem I can grow at any time (easily). Hard drives are pretty inexpensive at the moment, so I don't really mind having to buy twice the drives for the space I want. It really bugs me that ZFS still doesn't support removal of vdevs and/or pool-shrinking (that reminds me of a Seinfeld episode actually...)

      Btrfs looks interesting, but I have this feeling that once I get OpenSolaris set up (and I actually learn how to use the bloody thing), I won't want to move back to Linux. :)

      --
      Homonyms are fun!
      You're driving your car, but they're riding their bikes there.
    37. Re:More reason to be a ZFS fanboy by Samah · · Score: 1

      You could export the ext2 disk via iSCSI or to a VM and get a Linux or FreeBSD machine to read it.

      I can't boot both OpenSolaris and Ubuntu at the same time since it's the same hardware, but do you think it would be possible to mount and boot my Ubuntu drive under VirtualBox? I've not used VirtualBox for anything other than booting from a disk image. If so, would it be easy to mount the other physical disks within the VM, then share them to the host OS? I'm thinking that would be the best solution, but I'll give the FUSE driver a try first.

      On a semi-related note, I was looking for a better solution to my old setup (7 drives with thousands of symlinks) and I stumbled upon zfs-fuse. Thus began my love for ZFS. ;)

      --
      Homonyms are fun!
      You're driving your car, but they're riding their bikes there.
    38. Re:More reason to be a ZFS fanboy by TheRaven64 · · Score: 1

      You can give a VirtualBox VM access to a real disk, see the manual under 'raw disk access'. Sharing them back to the host OS should be pretty trivial too. You can, for example, run an NFS server in an Ubuntu VM. Effectively you then have a very ad-hoc implementation of FUSE using an entire Linux kernel and VM as a filesystem driver (which is more or less what Xen does too). Note that you're still using the Solaris disk driver, just not the Solaris filesystem driver, so if the problem is with the block device (as I suspect it may be) then this won't be any faster. You can test this by using dd on the (unmounted) disk device node (just write the output to /dev/zero or something).
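
      If dd is not handy, the same raw-read test can be approximated in a few lines. This is a rough sketch, not a benchmark tool: `read_throughput` is a made-up helper, the demo reads a scratch file rather than a real disk, and pointing it at an actual device node (whose path depends on your system) normally needs root.

```python
# Time a sequential read and report throughput, similar in spirit to reading
# a device with dd and discarding the output.

import os
import tempfile
import time

def read_throughput(path, block_size=1024 * 1024, max_bytes=256 * 1024 * 1024):
    """Read up to max_bytes from path in block_size chunks; return (bytes, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as dev:
        while total < max_bytes:
            chunk = dev.read(min(block_size, max_bytes - total))
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1_000_000 if elapsed > 0 else float("inf")
    return total, rate

# Demo on a scratch file; substitute the device node to test a real disk.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\0" * (4 * 1024 * 1024))
try:
    nbytes, mb_s = read_throughput(scratch.name)
    print(f"read {nbytes} bytes at {mb_s:.1f} MB/s")
finally:
    os.unlink(scratch.name)
```

      A scratch file will mostly be served from the page cache, so only a run against the unmounted device says anything about the disk or its driver.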

      --
      I am TheRaven on Soylent News
    39. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      In the case you mention, the user was running ZFS inside a VirtualBox VM without telling VirtualBox to honor disk flush requests. By default, VirtualBox (and possibly VMware) plays fast and loose by caching and reordering disk writes issued by VMs and ignoring cache flushes in order to improve performance. Any file system will have problems if you power off a system while caching data that the file system believed were committed to disk.

      If you insist on using storage that lies to the OS about whether data has actually been written (e.g. some cheap USB drives), OpenSolaris just got the ability to back out the last few ZFS transactions after a reboot until a consistent state is reached.

    40. Re:More reason to be a ZFS fanboy by symbolset · · Score: 1

      This sounds more like a server refresh problem than a capacity shrinkage problem. I believe that if you have a modern server capable of PCIe and your capacity needs are limited you can migrate your Clariion array to something like this and net more performance.

      And if you ain't got the wherewithal to do that, how important is your data anyway?

      --
      Help stamp out iliturcy.
    41. Re:More reason to be a ZFS fanboy by jabuzz · · Score: 1

      Works fine on IBM's GPFS file system, I can shrink it just fine while it's mounted and in use.

    42. Re:More reason to be a ZFS fanboy by jabuzz · · Score: 1

      Nope, IBM's GPFS would have handled that fine. Add in a new NSD, then take the old one out and wait a while.

      You also get a whole bunch of other features that ZFS does not have as well.

    43. Re:More reason to be a ZFS fanboy by jabuzz · · Score: 1

      Perhaps for performance he needs 15k RPM SAS/FC disks, and given he only needs 300GB (which could be done from a single disk RAID1), a whole 1TB of fast disks is a huge waste of money.

    44. Re:More reason to be a ZFS fanboy by its · · Score: 1

      From the discussion in the mailing list, it seems fairly certain that this can happen without Virtualbox. However, hardware is not bug free either. The issue is that zfs doesn't gracefully handle these failures. Most of the data on the disk were not affected in any way, yet the whole filesystem was hosed. See also the thread: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-February/026087.html where simply yanking a USB disk with no mounted filesystems resulted in catastrophic failure. But it appears that something has been done about this issue and the patches have just made it into snv_128. http://bugs.opensolaris.org/view_bug.do?bug_id=6667683 I will have to reconsider my options now. There are many things that look enticing in zfs.

    45. Re:More reason to be a ZFS fanboy by duffbeer703 · · Score: 1

      Another example: Your company gets sued and you have to capture user data from 200 computers. The lawyers do their thing a year later and tell you that you can throw out data from 150 of them. Now you have 10TB of empty SAN that costs you $27/GB/mo.

      --
      Conformity is the jailer of freedom and enemy of growth. -JFK
    46. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 0

      Who should listen to you when you got spanked by an ac here today right here http://slashdot.org/comments.pl?sid=1429510&cid=29979500 , and here http://slashdot.org/comments.pl?sid=1429510&cid=29980114 twice in a row? You're obviously no expert and anyone can take a read in those links, the second one mostly, and at how badly you messed up on your comments on a se windows moron.

  8. Dupe dedupe de dupe dupe! by dangitman · · Score: 0

    Dee dupe de dupe!

    Drey dupe de drupes!

    Dey dook dour dobbs!

    Dey took Lou Dobbs!

    Dey drook our jobs!

    They took our jobs!

    Signed,

    Slashdot editors

    --
    ... and then they built the supercollider.
  9. Next home server will be OpenSolaris (or fBSD) by 0100010001010011 · · Score: 2, Insightful

    ZFS, from what I can tell, kicks ass. I've played around with it in virtual machines, taking drives off line, recreating them, adding drives, etc.

    When I search NewEgg I also search OpenSolaris' compatibility list.

    The two areas where Linux is playing catch-up are filesystems (like this) and sound (OSS, Pulse, Alsa, oh my!). And before you go pointing out the btrfs project, this has been in servers for years. It's tried in an enterprise environment. Your file system is still in beta with a huge "Don't use this for important stuff" warning.

    1. Re:Next home server will be OpenSolaris (or fBSD) by buchner.johannes · · Score: 2, Funny

      Oh yeah? Well tux is cuter so I'm not switching.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    2. Re:Next home server will be OpenSolaris (or fBSD) by buchner.johannes · · Score: 2, Interesting

      I'm sure btrfs -- once fully implemented and tested -- will also have problems reaching the performance of reiser4.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    3. Re:Next home server will be OpenSolaris (or fBSD) by Anonymous Coward · · Score: 0

      You can download Sun's prebuilt storage appliance VM here.

      It gives you a free GUI storage appliance wrapper around OpenSolaris and ZFS, so you can start using the features without being an expert in either (just like NetApp with BSD and WAFL). You can replace the virtual disks with real ones if you want to store serious data.

    4. Re:Next home server will be OpenSolaris (or fBSD) by Anonymous Coward · · Score: 0

      It was tried here by our new (now ex-) Director of Technology. He was fired shortly after we found that the performance of ZFS was about 30-40% that of other file systems.

    5. Re:Next home server will be OpenSolaris (or fBSD) by Ant+P. · · Score: 1

      btrfs isn't meant to compete with reiser4. If you want that, go follow Tux3 instead.

  10. Another Lawsuit? by yukonbob · · Score: 1

    Considering what's going on between NetApp and Sun currently, I wonder what they'll think of this?

    -yb

  11. Wake me when they build it into the hard disk by icebike · · Score: 4, Interesting

    Imagine the amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics. It could even do this quietly in the background.

    I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

    But since the copy operation de-compressed files on the fly we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copy and delete files beginning with the smallest, and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

    De-duplication is pretty much the same thing, compression by recording and eliminating duplicates. But any minor automated update of some files runs the risk of changing them such that what was a duplicate, must now be stored separately.

    This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").

    For archival stuff or OS components (executables, and source code etc) which virtually never change this would be great.

    But there is hell to pay somewhere down the road.

    --
    Sig Battery depleted. Reverting to safe mode.
    1. Re:Wake me when they build it into the hard disk by Shikaku · · Score: 1

      That's actually very easy to explain, and ZFS could have a very similar situation:

      Say you have these two files on your hard drive, each of which is really 1GB worth of data (the space separates the two files):

      ABCDABCD ABCDABCD

      Every letter has equal weight, so with deduplication those two files are stored in 0.5GB, without compression. Let's change it a little bit:

      AeBCDABfCD ABCgDABChD

      efgh are 1 byte.

      You now have 2GB worth of space taken :) that's a gotcha if I ever saw one.

    2. Re:Wake me when they build it into the hard disk by Shikaku · · Score: 1

      Oh, I guess I should mention the blocks in my case are stupidly large, and the point is data insertion/shifting can cause sudden increases in size with block level deduplication.
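
      A rough sketch of that point in Python. Everything here is an assumption for illustration (the 32-byte block size, the synthetic data, the helper name unique_blocks); it is not how ZFS lays out blocks, only a toy fixed-block deduplicator keyed on SHA-256. Two identical files dedupe to one copy, but inserting a single byte at the front of one of them shifts every block boundary, so almost nothing matches any more:

```python
import hashlib

BLOCK_SIZE = 32  # toy block size in bytes (real filesystems use far larger blocks)


def unique_blocks(files, block_size=BLOCK_SIZE):
    """Return the set of SHA-256 digests of every fixed-size block in the files."""
    digests = set()
    for data in files:
        for offset in range(0, len(data), block_size):
            digests.add(hashlib.sha256(data[offset:offset + block_size]).digest())
    return digests


# 32 distinct 32-byte blocks of synthetic, non-repeating data (hypothetical file content).
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))

# Two identical files: every block of the second one is a duplicate.
before = len(unique_blocks([original, original]))

# Insert ONE byte at the front of the second file: all its block boundaries shift.
after = len(unique_blocks([original, b"e" + original]))

print(before, after)
```

      With these toy inputs the number of stored blocks roughly doubles after a one-byte insertion. Overwriting bytes in place, by contrast, only touches the blocks that contain them.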

    3. Re:Wake me when they build it into the hard disk by dgatwood · · Score: 1

      That's just classic bad design. There's no reason for the decompressed files to exist on disk at all just to decompress them. The software should have decompressed to RAM on the fly instead of storing the decompressed files as temp files on the hard drive. It's all probably because they made a poor attempt at shoehorning compression into a VFS layer that was too block-centric. Classic bad design all around.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    4. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      True, though commodity grade hard drives are so inexpensive these days that the cost of providing a generously larger amount of them than what you plan to store is usually not a big deal.
      The only really expensive drives today are high end enterprise type SAS / SAN / SCSI units or FLASH based ones. If you're storing media like digitized video, the benefits of dedup are usually insignificant since you're unlikely to accidentally / routinely have duplicated data at anything less than the file level, and at the file level you'd probably have easy options not to duplicate that content by design if so desired.

      The REAL "wake me up when they integrate it into the drive itself" list for me is:
      * drive integrated mirroring at a head/platter level with the functions of different platters being independent enough so you could still stand a good chance of reading functional ones even after one head/platter is damaged.

      * drive integrated gigabit / 10GbE ethernet interfaces for commodity drives and an iSCSI protocol over IPv6.

      * drive integrated ECC and spatial data striping at user selectable and much higher than default levels so that even a single drive could give you much better data reliability / redundancy across platters.

      * drives with integrated encryption being the norm

      * drives with built in ZFS / NAS and the ability to link to each other over e.g. PCIe / InfiniBand so you could set up small clusters of RAIDZ'd drives just with a few cables and inexpensive drives.

    5. Re:Wake me when they build it into the hard disk by icebike · · Score: 3, Interesting

      Bad design on Novell's part, but the problem persists in the de-duplicated world, where de-duplicating to memory only is not a solution.

      Imagine a hundred very large files containing largely the same content. Now imagine CHANGING just a few characters in each file via some automated process. Now 100 files which were actually stored as ONE file balloon to 100 large files.

      On a drive that was already full, changing just a few characters (not adding any total content) could cause a disk full error.

      You really can't fake what you don't have. You either have enough disk to store all of your data or you run the risk of hind-sight telling you it was a really bad design.

      --
      Sig Battery depleted. Reverting to safe mode.
    6. Re:Wake me when they build it into the hard disk by Znork · · Score: 1

      But there is hell to pay somewhere down the road.

      I'd certainly expect that. I don't quite get what people are so desperate to de-duplicate anyway. A stripped VM os image is less than a gigabyte, you can fit 150 of them on a drive that costs less than $100. You'd have to have vast ranges of perfectly synchronized virtual machines before you'd have made back even the cost of the time spent listening to the sales pitch.

      I can't really see many situations where the extra complexity and cost would end up actually saving money. The few I can see it would be where somebody's been tricked into buying such excruciatingly expensive SAN storage that they can barely afford to store anything on it any more, or situations where their storage is a complete mess and they can't use more intelligent means of not storing the same thing many times (snapshots, shared file systems, overlay devices, etc). In those cases it seems there would be more to gain by solving the actual problem than tacking another patch onto the stack. Storage, for most purposes, is dirt cheap today.

    7. Re:Wake me when they build it into the hard disk by ArsonSmith · · Score: 3, Informative

      No, you'd still have it stored in the size of one file plus 100 blocks. You'd need a substantially large number of random changes through all 100 files to balloon up from 1x file size to 100x file size.
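
      A back-of-the-envelope check of that arithmetic, as a Python sketch. The block size, the block count and the helper name stored_blocks are made up for illustration, and the set of SHA-256 digests only stands in for a block-level dedup table; it is not ZFS code. One byte is overwritten (not inserted) in a different block of each of 100 identical copies:

```python
import hashlib

BLOCK = 32        # toy block size in bytes
N_BLOCKS = 1000   # blocks per file

# Synthetic file content: 1000 distinct 32-byte blocks.
base = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(N_BLOCKS))


def stored_blocks(files):
    """Count the unique blocks a block-level deduplicator would have to keep."""
    return len({
        hashlib.sha256(f[o:o + BLOCK]).digest()
        for f in files
        for o in range(0, len(f), BLOCK)
    })


copies = [bytearray(base) for _ in range(100)]
before = stored_blocks(copies)

# Overwrite a single byte in each copy, each time in a different block.
for n, copy in enumerate(copies):
    copy[n * BLOCK] ^= 0xFF

after = stored_blocks(copies)
print(before, after)
```

      In this toy setup the 100 edited copies cost one file's worth of blocks plus 100 extra blocks, not 100 files' worth.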

      --
      Paying taxes to buy civilization is like paying a hooker to buy love.
    8. Re:Wake me when they build it into the hard disk by dgatwood · · Score: 1

      True, but that's going to fail when you change the very first file, and one would hope that the process would go no further.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    9. Re:Wake me when they build it into the hard disk by geniusj · · Score: 1

      ZFS dedupe is block level. This would be a problem, however, in file-level dedupe schemes.

    10. Re:Wake me when they build it into the hard disk by perzquixle · · Score: 1

      This doesn't apply to ZFS due to the way it uses drives. All drives are added to a storage pool, and drives are used as needed based on speed and reliability requirements. So to upgrade, you'd just add a new drive to the pool, mark the old drive for removal, wait as it moves the blocks to any other drive(s) in the pool, then remove the old drive.

    11. Re:Wake me when they build it into the hard disk by icebike · · Score: 1

      >I can't really see many situations where the extra complexity and cost would end up actually saving money.

      I could see it for write-only media.
      With the proper byte-range selection, you could probably find enough duplicate blocks in just about anything to greatly expand capacity.

      --
      Sig Battery depleted. Reverting to safe mode.
    12. Re:Wake me when they build it into the hard disk by c6gunner · · Score: 1

      This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").

      Yes, but what's the likelihood of that occurring? We're talking about block-level deduplication here. If you have two identical files and you add a bit to the end of one, you're not creating a duplicate file - you're just adding a few blocks while still referencing the original de-duped file. Now, if you were doing file-level deduplication it might be an issue, but this way ... I can't see it ever being a problem unless your array is already at 99.9% capacity (and that's just a bad idea in general).

    13. Re:Wake me when they build it into the hard disk by icebike · · Score: 1

      You could STILL be stuck with a transaction in mid-flight when you exhaust your storage because what was one block replicated hundreds of times now becomes hundreds of blocks exhausting all storage.

      The ease with which you can add storage only makes it somewhat more palatable. It doesn't hand-wave the problem away.

      Sooner or later you have to upgrade storage on almost every platform. The problem with a platform that uses compression or de-duplication to store more than can really fit on its drives is that you can SUDDENLY run out of storage due to seemingly innocuous tasks. No steadily falling free-disk space to warn you ahead of time.

      --
      Sig Battery depleted. Reverting to safe mode.
    14. Re:Wake me when they build it into the hard disk by PRMan · · Score: 1

      It would be great for ISPs, where each of their user instances have files in common. Also, for a backup drive for user PCs, where each user has the OS and probably a lot of documents in common.

      --
      Peter predicted that you would "deliberately forget" creation 2000 years ago...
    15. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 1, Informative

      I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

      But since the copy operation de-compressed files on the fly we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copy and delete files beginning with the smallest, and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

      This is because you didn't use NetWare's tools to copy the files - the command line NCOPY, for example, with /Ror and /RU (available when file compression was introduced with NetWare 4) would have copied the files in their compressed format, avoiding this (Link: http://support.novell.com/techcenter/articles/ana19940603.html). Using the Novell Client for Windows, I'd imagine that its Explorer shell integration would give you GUI tools, too, though I no longer have a NetWare server to verify this, and always preferred the command line anyway :).

      No offense, but the scenario you describe is the result of ignorance, not poor design.

    16. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      Yes, any process that writes to storage can fail, and any good program will be written to deal with this.

      Sometimes inexperienced developers imagine themselves to be really smart and believe they have come up with a black-magic secret spell that will mean their program will never fail to write to storage. For example, some very inexperienced developers may incorrectly think "it's impossible for a write to fail if you are only replacing bytes in an existing file!"

      Sometimes, a mid-level developer will kindly explain to the junior developer that there are dozens of existing ways that a write might fail, including copy-on-write, software compression, journaling, error recovery, automated versioning, hardware block allocation schemes, and many others.

      And sometimes, a senior-level developer will just say "Programs you write today will fail in five years for reasons that haven't even been invented yet. Don't imagine you can avoid doing the right thing with a clever hack."

    17. Re:Wake me when they build it into the hard disk by icebike · · Score: 1

      >This is because you didn't use NetWare's tools to copy the files - the command line NCOPY, for example, would have copied the files in their compressed format..

      We were moving the server content to Linux. Having it in Novell's format would not have been useful.

      --
      Sig Battery depleted. Reverting to safe mode.
    18. Re:Wake me when they build it into the hard disk by drsmithy · · Score: 1

      I'd certainly expect that. I don't quite get what people are so desperate to de-duplicate anyway. A stripped VM os image is less than a gigabyte, you can fit 150 of them on a drive that costs less than $100.

      Firstly, because dedup gives you the space savings without the hassle of "stripping" the VM image.
      Secondly, because dedup also delivers other advantages by reducing physical disk IOs, improving cache efficiency and reducing replication traffic.
      Thirdly, because enterprise storage costs a lot more than that, especially once you account for backups.

      I can't really see many situations where the extra complexity and cost would end up actually saving money.

      NetApp have quite a few white papers and blogs. The most high profile winner is virtualisation, of course, but things like SAN-booted OS images, mailboxes, backups and data replication also see huge benefits.

    19. Re:Wake me when they build it into the hard disk by Junta · · Score: 1

      But it's not such a huge problem.

      You change a few characters, and the block that contained those characters gets duplicated. If there is insufficient space, that write() syscall fails with an errno indicating that the file system is unable to service the request. You will lose the pending data, but the original will stay intact. That leaves time to grow the pool, or to move the still-readable contents to a new disk. A simple rule for storage providers is that a mere read operation should *never* require additional disk space to succeed, and that a write that cannot be satisfied needs to be dropped.
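
      A minimal sketch of that rule in Python. TinyPool, its capacity limit and its method names are hypothetical, invented purely for illustration; a real filesystem reports the failure through write() and errno rather than through a Python exception. The write that needs a new block fails with ENOSPC, while the existing deduplicated data stays readable:

```python
import errno
import hashlib


class TinyPool:
    """Toy deduplicating block pool with a hard capacity limit (illustrative only)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = {}  # SHA-256 digest -> block contents

    def write_block(self, block):
        key = hashlib.sha256(block).digest()
        if key not in self.blocks:
            # Only a block with new content needs fresh space.
            if len(self.blocks) >= self.capacity:
                raise OSError(errno.ENOSPC, "No space left on device")
            self.blocks[key] = block
        return key

    def read_block(self, key):
        # Reads never allocate, so they cannot fail for lack of space.
        return self.blocks[key]


pool = TinyPool(capacity_blocks=1)
file_a = [pool.write_block(b"shared")]
file_b = [pool.write_block(b"shared")]  # duplicate: no extra space used

try:
    file_b[0] = pool.write_block(b"edited")  # needs a second block: pool is full
    failed = False
except OSError as exc:
    failed = exc.errno == errno.ENOSPC

still_readable = pool.read_block(file_b[0]) == b"shared"
print(failed, still_readable)
```

      The pending edit is lost, but both files still point at the original block.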

      It's harder to predict what effect a change to the disk *will* have on a system (and by extension, have a precise concept of how much 'free' space you really have), but it doesn't have to be that different in behavior from a 'normal' full filesystem today, or a filesystem with sparse files that grows without any 'ls' output changes.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    20. Re:Wake me when they build it into the hard disk by jpampuch · · Score: 1

      Two things to consider:

      1. You don't have to turn it on. Use it if it makes sense for your environment

      2. On a disk that is nearly full, many operations run the risk of not having room enough to complete. But at least with ZFS, you can just add another drive.

    21. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      Because when Novell fails, all subsequent technologies that sound similar will also fail.

    22. Re:Wake me when they build it into the hard disk by QuoteMstr · · Score: 1

      data reliability / redundancy across platters.

      I'm not sure redundancy across platters would help: consider that all the platters are useless if the spindle bearing goes, and that all the platters are useless if the seek motor stops seeking. Redundancy across platters only protects against a small subset of the problems a hard drive might experience, and it doesn't seem worth the trouble.

    23. Re:Wake me when they build it into the hard disk by seifried · · Score: 1

      It's not just storage, it's about caching in RAM. If my Linux box caches, say, one gig of data that happens to be shared amongst multiple (nearly) identical VMs, I will see a huge performance increase vs. trying to cache 20 gigs of data (one for each of the 20 VMs). If it's the exact same data, why would I want multiple copies floating around unless I explicitly ask for them (e.g. RAID, Time Machine, backups, etc.)?

    24. Re:Wake me when they build it into the hard disk by jcr · · Score: 2, Insightful

      what was one block replicated hundreds of times now becomes hundreds of blocks exhausting all storage.

      What? Why would that happen?

      If you have a block and a hundred COW pointers to it, and you modify one, then you get two blocks, with 99 references to the old one and one reference to the new one.
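
      A toy illustration of that counting, in Python. DedupStore and its methods are hypothetical names made up for this sketch, not ZFS internals; the store simply keeps one copy per distinct block plus a reference count:

```python
import hashlib


class DedupStore:
    """Toy reference-counted block store (illustrative only)."""

    def __init__(self):
        self.data = {}  # SHA-256 digest -> block contents
        self.refs = {}  # SHA-256 digest -> number of pointers to the block

    def write(self, block):
        key = hashlib.sha256(block).digest()
        if key not in self.data:
            self.data[key] = block
            self.refs[key] = 0
        self.refs[key] += 1
        return key

    def release(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:
            del self.refs[key], self.data[key]


store = DedupStore()
# One block, referenced a hundred times.
pointers = [store.write(b"same old block") for _ in range(100)]

# Copy-on-write: modifying through one pointer allocates one new block
# and drops one reference to the old one.
old = pointers[0]
pointers[0] = store.write(b"modified block")
store.release(old)

block_count = len(store.data)
ref_counts = sorted(store.refs.values())
print(block_count, ref_counts)  # 2 [1, 99]
```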

      -jcr

      --
      The only title of honor that a tyrant can grant is "Enemy of the State."
    25. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      "Eventually, the drive got full anyway, and we had to migrate to a larger disk." - Incorrect, you could have added the disk and expanded the volume, you didn't "have" to migrate to a new disk.

      "But since the copy operation de-compressed files on the fly we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive" - There is a SET parameter to control when decompression occurs; the default is something like two accesses within a day will cause decompression. But these settings govern the commit to disk; from a client perspective the decompression still occurs, as the server does the decompression in memory.

      "But there is hell to pay somewhere down the road." - Very likely true. With compression, when you do migrate off to an uncompressed volume, you are going to spend additional CPU cycles on decompression. How much compression there is, the number of files, the I/O of the system and the CPU will determine how much time it is going to cost you. I tend to recommend compression for user data, as a huge percentage of user data doesn't get accessed very often, so there is a big gain in disk savings, very infrequent decompression, and having already compressed data helps increase the effective backup rate.

    26. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      This is because you didn't use NetWare's tools to copy the files - the command line NCOPY, for example, would have copied the files in their compressed format..

      We were moving the server content to Linux. Having it in Novell's format would not have been useful.

      Fair enough, but I don't think you understand how file compression on a NetWare volume works: The server decompresses the file into memory when read in order to give the uncompressed contents to the requester, and would normally leave it uncompressed on the volume once closed, space permitting, to speed future access to it (re-compressing it if not accessed after a period of time whose default I don't remember - 7 days?). But it doesn't necessarily do that: The SET command "Decompress Percent Disk Space Free to Allow Commit" determines that, and would prevent what you're describing under normal circumstances: If there were insufficient free volume space to store the file uncompressed, the server would leave it in compressed form on the NetWare volume once the file was closed, the reasoning being that it's better to suffer a performance loss on future reads than to run out of volume space.

      Something else was going on, but I can't tell what from your post, only that what you describe shouldn't have happened under normal circumstances.

    27. Re:Wake me when they build it into the hard disk by wsloand · · Score: 1

      But, the design delays when you would have to buy more disk space. The problem you're referring to is a problem to a specific disk usage scenario. Not all problems are the same, and if you're planning to mass-edit identical files that are a) large enough to make a meaningful impact on your disk usage and b) being edited in a non-uniform way, then don't use the de-duplication feature or plan ahead.

    28. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      I don't know how one block can balloon to hundreds, but I do know how block-based dedupe can over-commit and lead to many counter-intuitive problems.

      Have a very large file and modify it by inserting one byte at the front. In an extent-based filesystem, a space conscious application could implement this modification in place by starting from the tail of the file and copying all of the content to its new location in a block-by-block fashion, until all content is shifted over and the new byte can be written. Sure, it would take gigabytes of write I/O to cause gigabytes of space exhaustion, but the application expected this gigabytes of write I/O to only create one block allocation...

      It would be interesting if a dedupe solution provided an option to disable this over-commit. Have the filesystem pre-allocate sufficient space for all file content, but simply skip a lot of redundant I/O by linking in the shared content regions and leaving the storage allocation blank. You could still get major performance gains by compressing block cache and disk channel I/O, but without the risk of suddenly exhausting the backing store.

      However, I'd really be interested to see someone use the rsync algorithm (or a tuned variant) for dividing file extents into byte phrases and storing deduped file representations which could allow insertion of a few bytes here or there, simply as an edit to the sequence of hashed phrase references. Add to this a log-based transaction model for recording these changes to the encoded representation, and you might have a very bandwidth-efficient way to replicate filesystems remotely.
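
      A simplified sketch of that idea in Python. This is NOT the real rsync algorithm, only a toy content-defined chunker in the same spirit: chunk boundaries are chosen from a rolling sum over the last few bytes, so they move along with the content when bytes are inserted. The window size, the mask and the function name are arbitrary choices made for this example:

```python
import hashlib

WINDOW = 16   # bytes covered by the rolling sum
MASK = 0x3F   # a boundary is declared when the low 6 bits of the sum are all ones


def chunks(data, window=WINDOW, mask=MASK):
    """Split data at content-defined boundaries using a rolling byte sum."""
    out, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # The decision depends only on the last `window` bytes, not on the offset.
        if i >= window - 1 and (rolling & mask) == mask:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


# Synthetic 4 KiB "file" and the same file with one byte inserted at the front.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
edited = b"x" + original

old_chunks = chunks(original)
new_chunks = chunks(edited)

# Chunks of the edited file that are not already stored for the original.
changed = set(new_chunks) - set(old_chunks)
print(len(old_chunks), len(new_chunks), len(changed))
```

      Because the boundaries follow the content, only the chunk or two around the inserted byte are new; everything after it hashes to chunks that are already stored, which is exactly what fixed-size blocks fail to achieve.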

    29. Re:Wake me when they build it into the hard disk by Hurricane78 · · Score: 1

      Why didn't you simply copy it to *another* drive with built-in compression?

      That will be $5000 then. Do you pay cash? ^^

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    30. Re:Wake me when they build it into the hard disk by this+great+guy · · Score: 1

      Imagine the amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics.

      Bad idea. Doing dedup in the drive electronics would:

      • Not allow dedup across multiple drives.
      • Not allow dedup of data blocks cached in memory (the OS would be unaware of duplicated blocks)
      • Waste disk I/O writes on duplicated blocks. When doing dedup in software, the OS doesn't even bother sending data blocks to the drive.
    31. Re:Wake me when they build it into the hard disk by TheRaven64 · · Score: 1

      More importantly, it doesn't introduce any new problems for ZFS. All of these corner cases that people are talking about already apply if you use snapshots or clones. Exactly the same copy-on-write mechanism is used for these as is used for deduplication. You already get some degree of deduplication with ZFS via these mechanisms. If you create a snapshot of a filesystem then it takes no space, but if you modify a file then the modification requires some space.

      --
      I am TheRaven on Soylent News
    32. Re:Wake me when they build it into the hard disk by jimicus · · Score: 1

      I could see it for write-only media.

      I had a CD writer like that once.

    33. Re:Wake me when they build it into the hard disk by julesh · · Score: 1

      Imagine the amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics. It could even do this quietly in the background.

      Not as good as installing extra processing power in your machine and doing it in the OS. Honestly. The primary advantage here isn't actually the saving of disk space. Nobody really cares about that too much.

      The main advantage is that if two processes have two files with identical blocks in them, and map those files into memory (or just read them so they're cached), if they're deduped you'll end up with both processes having copy-on-write references to the same memory block. The big win here is in saving RAM, not disk space. And that requires the OS to understand and be aware that the deduplication has happened.

    34. Re:Wake me when they build it into the hard disk by asaul · · Score: 1

      Well, that's the nature of de-duplication. Either your data is full of duplicates, or it isn't. If you are not sure what your data will do, you probably should assume no reduction until you have actual numbers.

      We had a disk library vendor try to sell us on the dedupe in their VTL, which we put to work backing up Exchange data. They told us stories of huge compression ratios, and how Exchange mailboxes compress so easily that we could easily fit weeks of backups onto it. The fact was the machine had already been bought by management, and we just looked after the backups, not Exchange, so we set it up as we were ordered to.

      Turns out their examples were all based on sites with 1G mailboxes or more. Our bizarre setup with 50M mailboxes and thousands of employees was not that good for dedupe, because all the repeat mails got moved to people's PSTs immediately or their mailboxes would fill. So after a week of backups the VTL was full and we were stuffed. So then we had to move it all off to tape....

      Turns out while it did backup and run its de-dupe fine, unduping the data back for restoration was hopeless - like less than 1MB/s throughput, and so it took days to move off.

      And that experience is why I now believe that de-dupe for backups is fundamentally flawed. You want a backup of all your data - not the data some borken firmware decides is unique. Any corruption, bang, all backups useless because they are all missing that highly duplicated block.

      --
      "If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
    35. Re:Wake me when they build it into the hard disk by hoggoth · · Score: 1

      > I could see it for write-only media.

      The best thing about write-only media is it has infinite capacity. You can just keep writing to it forever and it never fills up.

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    36. Re:Wake me when they build it into the hard disk by icebike · · Score: 1

      >The big win here is in saving RAM, not disk space.

      Then why are we talking about it in relation to a file system?

      --
      Sig Battery depleted. Reverting to safe mode.
    37. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      Not being able to modify files due to running out of space is in no way comparable to not being able to read them...

      Modifications have a potential to get a disk full error anyhow (consider sparse files).

    38. Re:Wake me when they build it into the hard disk by ZorkZero · · Score: 1

      What you seem to be describing is file-level deduplication, which is not what is being described here.

    39. Re:Wake me when they build it into the hard disk by Znork · · Score: 1

      Yup, but competent ISPs that run such instances usually already share those files, either by using higher-level virtualization that basically has only one instance of the OS files, or by using overlay filesystems. Backup drives are an option, but again, large places don't often back up actual PCs; it's easier to wipe and reinstall, and the user documents are stored centrally.

    40. Re:Wake me when they build it into the hard disk by Znork · · Score: 1

      without the hassle of "stripping" the VM image.

      Even with de-duplication you need to strip the VM images if you want to maximize savings; the more software you have the more rapidly and thoroughly you'll get desynchronization on patch levels and such.

      because dedup also delivers other advantages by reducing physical disk IO

      That is an interesting consideration, yes. Have you seen any thorough review of such benefits in various situations (gains vs. simply adding cache memory in other parts of the chain, etc)?

      Thirdly, because enterprise storage costs a lot more than that

      Yep. Many enterprise storage architectures I've seen aren't exactly optimized for customer cost, but rather for maximizing the storage vendor's sales revenue.

      NetApp have quite a few white papers and blogs

      Mmm, I'll have to take a look at some of them. I'm sure it makes sense when, as it often is, you have the worst-case scenario in storage, but I'm not sure it still makes sense if there's even an effort at improving the baseline instead.

    41. Re:Wake me when they build it into the hard disk by drsmithy · · Score: 1

      Even with de-duplication you need to strip the VM images if you want to maximize savings; the more software you have the more rapidly and thoroughly you'll get desynchronization on patch levels and such.

      Large amounts of data in VM images are always going to be the same in a typical environment - essentially all the OS files. They might get out of sync for brief periods if one machine is updated before another, but they'll come back into sync once all machines are brought to the same baseline (which, outside of extraordinary circumstances, they should be).

      That is an interesting consideration, yes. Have you seen any thorough review of such benefits in various situations (gains vs. simply adding cache memory in other parts of the chain, etc)?

      I seem to recall reading a NetApp paper investigating the benefits. The biggest ones come from better cache usage (thus reducing physical disk I/O) and replication (since deduped blocks only need to be replicated once).

      Yep. Many enterprise storage architectures I've seen aren't exactly optimized for customer cost, but rather for maximizing the storage vendor's sales revenue.

      If you need the features, you have to pay the price. It's the way of the world.

      I'm sure it makes sense when, as it often is, you have the worst-case scenario in storage, but I'm not sure it still makes sense if there's even an effort at improving the baseline instead.

      The advantage of dedup is it's automatic and constant. "Improving the baseline" requires people-time investment both at the beginning, and as ongoing maintenance, to say nothing of the inconvenience of having to work with "stripped" installations. I'm far more interested in my admins doing productive work than I am them trying to shave a few dozen MBs out of an OS install.

    42. Re:Wake me when they build it into the hard disk by Anonymous Coward · · Score: 0

      You're still hopelessly incompetent. You could have told NetWare not to store the decompressed file on disk, just stream it out when it was read. In one command.

    43. Re:Wake me when they build it into the hard disk by julesh · · Score: 1

      Then why are we talking about it in relation to a file system?

      Because the RAM you save is either in file cache or in pages of mmap'd files, and it's a process in the file system that saves it.

  12. Nice, but can it ... by Anonymous Coward · · Score: 0

    ... strategically populate the available space with duplicates of commonly read blocks, for increased fault tolerance and performance?

    1. Re:Nice, but can it ... by Per+Wigren · · Score: 1

      ... strategically populate the available space with duplicates of commonly read blocks, for increased fault tolerance and performance?

      yes, it can.

      --
      My other account has a 3-digit UID.
    2. Re:Nice, but can it ... by Anonymous Coward · · Score: 0

      GP's observation was that the so-called "free" space can be harnessed in such a manner that a storage pool needn't have any "unused" or empty blocks at all. /Caching/ is not the answer here.

  13. This is the year of Solaris on the desktop by jotaeleemeese · · Score: 1

    Where did I hear that one?

    --
    IANAL but write like a drunk one.
  14. What's the point? by Mask · · Score: 1

    The amount of resources it reportedly takes makes this not so practical.

    What would one want deduplication for? The cost of disk storage has two big elements - speed (latency & throughput) and backup.

    It does not seem that this technology would help much in the speed department; it might actually hurt. Managing copy on write has several potential costs. It may help backup if the backup program knows the fine details of deduplication, but that means that old backup software will have to be replaced.

    It reminds me of the compressed file system I used to have on my old SLS Linux PC which had a small disk (1992 if memory serves me right). It was dog slow to run X11 on it. I have not seen a compressed file system since; there was no need. Disk storage grows much faster than my need for data.

    1. Re:What's the point? by myowntrueself · · Score: 1

      It reminds me of the compressed file system I used to have on my old SLS Linux PC which had a small disk (1992 if memory serves me right).

      Soft Landings from DOS bailouts!!! Yaaay!

      I had a Windows 3.x PC on which I was coding some simple turbo pascal stuff to do pretty graphics.

      This Windows PC didn't have a lot of disk so I was using Stacker (or some such disk compression thing).

      One time one of my programs crashed. Just a simple graphics thing, but it crashed the PC, had to hit the reset button.

      Erm... sadly the disk compression did not survive this.

      A friend at university was *just* getting into this thing he called "Linux" and it ran on PCs, so I thought I'd give it a go. It was the SLS distro and he had a pile of 5.25" floppies. My PC had a 3.5" floppy. So I sat in the computer center for an afternoon copying disks...

      And that was how I got into Linux; a soft landing from a DOS bailout :D

      I never had to run disk compression under Linux though, never realised SLS supported that. Cool.

      --
      In the free world the media isn't government run; the government is media run.
    2. Re:What's the point? by TheRaven64 · · Score: 1
      The canonical use case for dedup is backup servers. Imagine you have one Solaris file server serving 40 workstations. Each of these does a full backup of its 10GB Windows (or Linux, or whatever) install. You then have 400GB of data, but only about 12GB of unique data. Dedup lets you only store this 12GB, and you can store it with n redundant copies so it's easier to recover in cases of partial hardware failure. Each workstation then does incremental backups, copying files with any changes to the server. The server dedups these and only stores the changed blocks.

      The clients can be using NFS, CIFS, or iSCSI for the backup, and the server has a complete disk image (and periodic snapshots) of the clients' disks, but uses a tiny fraction of the space that this may require.

      Oh, and with regard to this:

      It may help backup if the backup program knows the fine details of deduplication

      The entire point of dedup in the FS layer is that the backup software can be completely unaware of it. As long as it produces a copy of the data on the server, the server will handle turning it from a full backup or a per-file incremental backup into a per-block incremental backup.
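
      A minimal sketch of that idea, not ZFS's actual implementation: a toy block store that keys each block by its SHA-256 digest and keeps one copy per digest. The 4 KiB block size, the `DedupStore` class and the `make_image` helper are all assumptions made up for this illustration, and the "images" are tiny compared with a real 10GB install.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by SHA-256."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name, data):
        digests = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # stored only if unseen
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_bytes(self):
        # What the clients think they stored.
        return sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)

    def physical_bytes(self):
        # What the server actually keeps.
        return sum(len(b) for b in self.blocks.values())


def make_image(host_id, n_blocks=100):
    # Hypothetical workstation image: shared "OS" blocks plus one
    # host-specific block at the end.
    shared = b"".join(bytes([i % 256]) * BLOCK_SIZE for i in range(n_blocks - 1))
    unique = f"host-{host_id}".encode().ljust(BLOCK_SIZE, b"\0")
    return shared + unique


store = DedupStore()
for host in range(40):
    store.write(f"ws{host}.img", make_image(host))

print("logical bytes: ", store.logical_bytes())
print("physical bytes:", store.physical_bytes())
```

      The backup clients here just call `write` with a full image; nothing on the client side knows that the store collapses the shared blocks.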

      --
      I am TheRaven on Soylent News
    3. Re:What's the point? by rcolbert · · Score: 1

      And the problem is that fixed-length blocks don't play nice with backups, which tend to be big, long tar files or other similar entities. Once the first difference is encountered, all the following bytes shift by some arbitrary amount. Normal data change doesn't often conveniently occur in fixed lengths. NetApp A-SIS is a classic example, and typically it would yield about 10-15% data reduction. For the mathematically challenged that's far less than 2:1. The obvious question is why even bother? Why not just compress on the fly? It's far easier. That's why NetApp switched to variable-length blocks for their VTL product, which has all of its own issues.

      As for VM's, there is a bit of a misconception about their dedupe rates. With fixed-length blocks you are very likely to find much of the effect comes from deduplicating whitespace. Ten billion consecutive zeroes are pretty easy to condense. Fixed length dedupe of VM's tends to be in the single digit ratios, whereas variable length dedupe is often in the high double or low triple digit ratios.
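
      The shifting problem can be shown with a small sketch. It assumes 4 KiB fixed blocks and random test data, and `block_hashes` is a helper invented for this example; it compares how many block hashes survive an in-place overwrite versus a one-byte insertion near the start of a stream.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size


def block_hashes(data):
    """Set of SHA-256 digests of the fixed-size blocks of data."""
    return {
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * BLOCK_SIZE))

# Same length, one byte changed in place: only the first block differs.
overwritten = original[:10] + bytes([original[10] ^ 0xFF]) + original[11:]

# One byte inserted: every later byte moves, so every block boundary
# now cuts the data at a different place.
inserted = original[:10] + b"X" + original[10:]

base = block_hashes(original)
shared_overwrite = len(base & block_hashes(overwritten))
shared_insert = len(base & block_hashes(inserted))

print("blocks still shared after overwrite:", shared_overwrite)
print("blocks still shared after insertion:", shared_insert)
```

      Variable-length (content-defined) chunking avoids this by choosing block boundaries from the data itself, which this sketch does not attempt.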

    4. Re:What's the point? by Anonymous Coward · · Score: 0

      Obviously this is a case of YMMV.

      For server VM's, which may start identical but then grow into a highly divergent state, dedupe is probably only going to have a moderate effect.

      On the other hand, if you're doing VDI or even just pools of VM's for a testlab where the level of customization in each VM is low, then de-dupe could be a huge win, both in terms of raw space usage and also in terms of improved cache hit rates.

    5. Re:What's the point? by phoenix_rizzen · · Score: 1

      Create a backup server that does remote backups of hundreds of Linux and Windows servers and what do you get? Multiple copies of identical OS system files all taking up space. Add dedupe and you can cut the storage requirements by a whole lot.

      Create a VM host server running multiple VMs using the same guest OS and what do you get? Multiple copies of identical OS system files all taking up space. Add dedupe and you can cut the storage requirements by a whole lot.

      There are other situations where you end up with lots of identical files/blocks on a storage pool. Dedupe may not be useful on a single OS system, but that doesn't make it useless.

  15. 404, add spx by buchner.johannes · · Score: 1
    --
    NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
  16. Open Source Cures Cancer by sjbe · · Score: 1, Insightful

    Use open source, get cutting edge things.

    Like a cutting edge CAD packages, games, financial management and office suites? Good thing we had you to tell us that open source will solve our every problem just by virtue of it being open source. I'm sure every print shop is going to dump Photoshop for GIMP, every finance firm will dump Excel for Openoffice Calc and every engineering firm will dump AutoCAD for... what exactly?

    Maybe, just maybe open source isn't the answer for everything after all...

    1. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 5, Funny

      Like a cutting edge CAD packages, games, financial management and office suites?

      Umm, dia, nethack, perl, emacs?

    2. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 0

      You can run lots of windows software including photoshop, excel, autocad, and many games using Wine.

    3. Re:Open Source Cures Cancer by selven · · Score: 1

      Like a cutting edge CAD packages, games, financial management and office suites?

      Like a hex editor, text adventures, a hex editor, cat, and did I forget to mention a hex editor?

    4. Re:Open Source Cures Cancer by Anpheus · · Score: 4, Funny

      But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly? Not the people who made the software. Are you going to email the Wine mailing list and then, when they fail to deliver a timely solution for free, tell the client that open source is to blame?

      At least when I buy software, or make purchasing decisions from a business standpoint, knowing that the company will stand behind the product and our implementation of it is more important than trying to pursue some ideal about information and its anthropomorphized desire to be free.

    5. Re:Open Source Cures Cancer by frankm_slashdot · · Score: 1

      I breathe a sigh of relief every time someone combats the "use open source" with "blah blah blah, AutoCAD, blah blah blah". It's like you're reading my mind.

      AutoCAD has barely acceptable performance when running on *great* hardware, let alone a virtualized instance or whatever concoction the avoid-windows-at-all-costs crowd would come up with. Not to say the grandparent-commenter falls into this category but very often people do.

      When people step into my computer room they usually note the PC, macbook, G5 and solaris box and give me the 50 questions. My usual response is "AutoCAD, travel, Final Cut/Logic/everyday use & ZFS storage pool".

    6. Re:Open Source Cures Cancer by Niac · · Score: 1

      At least when I buy software, or make purchasing decisions from a business standpoint, knowing that the company will stand behind the product and our implementation of it is more important than trying to pursue some ideal about information and its anthropomorphized desire to be free.

      You're new here, right?

      Since when is the promise of support the actuality of support? In my experience supporting enterprise and operations projects, it's often the free software that has the best support (by way of community). Just because you're paying doesn't mean you're getting anything of real value.

      --
      http://gabrielcain.com/
    7. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 1, Insightful

      > But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly?

      Nobody. Seriously, it's pretty rare to get decent support.

      And if all I need is someone to blame, Microsoft works just fine. Everybody has heard them blamed for one reason or another and the execs rarely know or care what we use...

    8. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 0

      Kids these days...

      Umm like C, C, C and C?

    9. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 0

      We had exactly this problem with Microsoft. Fortunately we could call them. Unfortunately, they told us they weren't interested in fixing the bug, even though we offered to pay.

      So we switched to Open Source - next time we are in that situation, we can fix the problem ourselves, or hire a contractor to do it.

    10. Re:Open Source Cures Cancer by jlmale0 · · Score: 1

      While I appreciate the sentiment, it only applies to Big, Irreplaceable (tm) things. Tape libraries would be an example. Office software would not. Even if you have software that's mandatory, there are other ways to mitigate risks. Clustered servers for fail-over. Replication. Alternate installs in the form of development and test environments. If Wine breaks in the middle of my big project, I may research the issue and debug the problem, but if I'm under a time crunch, I'm just going to move to a working machine. Yes, paying for support is one valid risk mitigation strategy, but it's far from the only one.

    11. Re:Open Source Cures Cancer by sjbe · · Score: 1

      AutoCAD has barely acceptable performance when running on *great* hardware...

      Yep, and that's not even getting into what I would regard as the "serious" CAD packages like Catia or ProE. Ironically, one of the biggest, PTC's Pro/ENGINEER, DROPPED linux support because there were apparently too few adopters. CAD is one of the applications that is least amenable to open source because there is so much specialized mathematical knowledge required to do the development.

      I'm an engineer but I'm also an accountant and the lack of serious financial software for bookkeeping is also a problem. Yes there are a few native linux bookkeeping packages but NOBODY uses them. Nobody will either since it's easy to find bookkeepers that know Quickbooks or Peachtree. I can virtualize Quickbooks in some cases or run it on a Mac but there is no native substitute on linux.

    12. Re:Open Source Cures Cancer by sjames · · Score: 3, Interesting

      The same people you call when your proprietary system breaks and you discover that the official tech support people can't find their posterior with both hands and a map. Most cities have a number of grief counselors ready to support you in your time of need. If it was really critical, try the suicide hotline.

    13. Re:Open Source Cures Cancer by JoeMerchant · · Score: 1

      Like a cutting edge CAD packages, games, financial management and office suites?

      Like a hex editor, text adventures, a hex editor, cat, and did I forget to mention a hex editor?

      KDE for Win absolutely rocks with Okteta, and, ummm... yeah, there's nothing like Okteta in Windows, well, until KDE for Win came around...

    14. Re:Open Source Cures Cancer by ckaminski · · Score: 1

      You've obviously never had the luxury of paying 25%/year/seat for enterprise level support of something like Pro/ENGINEER or Catia or AutoCAD. I guarantee you those guys, when you call them and ask them to jump, they ask how high. As long as you're paid up, of course.

      Disclaimer: ex-PTC employee with a long turn in the development group beating off support guys trying to resolve customer issues.

    15. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 0

      But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly? Not the people who made the software.

      Why not? Contrary to the brainwashing you have believed, there are many companies out there specifically to support these kinds of things. Many of the larger open source projects have actually spawned entire support companies from within their own development base, e.g. Apache, JBoss, MySQL, SUN, etc.

      Are you going to email the Wine mailing list and then, when they fail to deliver a timely solution for free, tell the client that open source is to blame?

      Why would you advocate lying to the customer anyway? If the project is delayed because an app that I chose failed, it doesn't matter to the customer if I have a 5 minute support contract with Oracle or 5 day support via mailing list with Postgres. All the customer cares about is that I screwed up. No amount of support contracts are going to cover that. No amount of lying to the customer that it was Microsoft's fault is going to save me from their displeasure. The customer is not as stupid as you want them to be.

      At least when I buy software, or make purchasing decisions from a business standpoint, knowing that the company will stand behind the product and our implementation of it is more important than trying to pursue some ideal about information and its anthropomorphized desire to be free.

      It has nothing to do with free and everything to do with economics.

      Take a real close look at the number of times you have had to use that support contract of yours. If your organization is anything like mine, you have called support 1 to 3 times a year which is a massive waste of money.

      Let's assume for a minute that we are talking a MS support contract for Office at $100K per year and that you had to call them 3 times. So each call is roughly 33K plus incidentals.

      If, on the other hand, we spent our income the smart way and bought a per-call contract, we would get the exact same level of coverage for $500 per incident per hour. At that rate, we would need to be on the phone with MS for about 66 hours for each of those incidents to equal the amount of money you wasted on that contract.

      Frankly, if you are having to use your support contract more than about 5 times a year, then your staff is not doing their jobs and need to be replaced, or at the very least supplemented.

    16. Re:Open Source Cures Cancer by david_thornley · · Score: 1

      Which isn't how most proprietary software works, and I'm not sure you couldn't find a similar organization for at least some critical F/OS software.

      Not to mention that, when I've been in that support role, there have been times I wasn't sure I was going to be able to come up with a solution. Then again, that company only charged 10%/year.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
    17. Re:Open Source Cures Cancer by Anpheus · · Score: 1

      My post was in reply to running Software-That-Does-Not-Exist-For-Linux with Wine, and trying to get application support for that solution. For example, switching to running AutoCAD on Wine.

    18. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 0

      But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly? Not the people who made the software. Are you going to email the Wine mailing list and then, when they fail to deliver a timely solution for free, tell the client that open source is to blame?

      At least when I buy software, or make purchasing decisions from a business standpoint, knowing that the company will stand behind the product and our implementation of it is more important than trying to pursue some ideal about information and its anthropomorphized desire to be free.

      Interesting ideal you got there, ever actually tried to get support from the people your company buys from? How about in the last few years?

  17. I Heard ISPs Were Doing This by sexconker · · Score: 1

    I Heard ISPs Were Doing This With Broadband.
    Simply duplicate your advertised pipe across 100 subscribers.

    If they want to access it at the same time, just shift stuff around.

    If they want to access it at the same time, and you don't have room to shift stuff around, just impose caps and bill them progressively out the ass.

  18. well ... by wsanders · · Score: 1

    There are enough tales of woe in the discussion groups of ZFS file systems that have melted down on people that I would not start shorting the midrange storage companies' stock just yet. I myself have an 18TB ZFS filesystem on a X4540 and it was brought to a standstill a few weeks ago by one dead SATA disk. Didn't lose any data, and it might be buggy hardware and drivers, but still, Sun support had no explanation. That should not happen!

    I'm still a ZFS fanboy though - for about $1 per GB how can you lose. The host is a backup / virtual tape library server so it's not super high availability, and it's hella fast. No problem stuffing data into it at 2 X 1000baseT wire speed.

    --
    Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
    1. Re:well ... by greg1104 · · Score: 1

      I've seen the same thing happen on a X4500. People need to understand that SATA disks and chipsets are fundamentally weak at error reporting and recovery. There's only so much you can do about that at the driver or OS level if a problem drives the chipset crazy. You really need hardware optimized for that purpose, like a mature and battle-tested RAID controller. The controller hardware on Sun's big SATA storage servers is just not customized enough for that purpose to handle all of the weird things dying drives can do to take out the other devices sharing resources on each drive bus.

    2. Re:well ... by TrevorDoom · · Score: 2, Interesting

      My company used a X4500 and we discovered the bug that caused Sun to make the X4540 - the Marvell SATA chipset in the X4500 had a serious bug in firmware that was exacerbated by the Solaris X86 Marvell chipset driver.
      Under heavy small block random IO intermingled with heavy sequential large block IO, the box would kernel panic and hang - only a power cycle would reset the box.

      Sun ended up refunding us the cost of the servers and providing us exceptionally large incentives to purchase Sun StorageTek storage.

      It wouldn't surprise me if the X4540 would have similar issues because they were rushing to replace the X4500 to try and minimize the possibility of bad PR over the X4500 being amazingly unstable.

      This is why I'll be waiting for FreeBSD to support this: they will probably have better SATA chipset drivers, which should reduce the chances of the system hanging the way it does with the Solaris kernel drivers for the SATA chipset (never mind that it's a SATA chipset that Sun put into their own board).

    3. Re:well ... by swordgeek · · Score: 1

      We're starting to roll out ZFS in our (large!) enterprise. We've played with it in the lab, and in our internal support systems (e.g. documentation and authentication systems) enough to be comfortable with it.

      However, you nailed the biggest weakness with it in five words:

      "Sun support had no explanation."

      We are a BIG Sun shop, and this has been our general experience with Sun in the last two years or so. Sun is bleeding competence faster than they can fire it. For every good person they lay off (because tech staff are expensive--especially tech support staff), two more will quit in disgust.

      I'm a big Sun fan - have been since SunOS 4 was the new kid on the block. I also think that ZFS is the third-best thing since sliced bread (if they added volume shrinking and online relayout, it'd be #1). Solaris 10, for all of its warts, is still the best Unix on the market right now. However, I don't see Sun surviving much longer--enterprises with a lot of investment and loyalty are starting to turn away in frustration.

      --

      "People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
  19. I worked with De-duplication by Anonymous Coward · · Score: 0

    You are losing reliability. Some hashes will collide on some computer somewhere.
    The idea is that if you assume that blocks on HD are random then odds of hitting Hash collision are tiny.
    But data is not random - humans and programs make it non-random!

    Here is an example:
    What are the odds that 256 people going across the street will all be men?
    That would be 2 ^ -256 - that will never happen.
    But guess what? Imagine that you see a parade and 300 marines are marching by...
    It just happened.
    Do you want to bet your server data on that?

    1. Re:I worked with De-duplication by Rising+Ape · · Score: 1

      The 2^-256 is only the probability if all of the people are independent. That clearly isn't the case in your example.

      Is there an equivalent scenario for hash based deduplication? The only ones I can think of involve poor hash algorithms.

    2. Re:I worked with De-duplication by Anonymous Coward · · Score: 0

      You got it right. Humans/programs organize stuff thus making blocks not independent

    3. Re:I worked with De-duplication by Anonymous Coward · · Score: 0

      You need to read the posts way up above that show the math for the probability. Given the advantages that de-dupe offers, it's nearly always worth it.
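
      For a rough sense of scale, here is a back-of-the-envelope sketch of the birthday bound. It assumes that SHA-256 outputs for distinct blocks behave like independent, uniformly random 256-bit values, which is exactly the assumption being argued about in this thread. The pool size (2^35 blocks, roughly 4 PiB at 128 KiB per block) is a made-up figure for illustration, and `collision_bound` is a helper invented here.

```python
from fractions import Fraction


def collision_bound(n_blocks, hash_bits=256):
    """Birthday upper bound on the chance that any two of n_blocks
    distinct blocks share a hash: n(n-1)/2 pairs, each colliding
    with probability 2**-hash_bits."""
    return Fraction(n_blocks * (n_blocks - 1), 2 * 2 ** hash_bits)


n = 2 ** 35  # hypothetical number of unique blocks in the pool
p = collision_bound(n)

print("upper bound on collision probability:", float(p))
```

      The bound only covers accidental collisions under that randomness assumption; it says nothing about a hash function that turns out to be weak.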

  20. SAN, ZFS with dedupe is not a backup system by caseih · · Score: 1

    Don't mistake in-filesystem deduplication and snapshots for a backup system. It's most certainly not backup and if you treat it as such you will eventually be very sorry. A SAN with ZFS, snapshots, and deduplication features is at best an archive, which is distinct in form and purpose from a backup. Still very useful, though. Ideally you have both archive and backup systems. To get a feel for the difference, consider that an archive is for when a user says, "I overwrote a file last week sometime. Can you recover the version before I made this change or saved over this file?" Whereas a backup is for recovering an entire system when there's a catastrophic failure (like a SAN dying). Very distinct things. Both are useful.

    I get strange looks when I tell people that a Time Capsule is not a backup. Nor is a single Time Machine external disk. Now 2, 3 or even 4 external disks could constitute a backup (and as a bonus with Time Machine an archive also).

  21. Building it in makes no sense by saleenS281 · · Score: 1

    First, why would you want it built into a hard drive? Your deduplication ratio would then be limited to what you can store on one drive. The drive would have no way to reference blocks on other drives in the same system. Doing it in software allows you to reference (in this case) all data within the entire zpool. That could be petabytes of storage (theoretically it could be far more, but that's probably the realistic limit today due to hardware/performance constraints).

    As for your "hell to pay later", that's not true for two reasons. First, there is no "modify in place". All data is allocated from new blocks; that's how a copy-on-write filesystem works. If it's "updated" you'd be allocating new blocks. If you're concerned with filling a pool up completely, you can put quotas in place to prevent it.

    Second, if you "run out of space", you just add new drives to the raid group and continue on your merry way. You can grow a zpool on the fly.

    1. Re:Building it in makes no sense by greg1104 · · Score: 1

      There is a real concern here, I just think the explanation was too far removed from the dedupe case for it to be obvious. Imagine the following: you have a 100GB drive. You copy a 75GB file onto it. Now you copy the same file, but with a different name, onto the disk again. This takes up some tiny amount of marginal space, because all of the blocks are duplicates, and it looks like there is still around 25GB free.

      Now, imagine someone updates most of one of the two copies in-place, not realizing the shared situation (maybe they can't even see the other copy). Since the file is already on the drive and they're updating it, not creating a new file, the user would have no reason to believe this operation could fail due to running out of disk space. But it will--as they update blocks, copy-on-write is going to force allocating new space. The disk will run out of space around 1/3 into the update operation, leaving you with a junk file containing neither the original (unless you took a snapshot first, which is not normally a user operation) nor a good copy of the updated file. That's a business disaster right there. Quotas don't improve this situation. And I'd bet against whatever application was in the middle of the update being able to recover after more space is added to the pool so that it can complete updating the file.

      Since applications and users will sometimes check for disk space and/or available quota, in this same situation without a deduped setup the second file would have obviously never fit in the first place, and it would have had to go somewhere else instead. You can easily see how some would consider this preferable to failing oddly in rare (but easily plausible) situations. Deduping is ultimately a form of disk space oversubscription, and it's unrealistic to expect applications or users are going to consider all of the ramifications of how that might go wrong in a subtle way.
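
      The 100GB scenario can be walked through with a toy model. This is only a sketch: one "block" stands for 1GB, the `Pool` class is invented for the example, and it models reference-counted deduped blocks in general rather than ZFS's real allocator.

```python
class Pool:
    """Toy deduplicating pool: one stored block per distinct content."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.refcount = {}  # block content -> number of references

    def used(self):
        return len(self.refcount)

    def add_ref(self, content):
        if content not in self.refcount:
            if self.used() >= self.capacity:
                raise OSError("pool out of space")
            self.refcount[content] = 0
        self.refcount[content] += 1

    def drop_ref(self, content):
        self.refcount[content] -= 1
        if self.refcount[content] == 0:
            del self.refcount[content]


pool = Pool(capacity=100)               # the 100GB drive
file_a = [f"orig-{i}" for i in range(75)]
for content in file_a:                  # first 75GB copy
    pool.add_ref(content)
file_b = list(file_a)
for content in file_b:                  # second copy: all duplicates
    pool.add_ref(content)
free_after_copy = pool.capacity - pool.used()

updated = 0
try:
    for i in range(len(file_b)):        # "in-place" update of the copy
        new = f"new-{i}"
        pool.add_ref(new)               # copy-on-write allocates a new block
        pool.drop_ref(file_b[i])        # old block stays: file_a still uses it
        file_b[i] = new
        updated += 1
except OSError:
    pass

print("free after the duplicate copy:", free_after_copy)
print("blocks updated before running out of space:", updated)
```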

    2. Re:Building it in makes no sense by saleenS281 · · Score: 1

      The write will fail from the start and alert the user to an out of space situation, it won't "write part of the file and fail". Theoretical situations are great, right up until you actually take the time to figure out what the filesystem does in the real world.

    3. Re:Building it in makes no sense by greg1104 · · Score: 1

      The way most application interactions happen, it's impossible for a standard filesystem to distinguish in advance between a series of writes that will update a small enough subset of the file that it should work, and ones where eventually enough blocks will be updated that it will run out of space. If you reject all possible writes that *could* run out of space eventually if the user keeps going, right from the start, that's not a good answer either. You'll still be in a position where you're rejecting possibly reasonable behavior, in a way that won't easily make sense to the user.

    4. Re:Building it in makes no sense by Nevyn · · Score: 1

      leaving you with a junk file

      That's why the minimum recommendation for updates like that is to "write to tempfile, close, check that close is happy and then rename". Unless you can deal with someone pulling the power at any point as you update.
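
      A minimal sketch of that write-then-rename pattern, assuming a POSIX-like filesystem where renaming within one directory is atomic. The `atomic_write` name is made up for this example; the `fsync` call is an extra precaution that goes beyond the steps listed above.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace path with data so readers see the old or the new file,
    never a half-written one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data to disk before renaming
        # Leaving the with-block closed the file; an error from write,
        # fsync or close (e.g. out of space) lands in the except below.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

      If the pool fills up partway through, the failure hits the temporary file and the original stays intact.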

      --
      ustr: Managed string API with ave. 44% overhead over strdup(), for 0-20B
  22. NTFS has bit level dedup by Cur8or · · Score: 0

    Just store one 0 and one 1. Then just store references to each from in the bits.

    --
    Winkey shortcut mapping for 64bit windows. WinKeyPlus
  23. Par for the course.. by Junta · · Score: 4, Interesting

    Any filesystem implementing copy-on-write at all, data dedupe, and/or compression already carries the risk of exhausting oversubscribed storage due to unanticipated compression ratios or uniqueness. It's a reason why you have to be pretty explicit to NetApp filers implementing these features that you are accepting the risk of exhausting allocations if you actually make use of these features to the point of advertising more storage capacity than you actually have.

    You don't even need a fancy filesystem to expose yourself to this today:
    $ dd if=/dev/zero of=bigfile bs=1M seek=8191 count=1
    1+0 records in
    1+0 records out
    1048576 bytes (1.0 MB) copied, 0.00426769 s, 246 MB/s
    $ ls -lh bigfile
      8.0G 2009-11-02 20:06 bigfile
    $ du -sh bigfile
    1.0M bigfile

    This possibility has been around a long time and the world hasn't melted. Essentially, if someone is using these features, they should be well aware of the risks incurred.

    --
    XML is like violence. If it doesn't solve the problem, use more.
    1. Re:Par for the course.. by bendodge · · Score: 1

      Could I get some car analogies with that?

      --
      The government can't save you.
    2. Re:Par for the course.. by Anonymous Coward · · Score: 0

      Try this with /dev/urandom.
      Different result, Different Perception.Different Inclination, suitts.

    3. Re:Par for the course.. by odie_q · · Score: 2, Informative

      The trick isn't using /dev/zero, the trick is using the seek parameter. The dd command skips nearly 8 GiB into a newly created file and writes something there. This creates a file that is 8 GiB large, but with no data (not zero, just nothing at all) in the first 8191 MiB. Therefore, the system doesn't actually write anything there, and doesn't even allocate the storage. If you read from these blocks, you will get generated zeros. This is called a sparse file.
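
      The same effect can be sketched from Python. This assumes a Unix-like system whose filesystem supports sparse files and reports `st_blocks` in 512-byte units (as Linux does); `make_sparse` is a helper invented here, and the hole is kept to 63 MiB so the sketch stays cheap on filesystems without sparse support.

```python
import os
import tempfile

MIB = 1024 * 1024


def make_sparse(path, hole_mib, data_mib=1):
    """Seek past hole_mib MiB without writing, then write data_mib MiB."""
    with open(path, "wb") as f:
        f.seek(hole_mib * MIB)             # like dd's seek=: nothing is written here
        f.write(b"\0" * (data_mib * MIB))  # like count=1 with bs=1M


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "bigfile")
    make_sparse(path, hole_mib=63)
    st = os.stat(path)
    apparent = st.st_size          # what ls -l reports
    allocated = st.st_blocks * 512  # roughly what du reports

print("apparent size (bytes): ", apparent)
print("allocated size (bytes):", allocated)
```

      On a filesystem with sparse-file support the allocated figure should come out far smaller than the apparent one; without it, the two will be about the same.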

      --
      ...ceterum censeo Carthaginem esse delendam.
    4. Re:Par for the course.. by greed · · Score: 1

      And it's real fun to do that 'dd' on a filesystem which doesn't support sparse files.

      It takes a LOT LOT LOT LOT LOT longer: the 'seek' actually extends the file on-disk by the specified amount, so it really does take up the 8.0GB.

      I like leaving large sparse files around to trap bad backup software before people put their faith in it.

    5. Re:Par for the course.. by Anonymous Coward · · Score: 0

      Use count=0 to get an even smaller file (try it!)

  24. I tried this on my RAID system by ljw1004 · · Score: 1

    I tried this on my RAID-1 system and it got converted to RAID-0.

  25. File-level? Block-level? by Anonymous Coward · · Score: 0

    Pssht! Not good enough, Sun. I require bit-level. I won't be satisfied until I can create a zpool wherein all my data are deduped down to one 1 and one 0.

  26. Cause and effect by scanrate · · Score: 1

    And the modification of a duplicate block will generate not only a copy-on-write fault but also a lawsuit by whoever owns the COW patent.

  27. There are three types of files. by Animats · · Score: 5, Interesting

    I'd argue that file systems should know about and support three types of files:

    • Unit files. Unit files are written once, and change only by being replaced. Most common files are unit files. Program executables, HTML files, etc. are unit files. The file system should guarantee that if you open a unit file, you will always read a consistent version; it will never change underneath a read. Unit files are replaced by opening for write, writing a new version, and closing; upon close, the new version replaces the old. In the event of a system crash during writing, the old version of the file remains. If the writing program crashes before an explicit close, the old file remains. Unit files are good candidates for unduplication via hashing. While the file is open for writing, attempts to open for reading open the old version. This should be the default mode. (This would be a big convenience; you always read a good version. Good programs try to fake this by writing a new file, then renaming it to replace the old file, but most operating systems and file systems don't support atomic multiple rename, so there's a window of vulnerability. The file system should give you that for free.)
    • Log files. Log files can only be appended to. UNIX supports this, with an open mode of O_APPEND. But it doesn't enforce it (you can still seek) and NFS doesn't implement it properly. Nor does Windows. Opens of a log file for reading should be guaranteed that they will always read exactly out to the last write. In the event of a system crash during writing, log files may be truncated, but must be truncated at an exact write boundary; trailing off into junk is unacceptable. Unduplication via hashing probably isn't worth the trouble.
    • Managed files. Managed files are random-access files managed by a database or archive program. Random access is supported. The use of open modes O_SYNC, O_EXCL, or O_DIRECT during file creation indicates a managed file. Seeks while open for write are permitted, multiple opens access the same file, and O_SYNC and O_EXCL must work as documented. Unduplication via hashing probably isn't worth the trouble and is bad for database integrity.
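
    For reference, the "write a new file, then rename it" workaround mentioned under unit files looks roughly like this in Python. This is only a sketch of what applications fake today, not the filesystem-level guarantee being asked for; the name replace_unit_file is hypothetical, and the atomicity of os.replace assumes a POSIX system with the temp file on the same filesystem as the target.

```python
import os
import tempfile


def replace_unit_file(path, data):
    """Write `data` to a temp file beside `path`, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # get the new version onto disk first
        os.replace(tmp_path, path)   # readers see the old or the new file, never a mix
    except BaseException:
        # The writer failed before the rename: the old version is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

    Readers that already have the old file open keep reading the old version, which is the "never changes underneath a read" property, at the cost of needing room for both copies.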

    That's a useful way to look at files. Almost all files are "unit" files; they're written once and are never changed; they're only replaced. A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another. Those are the programs that have to manage files very carefully, and those programs are usually written to be aware of concurrency and caching issues.

    Unix and Linux have the right modes defined. File systems just need to use them properly.

    1. Re:There are three types of files. by greg1104 · · Score: 3, Insightful

      The main corner case in your suggested "unit file" implementation is where someone is overwriting a file too large for the filesystem to contain two copies of it. You have to truncate when this happens to fit the new one, you can't just keep the old one around until it's replaced. This makes it impossible to meet the spec you're asking for in all cases. The best you can do is try to keep the original around until disk space runs out, and only truncate it when forced to. However, if that's how the implementation works, then applications can't just blindly rely on the filesystem to always do the right thing and "give you that for free". They've still got to create the new file and confirm it got written out before they touch the original if they want to guarantee never losing the original good copy, so that they bomb with a disk space error rather than risk truncating the original. That's why this whole path doesn't go anywhere useful; better to work on popularizing an API for atomic rewrites or something.

      As for your "managed files" case, that won't work for all database approaches. For example, in PostgreSQL, only writes to the database write-ahead log are done with O_SYNC/O_DIRECT. The main data block updates (and writes that are creating new data blocks) are written out asynchronously, and then when internal checkpoints reach their end any unwritten blocks are forced to disk with fsync if they're still in the OS cache. You'd be hard pressed to detect which of your suggested modes was the appropriate one for just the obvious behavior there, and there's still more weird corner cases to worry about buried in there too (like what the database does with the data blocks and the WAL to repair corruption after a crash).

      Both these highlight that it's hard to make improvements here at just the filesystem level. Some of the really desirable behavior is hard to do unless applications are modified to do something different too. That hasn't really been going well for ext4 this year, and how that played out highlights how hard an issue this is to crack.

    2. Re:There are three types of files. by Animats · · Score: 1

      The main corner case in your suggested "unit file" implementation is where someone is overwriting a file too large for the filesystem to contain two copies of it. You have to truncate when this happens to fit the new one, you can't just keep the old one around until it's replaced.

      At that point, the program creating the huge file should get an I/O error, and the old copy should be intact. If you're creating files that big, you usually check the available space before writing the file, as installers have done for many years now. You may have to delete the old file first. UCLA Locus did this in the 1980s, incidentally, and some of that machinery went into AIX. (Locus had unusual file system semantics. If you started to overwrite a file, you created a new file which shared blocks with the old one in a copy-on-write sense. The new file appeared to other readers when you closed the file or called "commit()". If you called "revert()" or the program crashed or the network disconnected, the file reverted to its previous state.)

      As for your "managed files" case, that won't work for all database approaches.

      Torvalds has written about how databases should talk to file systems. Databases and file systems need to know something about each other. There's posix_fadvise() and fsync() as well as the open modes, and the use of any of those generally indicates that it's a managed file.

    3. Re:There are three types of files. by dkf · · Score: 1

      • Managed files. Managed files are random-access files managed by a database or archive program. Random access is supported. The use of open modes O_SYNC, O_EXCL, or O_DIRECT during file creation indicates a managed file. Seeks while open for write are permitted, multiple opens access the same file, and O_SYNC and O_EXCL must work as documented. Unduplication via hashing probably isn't worth the trouble and is bad for database integrity.

      O_EXCL is used sometimes for unit files too, such as when they want a guarantee that the file is created by them. This can be important in a security context.

      A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another.

      There's more than there used to be due to the rise of the small-database-as-library, such as sqlite. Generally, this is a good thing (applications with data integrity without masses of configuration or reinvention of the wheel can hardly be anything but good!) but it does mean that more files are "managed" in your sense than used to be.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    4. Re:There are three types of files. by Anonymous Coward · · Score: 0

      But when it dips to the fs level, it merely signifies. From the db level, yes it does.
      We have a world of inodes pointing to a universe of blocks. Any intelligence put in to make it worthy of business is special.

    5. Re:There are three types of files. by Anonymous Coward · · Score: 0

      Doesn't block deduplication fix that?

    6. Re:There are three types of files. by Anonymous Coward · · Score: 0

      The main corner case in your suggested "unit file" implementation is where someone is overwriting a file too large for the filesystem to contain two copies of it. You have to truncate when this happens to fit the new one, you can't just keep the old one around until it's replaced. This makes it impossible to meet the spec you're asking for in all cases. The best you can do is try to keep the original around until disk space runs out, and only truncate it when forced to. However, if that's how the implementation works, then applications can't just blindly rely on the filesystem to always do the right thing and "give you that for free". They've still got to create the new file and confirm it got written out before they touch the original if they want to guarantee never losing the original good copy, so that they bomb with a disk space error rather than risk truncating the original. That's why this whole path doesn't go anywhere useful; better to work on popularizing an API for atomic rewrites or something.

      Actually I think the "right thing" for the filesystem to do in the case where it cannot write a new copy of the unit file without truncation IS to fail the operation with a disk space error. By placing the file in this category, you are telling the filesystem that you always want a consistent copy of the file to be present, and that this requirement is more important than the ability to perform the operation, but at a higher risk of ending up with an inconsistent file.

      On the other hand I agree that fixing the problem by making atomic rewrites work would be the better solution. The main problem I see with the outlined scheme is migrating to it from where we are now.

    7. Re:There are three types of files. by Animats · · Score: 1

      There's more than there used to be due to the rise of the small-database-as-library, such as sqlite.

      Yes, and they tend to use common libraries, which need to be file-system aware if their data is to recover properly across crashes.

      There are programs which are database-like but not ACID-like databases, like ".zip" file managers. Those should open for exclusive use when writing. Reading a .zip file being written by another program is generally disappointing.

    8. Re:There are three types of files. by davecb · · Score: 1

      There are also "brinch hansen" files, named after Per Brinch Hansen, who implemented them in (from memory) the R2000 OS. They were write-once, read-once, and they disappeared "soon" after the reader read the block.

      They were used for large-scale intercommunication, rather like a pipe or queue, but larger and on-disk. That allowed one to pick up and continue from where you left off if your program or OS crashed, modulo some definition of "soon".

      --dave

      --
      davecb@spamcop.net
    9. Re:There are three types of files. by octal_sio · · Score: 1

      Plan 9's filesystem has an append permission mode bit.

      It also achieves deduplication in a simple way using Venti. Clever stuff.

    10. Re:There are three types of files. by Anonymous Coward · · Score: 0

      O_APPEND most certainly is enforced. Sure you can seek, but any time you write it'll seek back to the end before doing that.

      If the file offset were used normally for O_APPEND files, it would be broken; if any other process appends more data before your write, you would no longer be at the end of the file. Fortunately, it works correctly, and multiple processes can share a log file. Not that it's usually a good idea, but it works.
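
      Easy enough to check for yourself. A small sketch, assuming a POSIX system (the function name append_after_seek is just for illustration): open with O_APPEND, seek back to the start, write, and see where the data lands.

```python
import os
import tempfile


def append_after_seek(path):
    """Open with O_APPEND, seek to offset 0, write, and return the file contents."""
    with open(path, "wb") as f:
        f.write(b"first line\n")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        os.lseek(fd, 0, os.SEEK_SET)     # the seek itself succeeds...
        os.write(fd, b"second line\n")   # ...but the write still goes to end of file
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(append_after_seek(os.path.join(d, "app.log")))
```

      The first line survives and the second one is appended after it, seek or no seek.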

  28. 223517417907714843750 terabyte drive by Anonymous Coward · · Score: 0

    newegg has these on special for 169.00, but the reviews stink. Wait for the SATA version.

  29. full byte comparisons by linuxhansl · · Score: 1

    I like this from the article:

    > You can tell ZFS to do full byte comparisons rather than relying on the hash if you want full security against hash duplicates:

    I once did a similar project with web content caching that replaced some data with a hash of said data, with a way to get to the actual data. All sorts of people were worried about hash conflicts, etc. People are always worried about collisions.

    It took a lot of convincing that that risk is lower than a nuclear strike on the data center(s).

    What finally did convince my teammates was that 2^256 (~10^77) is, by some estimates, close to the number of elementary particles in the visible universe (within a few orders of magnitude at least).
    So assuming the hash function is good (there's no evidence to prove otherwise), we'd have to try almost as many inputs as there are particles in the universe. The chances of hitting duplicates are so astronomically small that doing byte comparisons is most certainly useless, and just a check mark feature for those types who worry about these things. AFAIK there are no known SHA256 duplicates.

    1. Re:full byte comparisons by Anonymous Coward · · Score: 0

      I do this all the time on web projects to store duplicates of files, duplicate lines of text, duplicate database entries, and redundant queries against other web services. It's a great way to reduce data storage, and to reduce database size.

      The first thing people worry about is that data will be overwritten, which is silly because you don't write data for a pre-existing hash. You just drop it. As you mention, there is no real chance of an accidental collision. So what would have to happen is someone would have to know an existing hash, create a hash collision, and use that to gain access to the existing data. But creating a hash collision is currently not possible, and if they can discover the existing hash, you're probably hosed anyway.
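
      A toy version of the "drop the write if the hash already exists" idea, as a hedged sketch rather than anything production-grade. DedupStore, put, get and the verify flag are hypothetical names; verify stands in for the optional full byte comparison quoted from the article.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one stored copy per distinct SHA-256 digest."""

    def __init__(self, verify=False):
        self.blobs = {}        # hex digest -> bytes
        self.verify = verify   # if True, compare bytes whenever a digest matches

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        existing = self.blobs.get(key)
        if existing is None:
            self.blobs[key] = data
        elif self.verify and existing != data:
            # Same digest, different bytes: a genuine collision.
            raise ValueError("hash collision for digest " + key)
        # Otherwise the write is simply dropped and the stored copy is reused.
        return key

    def get(self, key):
        return self.blobs[key]
```

      Nothing is ever overwritten: a second put of the same bytes returns the same key and leaves the store as it was.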

      - Atamido

  30. BTRFS is better by Theovon · · Score: 2, Interesting

    At first, BTRFS started out as an also-ran, trying to duplicate a bunch of ZFS features for Linux (where licensing wasn't compatible to incorporate ZFS into Linux). But then BTRFS took a number of things that were overly rigid about ZFS (shrinking volumes, block sizes, and some other stuff), and made it better, including totally unifying how data and metadata are stored. I'm sure there are a number of ways in which ZFS is still better (RAIDZ), but putting aside some of the enterprise features that most of us don't need, BTRFS is turning out to be more flexible, more expandable, more efficient, and better supported.

    1. Re:BTRFS is better by Anonymous Coward · · Score: 0

      ...and how exactly does that contribute to an article on a new feature of ZFS?

      I'm surprised the nerds aren't on here blathering on about HAMMER too....

    2. Re:BTRFS is better by hab136 · · Score: 1

      And unusable on anything but Linux due to licensing (BTRFS is GPL). No, FUSE doesn't count for production use.

    3. Re:BTRFS is better by TheRaven64 · · Score: 1

      Sure. Apart from the fact that it's not production ready yet, doesn't yet implement most of the features of ZFS, and doesn't yet implement most of the better-than-ZFS features, btrfs is loads better.

      --
      I am TheRaven on Soylent News
    4. Re:BTRFS is better by samjam · · Score: 1

      And so I thank apple for being mean and stinky about ZFS, or we wouldn't get BTRFS

    5. Re:BTRFS is better by Anonymous Coward · · Score: 0

      Except for the minor detail that it doesn't actually work in production yet and ZFS has for years? Yeah, I guess once it actually exists it's going to rule! any day now, any day now!

    6. Re:BTRFS is better by samjam · · Score: 1

      And so I thank apple for being mean and stinky about ZFS, or we wouldn't get BTRFS

      Good grief; I mean I thank SUN for being so mean and stinky

    7. Re:BTRFS is better by Anonymous Coward · · Score: 0

      "putting aside some of the enterprise features that most of us don't need" - Oh you need them, you just very likely don't ever see them. I work for the largest storage vendor on the planet and your bank account lives on enterprise storage as do your medical records, insurance, etc.


      I really look forward to seeing btrfs and other ideas like lessfs move forward and hope that people will be amazed at what they're actually doing because it is really, really hard to get this stuff right so that you get your data back. Deduplication introduces brain-hurting complexities and challenges.

  31. Slashdot on ZFS by Anonymous Coward · · Score: 1, Funny

    So ... any plans on using ZFS on slashdot to help de-duplicate stories?

  32. so when the fanboys are done jizzing by Anonymous Coward · · Score: 0

    It's a filesystem. It stores files. efficiently. Uhhh.... that's cool and all
    Quit jizzing. Realize the practical benefit to society, meditate on it, and then go back to that righteous ftp client you were writing.
     

  33. Infinite compression? by n9hmg · · Score: 3, Funny

    If a hash were a replacement for data, that's all we'd need....goedelize the universe? Sometimes I just want to scream, or weep, or shoot everybody....or just drop to my knees and beg them to think - just a little tiny insignificant bit - think. Maybe it'll add up. Probably not, but it's the best I can do.

    1. Re:Infinite compression? by jimicus · · Score: 2, Interesting

      If a hash were a replacement for data, that's all we'd need....goedelize the universe?

      Sometimes I just want to scream, or weep, or shoot everybody....or just drop to my knees and beg them to think - just a little tiny insignificant bit - think. Maybe it'll add up. Probably not, but it's the best I can do.

      Which is why ZFS allows you to specify using a proper file comparison rather than just a hash.

      It's unlikely you'll have a collision considering it's a 256-bit hash but, as you allude, that likelihood does go up somewhat when you're dealing with a filesystem which is designed to (and therefore presumably does) handle terabytes of information.

    2. Re:Infinite compression? by ThePhilips · · Score: 1

      IIRC, the Plan9 file system did something similar. The ending I heard was rather sorry: corruptions were reported at a rate of once or twice per year. People had to migrate to something else because such a file system couldn't be trusted.

      This is a rather unacceptable feature for anything whose job is information storage.

      --
      All hope abandon ye who enter here.
    3. Re:Infinite compression? by Thing+1 · · Score: 1

      or shoot everybody

      Ahem, sorry sir, I cannot sell you these bullets and you'll have to leave the store.

      --
      I feel fantastic, and I'm still alive.
    4. Re:Infinite compression? by mindstrm · · Score: 1

      They address that in the article - when you do the math, the likelihood of having a hash collision in SHA256, given a zettabyte filesystem with 128KB blocks, is still orders of magnitude lower than the known rate of data loss due to block-level hardware failure.
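
      The arithmetic is short enough to redo yourself. A back-of-the-envelope sketch using the birthday upper bound n(n-1)/2 / 2^256, assuming a binary zettabyte (2^70 bytes), every block unique, and a hash that behaves uniformly; the function name is made up.

```python
import math


def log2_collision_bound(num_blocks, hash_bits):
    """log2 of the birthday upper bound n*(n-1)/2 / 2**hash_bits."""
    pairs = num_blocks * (num_blocks - 1) // 2   # exact integer arithmetic
    return math.log2(pairs) - hash_bits


ZETTABYTE = 2 ** 70        # bytes (binary zettabyte, an assumption)
BLOCK_SIZE = 128 * 1024    # 128 KB blocks

blocks = ZETTABYTE // BLOCK_SIZE               # 2**53 blocks
exponent = log2_collision_bound(blocks, 256)
print(f"blocks = 2^{math.log2(blocks):.0f}")
print(f"P(any collision) <= 2^{exponent:.0f}")
```

      That works out to about 2^-151 for the whole pool, so the "ifs and buts" are about whether the hash really behaves uniformly, not about the size of the number.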

    5. Re:Infinite compression? by jimicus · · Score: 1

      There are, of course, ifs and buts concerning that - chiefly that we assume that any given hash is as likely to come up as any other given hash.

      So far this seems to be the case with SHA256 but AFAIK nobody's formally proved it. It's possible (albeit unlikely) that a relatively straightforward means of generating collisions will be announced tomorrow.

  34. Billions of dollars by Deton8 · · Score: 1

    I bet EMC is happy they just out-bid NetApp to the tune of $2.4 billion, for basically the same technology that Jeff Bonwick is giving away for free.

  35. ZFS, the first sexy file system by Anonymous Coward · · Score: 0

    ZFS is so sexy, I want a nekkid picture of it.

  36. Not quite the wonderful thing it appears to be by pjr.cc · · Score: 2, Insightful

    De-dupe has been around for a while and has some advantages and quite a few negatives... First off, I'd be interested to see how many patent trolls this might stumble over. But de-dup has always gone hand in hand with backups and golden images. EMC, HDS and co never did a good job supporting golden images, but other storage vendors have done well with it (3PAR, Compellent, EqualLogic).

    For the uninitiated, golden images usually consist of building a machine on a SAN, and then using that one image to power many machines (i.e. the same blocks on disk). It then usually just stores deltas from the golden image for each machine... it's got its advantages and disadvantages much like de-dup.

    Now, the reasons for its use are simple: "pay less for storage", which sounds dumb in this day and age (with 1 TB drives costing virtually nothing), but the reality is in the SAN world 1 TB drives cost a fortune, and wherever you use de-dup or golden images, you usually use the fastest (and smallest) disk you can get your hands on. (If you don't understand why this is, see the Backblaze article from a little while ago - ultimately, putting more space in a bit of SAN storage kit is freaking expensive.) In the enterprise world, it's almost impossible to step away from SAN storage (unless you're Google or Backblaze).

    The big problem with de-dup (and why it's primarily used for backups, and primarily only disk-based backups) is how it affects the storage. If you suddenly have one hot spot, even on fast disk, the storage starts grinding to a halt (even when considering caching) because lots of things start accessing the same blocks on the disk. This is not a problem for backups because it's usually a once-written, rarely-read scenario. On file servers and databases, it's a performance killer (something akin to RAID 5/6 in software). But de-dup is fantastic for archival storage!... De-dup and performance often tend to be a self-fulfilling prophecy though, simply because data that is duplicated is often duplicated because it's heavily accessed. Take email as a good example. Joe sends out an email with an attachment of some form (perhaps it's a document template, but it really doesn't matter so long as he's sending it to a large number of people), and all those people save the attachment and probably make some edits. This introduces the next load of pain, fragmentation. All those deltas from the original now need to be saved "somewhere else" and meanwhile all these people are accessing not just the de-dup'ed blocks but the fragmented changes (consider the kernel source for Linux as well, tonnes of branches of code that would possibly get de-duped and fragmented). Databases are another great example. Often in tablespaces there is quite a load of block-aligned duplicate data; often this is the nature of how databases store data. Sometimes this data can be quite critical to their function, and to have a database slamming the same blocks (again with small fragmented changes) is pain personified.

    Still, I wonder how many patents Sun is likely to trip up on... I see this being non-fun as there are many people who make serious cash from de-dup at various levels....

  37. What if... by azav · · Score: 1

    What if that single block goes bad?

    --
    - Zav - Imagine a Beowulf cluster of insensitive clods...
    1. Re:What if... by asaul · · Score: 1

      ZFS detects the checksum failure, then picks it up from the other mirror or ditto block and replaces the corrupted one.

      You did set up disk redundancy, didn't you?

      --
      "If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
    2. Re:What if... by raynet · · Score: 1

      You mark that as bad and use the backup copy you have, as I assume you would be using at least mirroring or, better yet, raidz or raidz2.

      --
      - Raynet --> .
    3. Re:What if... by Sulphur · · Score: 1

      In Soviet Russia Zek File System never stores dupes.

      --

      This is an upper bound, not an actual value.

  38. I wrote about this by jesset77 · · Score: 1

    I recently wrote an article about my thoughts on filesystems and operating systems by way of a fictional reference OS mentioning ZFS in a positive light for reasons including the dedupe feature mentioned in today's article:

    IRON/Cloud — the outline of what a modern OS should be

    I link back to the (yes, slashdot) article wherein I first learned about ZFS, and a rundown of the features I like about ZFS.

    But no, I checked and our article texts do not hash to the same value, so I do not believe we would be stored at the same location on disk. ;D

    --
    People willing to trade their freedom of expression for temporary entertainment deserve neither and will lose both.
  39. Combine with BitTorrent? by Anonymous Coward · · Score: 0

    Downloading a chunk with SHA.DEADBEAF...

    oh ... looky here... DEADBEAF is already on disk.

    Done.

  40. I'm super happy about this! by Mysticalfruit · · Score: 1

    As someone who's got a HUGE amount of data currently in ZFS (and a lot of it is redundant!) I can't wait to get my hands on this! I figure on my backup server alone it's going to save me tens of TB worth of space.

    I just wish there were more details on what release of OpenSolaris or Solaris this is going to be in, or patch sets that'll include this!

    --
    Yes Francis, the world has gone crazy.
  41. Been wanting something like this for a long time by psocccer · · Score: 1

    A lot of times these days I use rsync to do hard linked backups, which works mostly well but has some shortcomings. For example, backups across multiple machines don't have their duplicate files hardlinked, and files that are mostly similar can't be hard linked, such as files that grow like log files. More specifically we have some database files that grow with yearly detail information and everything before the newly added records is identical, resulting in gigs of used up space every day during backups when maybe a few megs has changed.

    Initially I liked the way BackupPC handled the situation by pooling and compressing all the files, and duplicate files from different backups were automatically linked together. So I wrote a little script that primarily duplicated the functionality of hardlinking duplicate files together regardless of file stat, running on top of fusecompress to get the compression too. The problem mostly is the time it takes to crawl thousands and thousands of files and relink them. On top of that, rsync will not use those duplicate files for hardlinks in the next backup if the file stat info doesn't match, like mtime/owner/etc, which means the next backup contains fresh new copies of files that have to be re-hardlinked by crawling the files again. Plus you don't get elimination of partial file redundancy.

    So I looked around some more for a system that would allow you to compress out redundant blocks, and the closest thing I could find is squashfs, but it's read-only. Which sucks because we need to purge daily local backups occasionally to make more room for newer backups. We keep the last 6 months of daily backups available on a server, and do daily offsite backups from that. So once a month we delete the oldest month's backups from the local backup server, and using squashfs you'd have to recreate the whole squash archive, which would suck for a terabyte archive with millions of files in it.

    At this point I knew what features I wanted but couldn't find anything that did it yet, so I went ahead and wrote a fuse daemon in python that handles block-level deduplication and compression at the same time. I'm still playing around with it and testing different storage ideas, it's available in git if anyone wants to take a look, you can get it by doing:

    git clone http://git.hoopajoo.net/projects/fusearchive.git fusearchive

    (note the above command might be mangled because of the auto-linking in slashdot, there should be no [hoopajoo.net] in the actual clone command)

    Currently it uses a storage directory with 2 sub directories, store/ and tree/. Inside tree/ are files that contain a hash that identifies the block list for the file contents. This way 2 identical files will only consume the size of a hash on disk + inodes. The hash points to the block that contains the file data block list, which is also a list of hashes of the data. This way any files that have identical blocks (on a block boundary) will have the redundant blocks only take up the size of the hash. Blocks are currently 5M, which can be tuned, and the blocks are compressed using zlib. So a bunch of small files get the benefit of compression and entire-file deduplication, while large growing files will at most use up an extra block of data + the hash info for the rest of the file. So far this seems to be working pretty well. The biggest issue I have is tracking block references so we can free the block when it's no longer referenced by any files. It works fine currently, but since each block contains its own reference counter a crash could make the ref counts incorrect, and unfortunately I can't think of a better, more atomic way to handle that. The other big drawback is speed: it's about 1/3 the speed of native file copying, and from profiling the code 80-90% of the time seems to be spent passing fuse messages in the main fuse-python library, with a little time being taken up by zlib and actual file writes.
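
    To make the layout concrete, here is a stripped-down, in-memory sketch of that scheme. It is not the fusearchive code: the class and method names are invented for illustration, the dicts stand in for the store/ directory, SHA-256 is an assumed choice of hash, and the crash-safety problem with the reference counts is not addressed at all.

```python
import hashlib
import zlib


class BlockStore:
    """In-memory stand-in for store/: compressed blocks keyed by their hash."""

    def __init__(self, block_size=5 * 1024 * 1024):
        self.block_size = block_size
        self.blocks = {}   # digest -> zlib-compressed block
        self.refs = {}     # digest -> reference count

    def _put(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = zlib.compress(data)
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def _release(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:   # last reference gone: free the block
            del self.refs[key], self.blocks[key]

    def write_file(self, data):
        """Split into blocks, store each, then store the block list itself."""
        keys = [self._put(data[i:i + self.block_size])
                for i in range(0, len(data), self.block_size)]
        return self._put("\n".join(keys).encode())   # what a tree/ entry would hold

    def _block_list(self, file_key):
        listing = zlib.decompress(self.blocks[file_key]).decode()
        return listing.split("\n") if listing else []

    def read_file(self, file_key):
        return b"".join(zlib.decompress(self.blocks[k])
                        for k in self._block_list(file_key))

    def delete_file(self, file_key):
        for k in self._block_list(file_key):
            self._release(k)
        self._release(file_key)
```

    Two files that share leading blocks only pay for the blocks that differ plus one block list each, and deleting one of them frees only the blocks nobody else references.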

    If I could get s

  42. RE: lack of cutting edges by ewenix · · Score: 1

    One of the things I like about my Mac is the lack of cutting edges.

    Yep.. Both Macs and duplo blocks are safe like that, and aimed at the same demographic.

  43. negligible vs. Murphy by Something+Witty+Here · · Score: 1

    Reiser has a method for eliminating unwanted bits, but there
    is a bug that chroots you inside a jail.

    >>> The probability of a hash collision for a 256 bit hash (or even a 128
    >>> bit one) is negligible.

    Which means idiots will assume that it never happens. In other
    news, real estate prices never go down and o-rings on space
    shuttles never leak.

    >> I run Linux, where's my ZFS?

    Upgrade to FreeBSD.

    > Log files. Log files can only be appended to.

    See OpenBSD.

    > Managed files. Managed files are random-access files managed by a
    > database or archive program.

    Such a limited view. I have lots of random access files I
    maintain with emacs.

  44. Why? by raftpeople · · Score: 1

    It's unlikely you'll have a collision considering it's a 256-bit hash

    Probability and actuality are 2 different things. Just because the probability is low doesn't mean it won't happen with the first 2 blocks encountered. I don't see how this (using a hash) can work given that the results are not guaranteed.

    1. Re:Why? by jimicus · · Score: 1

      Probability and actuality are 2 different things. Just because the probability is low doesn't mean it won't happen with the first 2 blocks encountered. I don't see how this (using a hash) can work given that the results are not guaranteed.

      We are not talking "probability so low you'd be better off entering the lottery". We are talking "probability so low you'd be more likely to win every lottery in existence on the planet simultaneously".

      As I've already said, you can enable full checking if you are really paranoid - and some applications where you need to put a cast-iron guarantee on data integrity (possibly financial or health related) probably would. For the rest of us, I wouldn't be so bothered.