Slashdot Mirror


ZFS Gets Built-In Deduplication

elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."

30 of 386 comments (clear)

  1. Does that mean... by Anonymous Coward · · Score: 4, Funny

    Duplicate slashdot articles will be links back to the original one?

    1. Re:Does that mean... by ezzzD55J · · Score: 3, Insightful

      The single block is still stored redundantly, of course. Just not redundantly more than once.

  2. Re:Hash Collisions by CMonk · · Score: 3, Informative

    That is covered very clearly in the blog article referenced from the Register article. http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

  3. More reason to be a ZFS fanboy by BitZtream · · Score: 3, Insightful

    I'm wondering how long its going to take for them to do something with ZFS that actually makes me slow down my overwhelming ZFS fanboyism.

    I just love these guys.

    My virtual machine NFS server is going to have to get this as soon as FBSD imports it, and I'll no longer have to worry about having backup software (like BackupPC, good stuff btw) that does this.

    I don't use high end SANs but it would seem to me that they are rapidly losing any particular advantage to a Solaris or FBSD file server.

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    1. Re:More reason to be a ZFS fanboy by HockeyPuck · · Score: 3, Informative

      The advantages of SANs are easy to realize, they need not necessarily be FibreChannel vs NAS (NFS/CIFS) as a SAN could be iSCSI, FCOE, FCIP, FICON etc..

      -Storage Consolidation compared with internal disk.
      -Fewer components in your servers that can break.
      -Server admins don't have to focus on Storage except at the VolMgr/Filesystem level
      -Higher Utilization (a WebServer might not need 500GB of internal disk).
      -Offloading storage based functions (RAID in the array vs RAID on your server's CPU, I'd rather the CPU perform application work rather than calculating parity, replacing failed disks etc). This increases when you want to replicate to a DR site.

      This is not a ZFS vs SANs argument. I think ZFS running on SAN based storage is a great idea as ZFS replaces/combines two applications that are already on the host (volmgr & filesystem).

    2. Re:More reason to be a ZFS fanboy by Anonymous Coward · · Score: 5, Informative

      How about this: you can't remove a top-level vdev without destroying your storage pool. That means that if you accidentally use the "zpool add" command instead of "zpool attach" to add a new disk to a mirror, you are in a world of hurt.

      How about this: after years of ZFS being around, you still can't add or remove disks from a RAID-Z.

      How about this: If you have a mirror between two devices of different sizes, and you remove the smaller one, you won't be able to add it back. The vdev will autoexpand to fill the larger disk, even if no data is actually written, and the disk that was just a moment ago part of the mirror is now "too small".

      How about this: the whole system was designed with the implicit assumption that your storage needs would only ever grow, with the result that in nearly all cases it's impossible to ever scale a ZFS pool down.

    3. Re:More reason to be a ZFS fanboy by Methlin · · Score: 4, Informative

      Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point. Currently the only solutions to these is to build a new storage pool, either on the same system or different system, and export/import; big PITA and potentially expensive. Off the top of my head I can't think of anyone that lets you do #2 except enterprise storage solutions and Drobo.

    4. Re:More reason to be a ZFS fanboy by KonoWatakushi · · Score: 3, Informative

      How alarmist and uninformed; borderline FUD. The reality is as follows...

      First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now. Same with crypto.

      Second, mistakenly typing add instead of attach will result in a warning that the specified redundancy is different, and refuse to add it.

      Third, yes, you can't expand the width of a RAID-Z. You can still grow it though, by replacing it with larger drives. Once the block pointer rewrite work is merged, removal will be possible, and expansion won't be far off either.

      Forth, vdevs no longer autoexpand by default. If you want that behavior, you can to set the autoexpand property to yes.

      Last, there was no such assumption, it is simply a matter of priorities. If it were an easier problem, it would have been done long ago, but I'm happy to be patient, knowing that it will be done right. Most everyone who has seriously used ZFS will understand that the advantages will hugely outweigh these minor nits, which are easily worked around.

    5. Re:More reason to be a ZFS fanboy by greg1104 · · Score: 4, Informative

      How alarmist and uninformed; borderline FUD. The reality is as follows...

      First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.

      The bug report for this problem goes back to at least April of 2003. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.

  4. Re:This is good news... by bcmm · · Score: 4, Insightful

    ...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.

    Use open source, get cutting edge things.

    --
    # cat /dev/mem | strings | grep -i llama
    Damn, my RAM is full of llamas.
  5. Wake me when they build it into the hard disk by icebike · · Score: 4, Interesting

    Imagine he amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics. It could even do this quietly in the background.

    I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

    But since the copy operation de-compressed files on the fly we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copy and delete files beginning with the smallest, and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

    De-duplication is pretty much the same thing, compression by recording and eliminating duplicates. But any minor automated update of some files runs the risk of changing them such that what was a duplicate, must now be stored separately.

    This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").

    For archival stuff or OS components (executables, and source code etc) which virtually never change this would be great.

    But there is a hell to pay somewhere down the road.

    --
    Sig Battery depleted. Reverting to safe mode.
    1. Re:Wake me when they build it into the hard disk by icebike · · Score: 3, Interesting

      Bad design on Novell's part, but the problem persists in the de-duplicated world, where de-duplicating to memory only is not a solution.

      Imagine a hundred very large file containing largely the same content. Not imagine CHANGING just a few characters in each file via some automated process. Now 100 files which were actually stored as ONE file balloon to 100 large files.

      On a drive that was already full, changing just a few characters (not adding any total content) could cause a disk full error.

      You really can't fake what you don't have. You either have enough disk to store all of your data or you run the risk of hind-sight telling you it was a really bad design.

      --
      Sig Battery depleted. Reverting to safe mode.
    2. Re:Wake me when they build it into the hard disk by ArsonSmith · · Score: 3, Informative

      No you still have it stored the size of one file + 100 block sizes, in size. You'd need a substantially large number of random changes through all 100 files to balloon up from 1x file size, to 100x file size.

      --
      Paying taxes to buy civilization is like paying a hooker to buy love.
  6. Re:Any other file systems with that feature? by iMaple · · Score: 5, Informative

    Windows Storage Server 2003 (yes, yes I know its from Microsoft) shipped with this feature (that is called Single Instance Storage)
    http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a

  7. Re:Hash Collisions by shutdown+-p+now · · Score: 4, Informative

    Before I left Acronis, I was the lead developer and designer for deduplication in Acronis Backup & Recovery 10. We also used SHA256 there, and naturally the possibility of a hash collision was investigated. After we did the math, it turned out that you're about 10^6 times more likely to lose data because of hardware failure (even considering RAID) than you are to lose it because of a hash collision.

  8. Re:This is good news... by jeffb+(2.718) · · Score: 3, Funny

    Use open source, get cutting edge things.

    The last time I tried to build an Intel box for Linux work, I lost my grip on the cheap generic case, and sustained a cut that sent me to the emergency room. One of the things I like about my Mac is the lack of cutting edges.

  9. Re:This is good news... by Anonymous Coward · · Score: 4, Funny

    Shoulda gone with a blade server, then you wouldn't have had to worry about the emergency room.

  10. Re:Any other file systems with that feature? by buchner.johannes · · Score: 4, Informative

    From that link: It is file-based and a service indexes it (whereas in ZFS it is block-based and on-the-fly). And they first introduced it in Windows Server 2000. Amazing. I'm sure it is a ugly hack since Windows has no soft/hard-links IIRC.

    --
    NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
  11. Re:Any other file systems with that feature? by hedwards · · Score: 3, Informative

    ZFS is a copy on write filesystem, it already creates a temporary second copy so that the file system is always consistent if not quite up to date. I'd venture to guess that the new version of the file, not being identical to the old file would just be treated like copying it to a new name.

  12. Re:This is good news... by Trepidity · · Score: 4, Informative

    If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in doing the deduplication process, and drive space at consumer-level sizes is dirt-cheap, so it's only really worth doing this you have a lot of block-level duplicate data. That might be the case if e.g. you have 30 VMs on the same machine each with a separate install of the same OS, but is unlikely to be the case on a normal Mac laptop.

  13. Par for the course.. by Junta · · Score: 4, Interesting

    Any filesystem implementing copy-on-write at all, data dedupe, and/or compression is already a strategy where the risk of exhausting oversubscribed storage due to unanticipated compression ratios or uniqueness is a risk. It's a reason why you have to be pretty explicit to NetApp filers implementing these features that you are accepting the risk of exhausting allocations if you actually make use of these features to the point of advertising more storage capacity than you actually have.

    You don't even need a fancy filesystem to expose yourself to this today:
    $ dd if=/dev/zero of=bigfile bs=1M seek=8191 count=1
    1+0 records in
    1+0 records out
    1048576 bytes (1.0 MB) copied, 0.00426769 s, 246 MB/s
    jbjohnso@wirbelwind:~$ ls -lh bigfile
      8.0G 2009-11-02 20:06 bigfile
    ~$ du -sh bigfile
    1.0M bigfile

    This possibility has been around a long file and the world hasn't melted. Essentially, if someone is using these features, they should be well aware of the risks incurred.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  14. Re:Any other file systems with that feature? by jpmorgan · · Score: 4, Informative

    You recall wrong. NTFS has long supported both hard links and a mechanism called 'reparse points,' which are much more powerful than simple symlinks.

  15. Re:Open Source Cures Cancer by Anonymous Coward · · Score: 5, Funny

    Like a cutting edge CAD packages, games, financial management and office suites?

    Umm, dia, nethack, perl, emacs?

  16. There are three types of files. by Animats · · Score: 5, Interesting

    I'd argue that file systems should know about and support three types of files:

    • Unit files. Unit files are written once, and change only by being replaced. Most common files are unit files. Program executables, HTML files, etc. are unit files. The file system should guarantee that if you open a unit file, you will always read a consistent version; it will never change underneath a read. Unit files are replaced by opening for write, writing a new version, and closing; upon close, the new version replaces the old. In the event of a system crash during writing, the old version of the file remains. If the writing program crashes before an explicit close, the old file remains. Unit files are good candidates for unduplication via hashing. While the file is open for writing, attempts to open for reading open the old version. This should be the default mode. (This would be a big convenience; you always read a good version. Good programs try to fake this by writing a new file, then renaming it to replace the old file, but most operating systems and file systems don't support atomic multiple rename, so there's a window of vulnerability. The file system should give you that for free.)
    • Log files Log files can only be appended to. UNIX supports this, with an open mode of O_APPEND. But it doesn't enforce it (you can still seek) and NFS doesn't implement it properly. Nor does Windows. Opens of a log file for reading should be guaranteed that they will always read exactly out to the last write. In the event of a system crash during writing, log files may be truncated, but must be truncated at an exact write boundary; trailing off into junk is unacceptable. Unduplication via hashing probably isn't worth the trouble.
    • Managed files Managed files are random-access files managed by a database or archive program. Random access is supported. The use of open modes O_SYNC, O_EXCL, or O_DIRECT during file creation indicates a managed file. Seeks while open for write are permitted, multiple opens access the same file, and O_SYNC and O_EXCL must work as documented. Unduplication via hashing probably isn't worth the trouble and is bad for database integrity.

    That's a useful way to look at files. Almost all files are "unit" files; they're written once and are never changed; they're only replaced. A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another. Those are the programs that have to manage files very carefully, and those programs are usually written to be aware of concurrency and caching issues.

    Unix and Linux have the right modes defined. File systems just need to use them properly.

    1. Re:There are three types of files. by greg1104 · · Score: 3, Insightful

      The main corner case in your suggested "unit file" implementation is where someone is overwriting a file too large for the filesystem to contain two copies of it. You have to truncate when this happens to fit the new one, you can't just keep the old one around until it's replaced. This makes it impossible to meet the spec you're asking for in all cases. The best you can do is try to keep the original around until disk space runs out, and only truncate it when forced to. However, if that's how the implementation works, then applications can't just blindly rely on the filesystem to always do the right thing and "give you that for free". They've still got to create the new file and confirm it got written out before they touch the original if they want to guarantee never losing the original good copy, so that they bomb with a disk space error rather than risk truncating the original. That's why this whole path doesn't go anywhere useful; better to work on poplarizing an API for atomic rewrites or something.

      As for your "managed files" case, that won't work for all database approaches. For example, in PostgreSQL, only writes to the database write-ahead log are done with O_SYNC/O_DIRECT. The main data block updates (and writes that are creating new data blocks) are written out asynchronously, and then when internal checkpoints reach their end any unwritten blocks are forced to disk with fsync if they're still in the OS cache. You'd be hard pressed to detect which of your suggested modes was the appropriate one for just the obvious behavior there, and there's still more weird corner cases to worry about buried in there too (like what the database does with the data blocks and the WAL to repair corruption after a crash).

      Both these highlight that it's hard to make improvements here at just the filesystem level. Some of the really desirable behavior is hard to do unless applications are modified to do something different too. That hasn't really been going well for ext4 this year, and how that played out highlights how hard an issue this is to crack.

  17. Re:Any other file systems with that feature? by bertok · · Score: 3, Informative

    Windows Storage Server 2003 (yes, yes I know its from Microsoft) shipped with this feature (that is called Single Instance Storage)
    http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a

    It's not even close to the same thing.

    We investigated this a while back, and it is basically a dirty, filthy hack on top of vanilla NTFS.

    First of all, it doesn't compare blocks or byte-ranges, but entire files only. If two files are 99% identical, then they are different, and SIS won't merge them.

    Second, it uses a reparse point to merge the files, which has significant overhead, at least 4KB for each file, if I remember correctly. That is, SIS won't save you any disk space for small files, which is actually quite common on file servers. The overhead erases much of the benefit even for larger files, to the level that SIS will skip files smaller than 32KB by default.

    Third, it operates in the background, after files have been written. This means that files have to be written out in their entirety, read back in, compared byte-for-byte to another file, and then erased later. This is incredibly inefficient. On large file servers, the disk was thrashed like crazy.

    Lastly, we found that the Copy-on-Write mechanism immediately copied out the entire file if it was changed even slightly. For small files, this is not noticable, but for large files this can be a massive performance hog. A 4kb write can be potentially translated into a multi-GB copy!

    Proper single-instancing systems use in-memory hash tables that are often partitioned using "file similarity" heuristics to prevent cache thrashing. Even more advanced systems can maintain single-instancing during replication and backups, reducing bandwidth requirements enormously. Take a look at the features of the Data Domain filers for an idea of what the current state of the art is.

  18. Re:Any other file systems with that feature? by binaryspiral · · Score: 4, Interesting

    Microsoft's SIS is a joke. A few folks have dedupe down to a science - Data Domain and NetApp.

    We virtualized our filers into an ESX 3.5 cluster and dropped the VMDK files onto a NetApp 3140... deduped them to 18% of their original size. No performance impact, actually faster than our original servers and much more efficient.

    ROI - three months.

    Difficulty to implement dedup? A checkmark and the OK button.

  19. Re:Open Source Cures Cancer by Anpheus · · Score: 4, Funny

    But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly? Not the people who made the software. Are you going to email the Wine mailing list and then, when they fail to deliver a timely solution for free, tell the client that open source is to blame?

    At least when I buy software, or make purchasing decisions from a business standpoint, knowing that the company will stand behind the product and our implementation of it is more important than that trying to pursue some ideal about information and it's anthropomorphized desire to be free.

  20. Infinite compression? by n9hmg · · Score: 3, Funny

    If a hash were a replacement for data. that's all we'd need....goedelize the universe? Sometimes I just want to scream, or weep, or shoot everybody....or just drop to my knees and beg them to think - just a little tiny insignificant bit - think. Maybe it'll add up. Probably not, but it's the best I can do.

  21. Re:Open Source Cures Cancer by sjames · · Score: 3, Interesting

    The same people you call when your proprietary system breaks and you discover that the official tech support people can't find their posterior with both hands and a map. Most cities have a number of grief councilors ready to support you in your time of need. If it was really critical, try the suicide hotline.