ZFS Gets Built-In Deduplication

Posted by ScuttleMonkey on Monday November 2, 2009 @11:21AM from the sounds-like-a-resource-hog-waiting-to-happen dept.

elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
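The block-level approach in the summary can be sketched in a few lines. This is only a toy illustration, assuming fixed-size blocks held in memory; the `DedupStore` class, its method names, and the 4096-byte block size are made up for the example and are not ZFS's on-disk design.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative only; not ZFS's actual record size


class DedupStore:
    """Toy block store: identical blocks are kept once, keyed by SHA-256."""

    def __init__(self):
        self.blocks = {}    # digest -> block bytes
        self.refcount = {}  # digest -> number of logical references

    def write(self, data: bytes) -> list:
        """Split data into blocks and return the digests that describe it."""
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            if digest not in self.blocks:
                self.blocks[digest] = block  # first copy: actually store it
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        return digests

    def read(self, digests) -> bytes:
        """Reassemble the original data from its list of digests."""
        return b"".join(self.blocks[d] for d in digests)


store = DedupStore()
image_a = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
image_b = b"A" * BLOCK_SIZE * 3 + b"C" * BLOCK_SIZE
ptrs_a = store.write(image_a)
ptrs_b = store.write(image_b)
```

The two "images" describe eight logical blocks, but only three distinct blocks are stored, which is why the technique is said to suit virtual machine images.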

More reason to be a ZFS fanboy by BitZtream · 2009-11-02 11:32 · Score: 3, Insightful

I'm wondering how long it's going to take for them to do something with ZFS that actually makes me slow down my overwhelming ZFS fanboyism.
I just love these guys.
My virtual machine NFS server is going to have to get this as soon as FBSD imports it, and I'll no longer have to worry about having backup software (like BackupPC, good stuff btw) that does this.
I don't use high end SANs but it would seem to me that they are rapidly losing any particular advantage to a Solaris or FBSD file server.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
1. Re:More reason to be a ZFS fanboy by symbolset · 2009-11-02 16:11 · Score: 2, Insightful
  
  I'm curious about these storage needs that shrink. Is this a hypothetical case, or can you provide a real world citation of an example? In a broad world many strange things are found but I always considered this one mythical.
  
  --
  Help stamp out iliturcy.
Re:This is good news... by bcmm · 2009-11-02 11:35 · Score: 4, Insightful

> ...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.
Use open source, get cutting edge things.

--
# cat /dev/mem | strings | grep -i llama
Damn, my RAM is full of llamas.
Next home server will be OpenSolaris (or fBSD) by 0100010001010011 · 2009-11-02 11:36 · Score: 2, Insightful

ZFS, from what I can tell, kicks ass. I've played around with it in virtual machines, taking drives off line, recreating them, adding drives, etc.
When I search NewEgg I also search OpenSolaris' compatibility list.
The two areas where Linux is playing catchup are filesystems (like this) and sound (OSS, Pulse, ALSA, oh my!). And before you go pointing out the btrfs project, this has been in servers for years. It's tried in an enterprise environment. Your file system is still in beta with a huge "Don't use this for important stuff" warning.
Re:Does that mean... by ezzzD55J · 2009-11-02 11:46 · Score: 3, Insightful

The single block is still stored redundantly, of course. Just not redundantly more than once.
Re:Does that mean... by Methlin · 2009-11-02 12:01 · Score: 2, Insightful

> Er, isn't block deduplication really really bad from a hard drive block failure point of view? You'd have to compress or otherwise change the data to have a copy now, or it'd just be marked redundant; if the block that all those redundant nodes point to goes bad, all of those files are now bad.
If you were concerned about block level failure or even just drive level failure, you wouldn't be running your ZFS pool without redundancy (mirror or raidz(2)).
Re:Does that mean... by hedwards · 2009-11-02 12:31 · Score: 2, Insightful

That requires a citation.

ZFS isn't that much different from traditional file systems. I'm not quite sure how that reconciles with the fact that it reports unrecoverable bits of information to you when it can't self-heal. If it were that unusable there'd be no point. Additionally there isn't really much likelihood of that happening considering that ZFS isn't really supposed to be used outside of a mirror or RAIDZ environment. Sure you can do it, but most of the goodness comes from multiple disks.
Open Source Cures Cancer by sjbe · 2009-11-02 12:32 · Score: 1, Insightful

> Use open source, get cutting edge things.
Like cutting edge CAD packages, games, financial management and office suites? Good thing we had you to tell us that open source will solve our every problem just by virtue of it being open source. I'm sure every print shop is going to dump Photoshop for GIMP, every finance firm will dump Excel for OpenOffice Calc and every engineering firm will dump AutoCAD for... what exactly?
Maybe, just maybe open source isn't the answer for everything after all...
1. Re:Open Source Cures Cancer by Anonymous Coward · 2009-11-02 22:17 · Score: 1, Insightful
  
  > But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly?
  Nobody. Seriously, it's pretty rare to get decent support.
  And if all I need is someone to blame, Microsoft works just fine. Everybody has heard them blamed for one reason or another and the execs rarely know or care what we use...
Re:Wake me when they build it into the hard disk by jcr · 2009-11-02 14:29 · Score: 2, Insightful

> what was one block replicated hundreds of times now becomes hundreds of blocks exhausting all storage.
What? Why would that happen?
If you have a block and a hundred COW pointers to it, and you modify one, then you get two blocks, with 99 references to the old one and one reference to the new one.
-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."
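The arithmetic in jcr's comment (100 references, one write, two blocks) can be checked with a small reference-counting sketch. The `CowPool` class and its methods are hypothetical names for illustration only, not ZFS internals.

```python
class CowPool:
    """Toy pool of reference-counted blocks with copy-on-write updates."""

    def __init__(self):
        self.data = {}   # block id -> bytes
        self.refs = {}   # block id -> reference count
        self.next_id = 0

    def alloc(self, payload: bytes) -> int:
        """Store a new block and return its id with one reference."""
        block_id = self.next_id
        self.next_id += 1
        self.data[block_id] = payload
        self.refs[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        """Add another reference to an existing block."""
        self.refs[block_id] += 1
        return block_id

    def write(self, block_id: int, payload: bytes) -> int:
        """Modify through one reference; return the id the writer now uses."""
        new_id = self.alloc(payload)   # never overwrite a block in place
        self.refs[block_id] -= 1       # the writer lets go of the old block
        if self.refs[block_id] == 0:   # last reference gone: free it
            del self.data[block_id], self.refs[block_id]
        return new_id


pool = CowPool()
original = pool.alloc(b"same data")
pointers = [original] + [pool.share(original) for _ in range(99)]  # 100 references
pointers[0] = pool.write(pointers[0], b"changed")
```

After the single write the pool holds two blocks: the old one with 99 references and the new one with 1. Storage only grows by one block per distinct modification, not by one block per reference.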
Re:This is good news... by joe_bruin · 2009-11-02 15:40 · Score: 2, Insightful

> Use open source, get cutting edge things.
I run Linux, where's my ZFS? No, FUSE doesn't count.
Re:This is good news... by 644bd346996 · 2009-11-02 15:44 · Score: 2, Insightful

Time Machine has by far the easiest to use interface of any backup solution that is at least as powerful. And because it does file-level deduplication using hardlinks, the backups themselves are standard directory trees, browsable in every way that the rest of the filesystem is. If the block-level deduplication is part of the backup software and not the filesystem, then the archives will be opaque files that usually can only be manipulated by the backup software itself. This means the user has little or no choice between UIs for restoring from the archive, and it usually prevents the archives from being indexed by something like Spotlight.
By adding one small filesystem feature (hardlinks to directories), Apple made it possible to trivially implement a good incremental backup system. (The under-the-hood parts of Time Machine could be implemented in a fairly short shell script run as a cron job.) They then proceeded to put the slickest UI ever around their backup system, but still left it open for other programs. If Apple added block-level deduplication to their filesystem, they wouldn't even have to touch the Time Machine code and it would become the best personal backup software in history.
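As a rough idea of what such a script might look like, here is a sketch of hardlink-based incremental snapshots. It is an assumption-laden simplification and not Apple's implementation: it links individual files rather than directories, it treats a file as unchanged when size and modification time match, and the function names and demo file names are invented for the example.

```python
import os
import shutil
import tempfile


def unchanged(src_path, old_path):
    """Treat a file as unchanged if size and modification time both match."""
    if not os.path.isfile(old_path):
        return False
    a, b = os.stat(src_path), os.stat(old_path)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def snapshot(source, target, previous=None):
    """Copy source into target, hardlinking files that match the previous snapshot."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in files:
            src_path = os.path.join(root, name)
            new_path = os.path.normpath(os.path.join(target, rel, name))
            old_path = os.path.join(previous, rel, name) if previous else None
            if old_path and unchanged(src_path, old_path):
                os.link(old_path, new_path)       # shares the existing file data
            else:
                shutil.copy2(src_path, new_path)  # copy2 preserves the mtime


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "home")
    snap1 = os.path.join(work, "snap1")
    snap2 = os.path.join(work, "snap2")
    os.makedirs(src)
    with open(os.path.join(src, "a.txt"), "w") as f:
        f.write("never changes")
    with open(os.path.join(src, "b.txt"), "w") as f:
        f.write("version one")
    snapshot(src, snap1)
    with open(os.path.join(src, "b.txt"), "w") as f:
        f.write("version two, longer")
    snapshot(src, snap2, previous=snap1)

    def inode(snap, name):
        return os.stat(os.path.join(snap, name)).st_ino

    a_is_shared = inode(snap1, "a.txt") == inode(snap2, "a.txt")
    b_is_shared = inode(snap1, "b.txt") == inode(snap2, "b.txt")
    with open(os.path.join(snap1, "b.txt")) as f:
        old_b = f.read()
```

Each snapshot is an ordinary directory tree, so any file browser or indexer can read it, while unchanged files cost no extra data blocks.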
Re:There are three types of files. by greg1104 · 2009-11-02 17:14 · Score: 3, Insightful

The main corner case in your suggested "unit file" implementation is where someone is overwriting a file too large for the filesystem to contain two copies of it. You have to truncate when this happens to fit the new one; you can't just keep the old one around until it's replaced. This makes it impossible to meet the spec you're asking for in all cases. The best you can do is try to keep the original around until disk space runs out, and only truncate it when forced to. However, if that's how the implementation works, then applications can't just blindly rely on the filesystem to always do the right thing and "give you that for free". They've still got to create the new file and confirm it got written out before they touch the original if they want to guarantee never losing the original good copy, so that they bomb with a disk space error rather than risk truncating the original. That's why this whole path doesn't go anywhere useful; better to work on popularizing an API for atomic rewrites or something.
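The "create the new file and confirm it got written out before touching the original" pattern looks roughly like this at the application level. It is a minimal sketch assuming a POSIX-style filesystem where a rename within one directory is atomic; `atomic_rewrite` and the demo file name are hypothetical.

```python
import os
import tempfile


def atomic_rewrite(path, data: bytes) -> None:
    """Replace path with data, leaving the original intact if anything fails."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # confirm the new copy reached the disk
        os.replace(tmp_path, path)   # only now does the original go away
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # e.g. out of space: discard the partial copy
        raise


with tempfile.TemporaryDirectory() as work:
    target = os.path.join(work, "settings.conf")
    with open(target, "wb") as f:
        f.write(b"old contents")
    atomic_rewrite(target, b"new contents")
    with open(target, "rb") as f:
        result = f.read()
    leftovers = os.listdir(work)
```

If the disk fills up during the write, the function raises and the original file is untouched, which is the "bomb with a disk space error" behavior described above. The cost is that both copies must fit on the filesystem at once.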
As for your "managed files" case, that won't work for all database approaches. For example, in PostgreSQL, only writes to the database write-ahead log are done with O_SYNC/O_DIRECT. The main data block updates (and writes that are creating new data blocks) are written out asynchronously, and then when internal checkpoints reach their end any unwritten blocks are forced to disk with fsync if they're still in the OS cache. You'd be hard pressed to detect which of your suggested modes was the appropriate one for just the obvious behavior there, and there are still more weird corner cases to worry about buried in there too (like what the database does with the data blocks and the WAL to repair corruption after a crash).
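To make the two write patterns concrete, here is a sketch of what the filesystem would see from such an application: synchronous appends to a log, buffered writes to data blocks, and an fsync at checkpoint time. None of this is PostgreSQL code. The file names and helper functions are invented, and the 8 kB block size is chosen only because it matches PostgreSQL's default page size.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
wal_path = os.path.join(workdir, "wal.log")     # hypothetical file names
data_path = os.path.join(workdir, "table.dat")

# O_SYNC does not exist on every platform; fall back to an explicit fsync.
SYNC_FLAG = getattr(os, "O_SYNC", 0)
BLOCK = 8192

wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | SYNC_FLAG, 0o600)
data_fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o600)


def log_change(record: bytes) -> None:
    """Log writes are synchronous: durable before the call returns."""
    os.write(wal_fd, record)
    if not SYNC_FLAG:
        os.fsync(wal_fd)


def write_block(block_no: int, payload: bytes) -> None:
    """Data block writes are left in the OS cache for now."""
    os.pwrite(data_fd, payload.ljust(BLOCK, b"\0"), block_no * BLOCK)


def checkpoint() -> None:
    """Force anything still sitting in the OS cache out to disk."""
    os.fsync(data_fd)


log_change(b"update block 0\n")
write_block(0, b"row data")
log_change(b"update block 1\n")
write_block(1, b"more row data")
checkpoint()

wal_size = os.path.getsize(wal_path)
data_size = os.path.getsize(data_path)
os.close(wal_fd)
os.close(data_fd)
shutil.rmtree(workdir)
```

From the filesystem's side, one file is opened for synchronous writes and the other is written lazily and synced now and then, so no single "managed file" mode describes both.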
Both these highlight that it's hard to make improvements here at just the filesystem level. Some of the really desirable behavior is hard to do unless applications are modified to do something different too. That hasn't really been going well for ext4 this year, and how that played out highlights how hard an issue this is to crack.
Not quite the wonderful thing it appears to be by pjr.cc · 2009-11-02 19:32 · Score: 2, Insightful

De-dup has been around for a while and has some advantages and quite a few negatives... First off, I'd be interested to see how many patent trolls this might stumble over. But de-dup has always gone hand in hand with backups and golden images. EMC, HDS and co never did a good job supporting golden images, but other storage vendors have done well with it (3PAR, Compellent, EqualLogic).
For the uninitiated, golden images usually consist of building a machine on a SAN, and then using that one image to power many machines (i.e. the same blocks on disk). It then usually just stores deltas from the golden image for each machine... it's got its advantages and disadvantages, much like de-dup.
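The golden image arrangement can be pictured as one shared, read-only list of blocks plus a small per-machine table of changed blocks. This is a toy model with made-up class names and block contents, not any vendor's implementation.

```python
class GoldenImage:
    """Shared, read-only base image."""

    def __init__(self, blocks):
        self.blocks = tuple(blocks)


class Machine:
    """One machine booted from the golden image, storing only its own changes."""

    def __init__(self, golden):
        self.golden = golden
        self.delta = {}  # block index -> private replacement block

    def read(self, index):
        # Prefer the machine's own version of a block, else fall back to the base.
        return self.delta.get(index, self.golden.blocks[index])

    def write(self, index, block):
        self.delta[index] = block  # the golden image itself is never modified


golden = GoldenImage([b"boot", b"kernel", b"libs", b"config"])
machines = [Machine(golden) for _ in range(50)]
machines[7].write(3, b"config-for-host-7")

logical_blocks = len(machines) * len(golden.blocks)
stored_blocks = len(golden.blocks) + sum(len(m.delta) for m in machines)
```

Fifty machines present 200 logical blocks while only five are stored. It also shows where the pain comes from: every machine reads the same four base blocks, so they become the hot spot.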
Now, the reason for its use is simple: "pay less for storage", which sounds dumb in this day and age (with 1TB drives costing virtually nothing), but the reality is that in the SAN world 1TB drives cost a fortune, and wherever you use de-dup or golden images, you usually use the fastest (and smallest) disk you can get your hands on. (If you don't understand why this is, see the Backblaze article from a little while ago; ultimately, putting more space in a bit of SAN storage kit is freaking expensive.) In the enterprise world, it's almost impossible to step away from SAN storage (unless you're Google or Backblaze).
The big problem with de-dup (and why it's primarily used for backups, and primarily only disk-based backups) is how it affects the storage. If you suddenly have one hot spot, even on fast disk, the storage starts grinding to a halt (even when considering caching) because lots of things start accessing the same blocks on the disk. This is not a problem for backups because it's usually a once-written, rarely-read scenario. On file servers and databases, it's a performance killer (something akin to RAID 5/6 in software). But de-dup is fantastic for archival storage!
De-dup and performance often tend to be a self-fulfilling prophecy though, simply because data that is duplicated is often duplicated because it's heavily accessed. Take email as a good example. Joe sends out an email with an attachment of some form (perhaps it's a document template, but it really doesn't matter so long as he's sending it to a large number of people), and all those people save the attachment and probably make some edits. This introduces the next load of pain: fragmentation. All those deltas from the original now need to be saved "somewhere else", and meanwhile all these people are accessing not just the de-dup'ed blocks but the fragmented changes (consider the kernel source for Linux as well, tonnes of branches of code that would possibly get de-duped and fragmented).
Databases are another great example. Often in tablespaces there is quite a load of block-aligned duplicate data; often this is the nature of how databases store data. Sometimes this data can be quite critical to their function, and to have a database slamming the same blocks (again with small fragmented changes) is pain personified.
Still, I wonder how many patents Sun is likely to trip over... I see this being non-fun, as there are many people who make serious cash from de-dup at various levels...
Re:Does that mean... by noidentity · 2009-11-02 20:20 · Score: 2, Insightful

> Duplicate slashdot articles will be links back to the original one?

No, see, this de-duplication is transparent at the interface level. So while dupes won't take extra disk space on Slashdot servers, we'll still see them as normal. Isn't it nice to know that this optimization will be taking place?