ZFS Gets Built-In Deduplication
elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
Windows Storage Server 2003 (yes, yes I know its from Microsoft) shipped with this feature (that is called Single Instance Storage)
http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a
Before I left Acronis, I was the lead developer and designer for deduplication in Acronis Backup & Recovery 10. We also used SHA256 there, and naturally the possibility of a hash collision was investigated. After we did the math, it turned out that you're about 10^6 times more likely to lose data because of hardware failure (even considering RAID) than you are to lose it because of a hash collision.
How about this: you can't remove a top-level vdev without destroying your storage pool. That means that if you accidentally use the "zpool add" command instead of "zpool attach" to add a new disk to a mirror, you are in a world of hurt.
How about this: after years of ZFS being around, you still can't add or remove disks from a RAID-Z.
How about this: If you have a mirror between two devices of different sizes, and you remove the smaller one, you won't be able to add it back. The vdev will autoexpand to fill the larger disk, even if no data is actually written, and the disk that was just a moment ago part of the mirror is now "too small".
How about this: the whole system was designed with the implicit assumption that your storage needs would only ever grow, with the result that in nearly all cases it's impossible to ever scale a ZFS pool down.
Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point. Currently the only solutions to these is to build a new storage pool, either on the same system or different system, and export/import; big PITA and potentially expensive. Off the top of my head I can't think of anyone that lets you do #2 except enterprise storage solutions and Drobo.
From that link: It is file-based and a service indexes it (whereas in ZFS it is block-based and on-the-fly). And they first introduced it in Windows Server 2000. Amazing. I'm sure it is a ugly hack since Windows has no soft/hard-links IIRC.
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in doing the deduplication process, and drive space at consumer-level sizes is dirt-cheap, so it's only really worth doing this you have a lot of block-level duplicate data. That might be the case if e.g. you have 30 VMs on the same machine each with a separate install of the same OS, but is unlikely to be the case on a normal Mac laptop.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
You recall wrong. NTFS has long supported both hard links and a mechanism called 'reparse points,' which are much more powerful than simple symlinks.
How alarmist and uninformed; borderline FUD. The reality is as follows...
First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.
The bug report for this problem goes back to at least April of 2003. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.