Data Deduplication Comparative Review
snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."
Actually, it is automatic. ZFS already assumes you have a multithreaded OS running on more cpu than you probably need (e.g. Solaris), so it's already doing checksums (up to and including SHA256) for each data block in the filesystem. Comparing checksums (and optionally entire datablocks) to determine what blocks are duplicates isn't that much extra work at that point, although for deduplication you probably want to use a beefier checksum than you might choose otherwise, so there is some increase in work. http://blogs.sun.com/bonwick/entry/zfs_dedup has some more information on it. Getting it onto my linux box, now.. there's the rub. userspace ZFS exists, but I've only seen one pointer to a patch for it that includes dedup, and I haven't heard any stability reports on it yet.