Ask Slashdot: Free/Open Deduplication Software?
First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though there's some active discussion.
That deduplication for NTFS is really interesting, actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.
Some technical details about the deduplication process:
Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.
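For anyone wondering what "fingerprints of split points" can look like in practice, here is a minimal sketch of content-defined chunking. It is not Microsoft's algorithm, which isn't described above; it assumes a plain polynomial rolling hash, and the window, mask, and chunk-size constants are toy values picked for illustration (far smaller than the 32-128 KB quoted).

```python
import hashlib
import random

WINDOW = 48             # bytes of context that decide a split point (assumed)
BASE = 257              # polynomial base for the rolling hash (assumed)
MOD = (1 << 61) - 1     # large prime modulus
MASK = (1 << 11) - 1    # split when the low 11 bits are all set: ~2 KiB average
MIN_CHUNK = 256         # toy limits, not the sizes quoted in the post
MAX_CHUNK = 8192
DROP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def chunk(data):
    """Split data where a rolling hash of the last WINDOW bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * DROP) % MOD   # forget the oldest byte
        h = (h * BASE + byte) % MOD                   # take in the newest byte
        length = i + 1 - start
        if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                   # trailing partial chunk
    return chunks


def fingerprints(data):
    """Set of SHA-256 digests, one per chunk."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}
```

Because the split decision depends only on the last few bytes of content, running fingerprints() on a file and on a copy with a byte inserted near the start should give mostly overlapping sets.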
After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self describing, so a deduplication NTFS volume can be read by another server without any external data.
There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.
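A toy model of the chunk store described above might look like the following. Everything here is an assumption for illustration: the class name, the in-memory dicts standing in for on-disk regions, SHA-256 as the checksum, and zlib as the compressor. Only the "more than 100 references gets a second copy" rule comes from the post.

```python
import hashlib
import zlib


class ChunkStore:
    """Hypothetical in-memory chunk store: compressed, checksummed, refcounted."""

    def __init__(self, hot_threshold=100):
        self.hot_threshold = hot_threshold
        self.primary = {}   # digest -> compressed chunk
        self.mirror = {}    # second copy of heavily referenced chunks
        self.refs = {}      # digest -> number of references

    def put(self, chunk):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.primary:
            self.primary[digest] = zlib.compress(chunk)
        self.refs[digest] = self.refs.get(digest, 0) + 1
        # Referenced more than hot_threshold times: keep a second copy.
        if self.refs[digest] > self.hot_threshold:
            self.mirror.setdefault(digest, self.primary[digest])
        return digest

    def get(self, digest):
        for copy in (self.primary.get(digest), self.mirror.get(digest)):
            if copy is None:
                continue
            try:
                chunk = zlib.decompress(copy)
            except zlib.error:
                continue                      # damaged copy, try the next one
            if hashlib.sha256(chunk).hexdigest() == digest:
                self.primary[digest] = copy   # repair the primary if needed
                return chunk
        raise IOError("no intact copy of chunk " + digest)
```

A real implementation would put the second copy on a different region of the disk; the point of the sketch is only that the checksum doubles as the lookup key and as the integrity check.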
Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.
New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
The most interesting thing is that Microsoft Research says it hardly affects performance at all. So when are we going to see Linux equivalents? Because Linux is falling behind on the new technologies.
...includes dedupe.
There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is quite unlike ZFS, which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.
These posts express my own personal views, not those of my employer
ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems, who sell expensive NAS and SAN systems so they have an incentive to keep it improving.
I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.
I am TheRaven on Soylent News
FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.
Trolling is a art,
I've used LessFS. On my "server" powered by an Intel Atom, it is very very slow. It writes at about 5MB/sec, even when everything is inside a ram disk.
You can't use a block size of 4KB, otherwise write speeds are around 256KB/sec; you need to use at least 64KB.
Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.
Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.
If you're looking for a free backup product that supports deduplication, look at BackupPC. Powerful and complex, but free. I've used it for years with good results.
So you want DragonFly BSD with the HAMMER filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.
As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.
I am TheRaven on Soylent News
However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.
Slagborr
It needs an incredible amount of memory to operate effectively.
from my university notes:
5TB data, average blocksize 64K = 78125000 blocks
for each block the dedup needs 320 bytes so
78125000 x 320 byte = 25 GB dedup table
use compression instead. (eg zfs compression)
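The arithmetic in those notes can be checked in a few lines. This is only a back-of-the-envelope sketch: the 320 bytes per entry is the figure from the notes above, not something measured, and decimal units are assumed (the binary-unit version works out to 25 GiB as well).

```python
def dedup_table_bytes(data_bytes, avg_block_bytes, entry_bytes=320):
    """Estimate in-memory dedup table size: one table entry per block."""
    blocks = -(-data_bytes // avg_block_bytes)   # ceiling division
    return blocks * entry_bytes


TB, KB, GB = 10**12, 10**3, 10**9   # decimal units, as in the notes

print(dedup_table_bytes(5 * TB, 64 * KB) / GB)   # 25.0
print(dedup_table_bytes(5 * TB, 4 * KB) / GB)    # 400.0
```

The second line shows why small blocks hurt: the same 5 TB at a 4K average block size would need a table sixteen times larger.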
I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.
Secession is the right of all sentient beings.
find . -type f -exec md5sum {} + | sort
...and so on
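The "and so on" part, grouping files whose hashes match, could be sketched like this. It swaps MD5 for SHA-256 and buckets files by size first so that most files never get read; find_duplicates and file_digest are made-up names for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, bufsize=1 << 20):
    """SHA-256 of a file, read in 1 MiB pieces so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue                  # a unique size cannot have a duplicate
        for path in paths:
            by_hash[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

This only finds whole-file duplicates; two files that differ by a single byte will not be grouped.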
http://michaelsmith.id.au
$ md5sum
d41d8cd98f00b204e9800998ecf8427e
$ cat aaah.png >
$ md5sum
d41d8cd98f00b204e9800998ecf8427e
-
Duplicates!
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
I just use rsync from the command line to do deduplication. Been working like a charm for years.
First I sync from the remote directory to a local base directory:
rsync --partial -z -vlhprtogH --delete root@www.mydomain.net:/etc/ /backup/server/www/etc/base/
Then I sync that to the daily backup. Files that have not changed are hard-linked between all the days that share them. It's very efficient and simple, and retrieving files is as simple as doing a directory search.
rsync -vlhprtogH --delete --link-dest=/backup/server/www/etc/base/ /backup/server/www/etc/base/ /backup/server/www/etc/2012-01-04
-Dave
BackupPC is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably Windows) PCs, it's great with *nix.
http://backuppc.sourceforge.net/
Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.
It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've dd-copied, hook them up, and start to run restores.
The hard links mean lots and lots of disk head seeks, so you are doing random I/O on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSDs to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
I think file-level de-dupe is usually a lot less effective because it can't accommodate files that differ only slightly but are otherwise the same, whereas block-level de-dupe works with everything.
I also don't know what happens in your scheme when you have "de-duped" a file that's the same in 4 different directories but then one application wants to change "its" version of the file. It sounds like it trashes the file for the three other uses of it since there's no way to automate copy-on-write with your shell script but maybe my clue isn't working.
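For hard-link schemes, the answer depends on how the file gets changed. A small sketch of the two cases, with hypothetical helper names: writing into the existing file changes it under every name, while writing a new file and renaming it into place (which is what rsync does by default, unless --inplace is given) leaves the other names pointing at the old data.

```python
import os


def update_in_place(path, text):
    """Overwrite the existing inode; every hard link sees the change."""
    with open(path, "w") as f:
        f.write(text)


def update_by_rename(path, text):
    """Write a new file and rename it over the old name.

    Other hard links keep pointing at the old inode and the old data.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)


def read(path):
    with open(path) as f:
        return f.read()
```

So a backup tree that is only ever written by rsync is safe, but an application editing a hard-linked file in place would change every "copy" at once, since nothing here provides copy-on-write.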
Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Try Squashfs which creates deduplicated and compressed filesystem archives (http://www.linux-mag.com/id/7357/ for a good journal article).
If you're using Ubuntu, Debian, or Fedora, Squashfs will already be built into your distro kernel, and the squashfs-tools will also already be available in your distro repository.
Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
Yes, but in that case, two multi-GB files that share all of their data except one bit will not be deduplicated. The difference between your approach and Microsoft's is grounded in the same thought process that makes modern compression algorithms better than older ones:
First you treat all files separately, which is really simple but has the drawback of not exploiting chances for compression/dedup across files. This is what deflate (ZIP/GZIP) and your approach to dedup do. The same data simply gets compressed twice -- or in your case not deduplicated if the data is even marginally different. You will never reach the maximum space-saving that way, even though you can at least be sure to be reasonably fast.
Then you notice that files sharing most data should only be compressed/deduped once and then just linked together. The easiest way to do that is to cut up the files into blocks and compress them. If two blocks are the same, you don't recompress them but just put in a link to the previous compression. This is what (roughly speaking) BZIP2, RAR, ACE and some other formats do. In deduping terms this means creating multi-level hashes for each file. It works much better, but has the price of being more complex and time consuming.
Finally, you notice that cutting up files at fixed boundaries is also wasteful. If two blocks are the same, but one has all bytes shifted left one position, you needlessly waste space. Thus, you try to identify if you can dynamically cut up the files/stream into chunks that you have already compressed, plus a handful of spare bytes here or there or with a very simple substitution/transposition function applied. This is (extremely roughly speaking) what LZMA of the 7-Zip fame does and what Microsoft tries to do differently in their dedup approach.
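The fixed-boundary problem is easy to demonstrate. In this sketch (function names and the 4096-byte block size are just assumptions for illustration), changing one byte in the middle of a file costs one block, while inserting one byte at the front shifts every boundary and costs all of them.

```python
import hashlib
import random


def block_hashes(data, block_size=4096):
    """SHA-256 of each fixed-size block, cut at multiples of block_size."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def shared_blocks(a, b, block_size=4096):
    """Number of distinct blocks that a and b have in common."""
    return len(set(block_hashes(a, block_size)) & set(block_hashes(b, block_size)))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16 * 4096))

edited = bytearray(original)
edited[30000] ^= 0xFF                 # flip one byte in the middle
edited = bytes(edited)

shifted = b"\x00" + original          # insert one byte at the front

print(shared_blocks(original, edited))    # 15 of the 16 blocks still match
print(shared_blocks(original, shifted))   # 0, every block boundary moved
```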
Of course, going that way is even MORE complex and time consuming, but may be well worth it, if space-saving is what you're interested in. After all, there is no such thing as a free lunch -- you either pay with time or with space (or with general applicability in some corner cases).
So, all in all, the approach itself is not new -- neither yours nor Microsoft's -- but the magic lies in actually creating a working product out of the theoretical approach outlined above.
Since most file servers have about 95% unused processor cycles and a limited amount of disk I/O, both compression and dedupe can be significant wins provided the extra randomness they add to the I/O profile is smaller than their effective compression (i.e. if they add 10% randomness to the I/O profile but provide 30% compression then it's probably a net win). The fact that they potentially increase cache effectiveness is just gravy since cache is a few orders of magnitude faster than spinning disk and at least an order of magnitude faster than even SSDs.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Hard links are awesome, but they're limited to a per-file basis. SDFS and other block-level de-dupers will only store unique blocks. E.g. storing multiple virtual machine images -- as each image is one huge file, hard links do nothing.
There's a form of deduplication supported by the Linux kernel, if you use the logical volume manager. If you create a base LVM device, and then create a snapshot of that device, the snapshot only requires sufficient real estate on the host physical volume to store the diffs between the snapshot and the base. You can use this for "freezing" a file system to do back-ups, or for incremental back-ups, or whatever.
My rather limited experience with this is that, if you have more than a few snapshots on a base device, your write performance degrades very rapidly. There's also a hard limit of 255 snapshots per device.
You can also do file-based deduplication with the "rsnapshot" tool, which has been available for many years.
Also also, I haven't kept up, but I seem to recall that ZFS for Linux was promising this as a major selling point.
2*3*3*3*3*11*251
OpenBSD has had the Epitome deduplication framework for some time. I believe version 2 is considered production-ready.
If you don't think OpenSolaris has a future, fair enough. FreeBSD does. FreeBSD currently supports ZFS v28, which has dedup. Be aware you need plenty of RAM.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.