Ask Slashdot: Free/Open Deduplication Software?
First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though there's some active discussion.
ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems, who sell expensive NAS and SAN systems so they have an incentive to keep it improving.
I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.
I am TheRaven on Soylent News
FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.
Trolling is a art,
I've used LessFS. On my "server" powered by an Intel Atom, it is very very slow. It writes at about 5MB/sec, even when everything is inside a ram disk.
You can't use a block size of 4KB, otherwise write speeds drop to around 256KB/sec; you need to use at least 64KB.
Acronis is a bloody NIGHTMARE to deal with. We have a mixed shop here, and after seeing what Acronis does on Windows I vetoed the idea of having it on our mission critical linux servers.
I have never seen such a useless backup product before I started working with Acronis. Most backup systems let you set it up once and they WORK. Acronis is always getting itself wedged (dur, a metadata file I miswrote yesterday is corrupt, I will just hang), and when wedged it hangs ALL backup jobs, not just the one that is stuck. And the only "fix" is to redo all the jobs from scratch. No other backup system needs as much handholding as Acronis.
Acronis claims to have an excellent recovery environment. I haven't used it, but I am sure it is fantastic when you finally dig up a month-old backup to restore from because Acronis had stopped working.
So you want DragonFly BSD with the HAMMER filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.
As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.
I am TheRaven on Soylent News
However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.
Slagborr
It needs an incredible amount of memory to operate effectively.
From my university notes:
5TB data, average block size 64K = 78125000 blocks
For each block the dedup needs 320 bytes, so
78125000 x 320 bytes = 25 GB dedup table
Use compression instead (e.g. ZFS compression).
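The arithmetic in these notes can be checked with a few lines of Python. This is only a sketch: the 320 bytes per block is the figure quoted above, decimal units (1 TB = 10^12 bytes, 64K = 64,000 bytes) are assumed because they reproduce the block count in the notes, and the function name is made up for illustration.

```python
def dedup_table_bytes(data_bytes, avg_block_bytes, entry_bytes=320):
    """Estimate the dedup table size: one table entry per stored block.

    entry_bytes=320 is the per-block figure from the notes above,
    not a value measured on any particular system.
    """
    blocks = data_bytes // avg_block_bytes
    return blocks, blocks * entry_bytes


# 5 TB of data with an average block size of 64 KB (decimal units assumed)
blocks, table = dedup_table_bytes(5 * 10**12, 64 * 10**3)
print(blocks)          # number of blocks
print(table / 10**9)   # table size in GB
```

The table grows in inverse proportion to the block size, so the same 5TB stored in 4KB blocks would need a table 16 times larger.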
I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.
Secession is the right of all sentient beings.
Interesting link, but it doesn't look like Microsoft has actually released this yet and it is only slated to be released with Win 8 server, and it will come with some caveats.
FTFA:
"It is a server-only feature, like so many of Microsoft’s storage developments. But perhaps we might see it deployed in low-end or home servers in the future.
It is not supported on boot or system volumes.
Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. And of course the connected system must be running a server edition of Windows 8.
Although deduplication does not work with clustered shared volumes, it is supported in Hyper-V configurations that do not use CSV.
Finally, deduplication does not function on encrypted files, files with extended attributes, tiny (less than 64 kB) files, or re-parse points."
In Linux, I would avoid any backup system that doesn't support hard links, long paths, file attributes, file access control lists and SELinux contexts.
Some of the "offerings" out there are so Windows-centric that they can't even handle "Makefile" and "makefile" being in the same directory.
In Windows, I would require that it backs up and restores both the long and the short name of files, if short name support is enabled for the file system (default in Windows). Why? If "Microsoft Office" has a short name of MICROS~3 and it gets restored with MICROS~2, the apps won't work, because of registry entries using the short name.
I'd also look for one that can back up NTFS streams. Some apps store their registration keys in NTFS streams.
In all cases, Acronis does not measure up to what I require of a backup program. Also because the restore environment doesn't even work unless you have hardware compatible with its drivers. You may be able to back up, and even boot the restore environment, but not do an actual restore from it.
ArcServe is better - for Linux, it still lacks support for file attributes and the hardlink handling is rather peculiar during restore, but at least it handles SELinux and dedup of the backup.
An option for dedup on Linux file systems would be nice - the easiest implementation would be COW hardlinks. But like for Microsoft's new NTFS, you'd need something that scans the file system for duplicates. And it better have an attrib for do-not-dedup too, because of how expensive COW can be for large files, or to avoid file fragmentation for specific files.
Considering Linux does have this capability in a few FS drivers now (ok.. some more stable than others, sure) I think the GP should be modded troll rather than the post pointing out it's likely a shill... too bad I'm out of mod points
BackupPC is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably Windows) PCs, it's great with *nix.
http://backuppc.sourceforge.net/
Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.
It can also make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've copied with dd, hook them up, and start to run restores.
The hard links mean lots and lots of disk head seeks, so you are doing random I/O on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSDs to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
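For anyone wondering what file-level deduplication via hard links looks like in practice, here is a minimal sketch. It is not how BackupPC itself is implemented; the function names are hypothetical, it assumes a filesystem that supports hard links, and it ignores differences in ownership, permissions and timestamps, which a real backup tool cannot do.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace files with identical content under root by hard links.

    Returns the number of files that were replaced by a link.
    Assumption: metadata differences between duplicates do not matter.
    """
    first_seen = {}  # content digest -> first path seen with that content
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            original = first_seen.setdefault(file_digest(path), path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked to it
            # Link under a temporary name, then atomically swap it in.
            tmp = path + ".dedup-tmp"
            os.link(original, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Every duplicate ends up as just another directory entry for the same inode, which is why a file-by-file copy of such a store has to track an enormous number of links.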
We check backups and run a test restore on each and every server every month (we had this rule before we started with Acronis).
Acronis is awful. Frankly, someone should open a fraud investigation. Acronis products have no business being sold as enterprise backup solutions. Fucking ntbackup is far more reliable.
Right NOW I am wrestling with Acronis Backup & Recovery 11's retarded cousin, Acronis Recovery for Microsoft Exchange. I seriously want to sue the sadistic and incompetent assholes at Acronis for all the pain and suffering they are causing.
Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
Yes, but in that case, two multi-GB files that share all of their data except one bit will not be deduplicated. The difference between your approach and Microsoft's is grounded in the same thought process that makes modern compression algorithms better than older ones:
First you treat all files separately, which is really simple but has the drawback of missing chances for compression/dedup across files. This is what deflate (ZIP/GZIP) and your approach to dedup do. The same data simply gets compressed twice -- or in your case not deduplicated if the data is even marginally different. You will never reach the maximum space-saving that way, even though you can at least be sure to be reasonably fast.
Then you notice that files sharing most data should only be compressed/deduped once and then just linked together. The easiest way to do that is to cut up the files into blocks and compress them. If two blocks are the same, you don't recompress them but just put in a link to the previous compression. This is what (roughly speaking) BZIP2, RAR, ACE and some other formats do. In deduping terms this means creating multi-level hashes for each file. It works much better, but has the price of being more complex and time consuming.
Finally, you notice that cutting up files at fixed boundaries is also wasteful. If two blocks are the same, but one has all bytes shifted left one position, you needlessly waste space. Thus, you try to identify if you can dynamically cut up the files/stream into chunks that you have already compressed, plus a handful of spare bytes here or there or with a very simple substitution/transposition function applied. This is (extremely roughly speaking) what LZMA of 7-Zip fame does and what Microsoft tries to do differently in their dedup approach.
Of course, going that way is even MORE complex and time consuming, but may be well worth it, if space-saving is what you're interested in. After all, there is no such thing as a free lunch -- you either pay with time or with space (or with general applicability in some corner cases).
So, all in all, the approach itself is not new -- neither yours nor Microsoft's -- but the magic lies in actually creating a working product out of the theoretical approach outlined above.
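To make the three stages above concrete, here is a small Python sketch that compares whole-file hashing, fixed-size blocks and content-defined chunks on two inputs that differ only by one byte inserted at the front. It is a toy under stated assumptions, not Microsoft's algorithm or LZMA: the chunker simply cuts wherever the CRC32 of the last 16 bytes has its low 8 bits clear, and the window size, mask, block size and function names are all made up for illustration. Production systems typically use a true rolling hash plus minimum and maximum chunk sizes.

```python
import hashlib
import random
import zlib


def digest(chunk):
    """Hash a chunk so identical chunks can be recognised."""
    return hashlib.sha256(chunk).hexdigest()


def fixed_blocks(data, size=256):
    """Cut the data at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0xFF):
    """Cut wherever a hash of the preceding `window` bytes matches a pattern.

    The cut points depend only on nearby content, so they move along
    with the data when bytes are inserted earlier in the stream.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunks(chunker, a, b):
    """Count distinct chunks that appear in both inputs."""
    return len({digest(c) for c in chunker(a)} & {digest(c) for c in chunker(b)})


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"X" + original  # same data, shifted by one byte

print("whole-file match:", digest(original) == digest(shifted))
print("fixed blocks shared:", shared_chunks(fixed_blocks, original, shifted))
print("content-defined chunks shared:",
      shared_chunks(content_defined_chunks, original, shifted))
```

The whole-file hashes differ and the fixed blocks no longer line up, while the content-defined chunker finds the same cut points again after the first one, so nearly every chunk is recognised as a duplicate.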