Ask Slashdot: Free/Open Deduplication Software?

Posted by timothy on Wednesday January 4, 2012 @08:54AM from the the-dept-dept-the-from-dept-from-from dept.

First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, although there's some active discussion.

OpenSolaris but not FreeBSD? by TheRaven64 · 2012-01-04 09:05 · Score: 3, Informative

ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems, who sell expensive NAS and SAN systems so they have an incentive to keep it improving.
I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.

--
I am TheRaven on Soylent News
1. Re:OpenSolaris but not FreeBSD? by Anonymous Coward · 2012-01-04 09:57 · Score: 4, Informative
  
  People considering either dedup or compression on FreeBSD should be made blatantly aware of one of the issues which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters, etc.). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb has a big improvement, but it's still easily noticeable.
  My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.
  Here's confirmation and reading material for those who think my statements are bogus. The problem:
  http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html
  http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html
  And how OpenIndiana/Illumos solved it:
  http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html
FreeBSD has ZFS by grub · 2012-01-04 09:06 · Score: 3, Informative

FreeBSD, and FreeNAS, which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.

--
Trolling is a art,
Lessfs is slow on Atom by Dwedit · 2012-01-04 09:09 · Score: 3, Informative

I've used LessFS. On my "server" powered by an Intel Atom, it is very very slow. It writes at about 5MB/sec, even when everything is inside a ram disk.
You can't use a block size of 4KB, otherwise write speeds are around 256KB/sec, need to use at least 64KB.
1. Re:Lessfs is slow on Atom by BitZtream · 2012-01-04 09:54 · Score: 3, Informative
  
  No you didn't.
  You got 40MB writing to memory cache possibly, not the ZFS store.
  I have a quad-disk, 8-core, 8 GB machine that ONLY does ZFS. Sustaining 40MB/s doesn't happen without special tuning, turning off write cache flushing and a whole bunch of other stuff ... unless I stay in memory buffers. Once that 8 GB fills, or the 30-second default timeout for ZFS to flush expires, the machine comes to a standstill while the disks are flushed, and at that point the throughput rate drops well below 40MB/s since it is actually finally putting that data on disk.
  Without compression and dedup, with possibly low-end checksumming, you may be able to write that fast. With compression or checksumming, there's absolutely no way your processor is moving the data fast enough.
  This is a well-known and well-documented set of issues. If you haven't experienced it, it's only because you really aren't using your NAS under any sort of real workload.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Re:Acronis by Anonymous Coward · 2012-01-04 09:15 · Score: 2, Informative

Acronis is a bloody NIGHTMARE to deal with. We have a mixed shop here, and after seeing what Acronis does on Windows I vetoed the idea of having it on our mission critical linux servers.
I have never seen such a useless backup product before I started working with Acronis. Most backup systems let you set it up once and they WORK. Acronis is always getting itself wedged (dur, a metadata file I miswrote yesterday is corrupt, I will just hang), and when wedged it hangs ALL backup jobs, not just the one that is stuck. And the only "fix" is to redo all the jobs from scratch. No other backup system needs as much handholding as Acronis.
Acronis claims to have an excellent recovery environment. I haven't used it, but I am sure it is fantastic when you finally dig up a month-old backup to restore from because Acronis had stopped working.
dragonflybsd by Anonymous Coward · 2012-01-04 09:17 · Score: 2, Informative

So you want DragonFly BSD with a HAMMER filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.
Re:FreeBSD by TheRaven64 · 2012-01-04 09:21 · Score: 5, Informative

As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.

--
I am TheRaven on Soylent News
No dedup in FreeNAS by Svenne · 2012-01-04 09:23 · Score: 5, Informative

However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.

--

Slagborr
Dedup is just a marketing word.... by Anonymous Coward · 2012-01-04 09:23 · Score: 3, Informative

it needs an incredible amount of memory to operate effectively.
from my university notes:
5TB data, average blocksize 64K = 78125000 blocks
for each block the dedup needs 320 bytes so
78125000 x 320 byte = 25 GB dedup table
use compression instead. (eg zfs compression)
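
The arithmetic in these notes can be reproduced with a short Python sketch. It assumes decimal units (5 TB = 5 * 10**12 bytes, 64K = 64,000 bytes), which is what makes the block count come out to 78,125,000. The 320 bytes per entry is the figure quoted above rather than a measured value, and `dedup_table_bytes` is a hypothetical helper name.

```python
def dedup_table_bytes(data_bytes: int, block_bytes: int, entry_bytes: int = 320) -> int:
    """Estimate the size of a dedup table with one entry per block."""
    blocks = data_bytes // block_bytes
    return blocks * entry_bytes


# Figures from the notes above, in decimal units.
blocks = 5 * 10**12 // 64_000
table = dedup_table_bytes(5 * 10**12, 64_000)

print(blocks)          # 78125000
print(table / 10**9)   # 25.0 (GB)
```

Halving the block size doubles the number of entries, so the table grows in inverse proportion to the average block size.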
1. Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 09:46 · Score: 5, Informative
  
  All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vice versa.
  Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data at least once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32 bit CRC for each block. You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.
  The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.
  To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.
  The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= to that value) to make room. Once the pass is over another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.
  Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.
  This can be done with any sized CRC but what you cannot do is avoid the verification pass. No matter how big your CRC is, or your SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exact duplicates before you de-dup their block references. A larger CRC is preferable to reduce collisions but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.
  In any case, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.
  -Matt
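
The bounded-memory, multi-pass scan described in the comment above can be sketched roughly as follows. This is an illustration only, not HAMMER's code: `find_duplicate_blocks`, `max_crcs`, and the in-memory `blocks` dict (standing in for the disk) are all assumptions made for the example, and `zlib.crc32` stands in for whatever CRC the filesystem records in its meta-data.

```python
import zlib
from collections import defaultdict

CRC_LIMIT = 1 << 32  # one past the largest 32-bit CRC


def find_duplicate_blocks(blocks, max_crcs):
    """Return sets of block ids whose contents are byte-for-byte identical.

    blocks   -- dict mapping block id -> bytes (stands in for the disk)
    max_crcs -- how many distinct CRCs may be held in memory per pass
    """
    if max_crcs < 1:
        raise ValueError("max_crcs must be at least 1")
    # Stand-in for CRCs that the filesystem already keeps in meta-data.
    meta = {bid: zlib.crc32(data) for bid, data in blocks.items()}

    groups = []
    low = 0  # lowest CRC value not yet handled
    while low < CRC_LIMIT:
        high = CRC_LIMIT  # exclusive upper bound; shrinks when memory fills
        table = defaultdict(list)
        for bid, crc in meta.items():  # one meta-data pass
            if not low <= crc < high:
                continue
            table[crc].append(bid)
            if len(table) > max_crcs:
                # Over budget: forget the highest CRC and stop recording
                # anything at or above it until a later pass.
                high = max(table)
                del table[high]
        # Verification: read the bulk data only where CRCs matched.
        for bids in table.values():
            if len(bids) < 2:
                continue
            by_content = defaultdict(set)
            for bid in bids:
                by_content[blocks[bid]].add(bid)
            groups.extend(g for g in by_content.values() if len(g) > 1)
        low = high  # next pass starts where this one stopped
    return groups
```

A smaller `max_crcs` means more passes over the meta-data but the same result, which is the memory versus I/O trade-off the comment describes. The verification step compares actual block contents, so a CRC collision never merges two different blocks.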
2. Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 10:48 · Score: 3, Informative
  
  Yes, this is correct.
  For on-line de-duplication the most optimal case in my view is to only de-dup data which may already be present in the buffer cache from prior recent operations, so the on-line dedup only maintains a small in-kernel-memory table of recent CRCs. This catches common operations such as file and directory tree copying fairly nicely.
  The off-line dedup catches everything using a fixed amount of memory and multiple passes (if necessary) on the meta-data, then bulk data reads only for those blocks which appear to be duplicates to verify that they are exact copies.
  I've run dedup on a 2TB backup from a VM with as little as 192MB of ram and it works. A more preferable setup would be to have a bit more memory, like a gigabyte, but more importantly to have a SSD large enough to cache the filesystem meta-data. A 40G SSD is usually enough for a 2TB filesystem. That makes the off-line dedup quite optimal and also makes other maintenance and administrative operations on the large filesystem (du, find, ls -lR, cpdup, even a smart diff, let alone rsync or other things one might want to run) go screaming fast without having to waste money buying a bigger system or waste money on excessive energy use.
  -Matt
3. Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 11:15 · Score: 4, Informative
  
  Well, I can tell you why the option is there... it's not because of collisions, it's there to handle the case where there is a huge amount of actual duplication where the blocks would verify as perfect matches. In this case the de-duplication pass winds up having to read a lot of bulk-data to validate that the matches are, in fact, perfect, which can take a lot of time versus only having to read the meta-data.
  Just on principle I think it's a bad idea to just trust a checksum, cryptographic hash, CRC, or whatever. Corruption is always an issue... even if the filesystem code itself is perfect and even if the disk subsystem is perfect there is so much code running in a single address space (i.e. the KERNEL itself) that it is possible to corrupt a filesystem just from hitting unrelated bugs in the kernel.
  Not to mention radiation flipping a bit somewhere in the cpu or memory (even for ECC memory it is possible to get corruption, but the more likely case is in the billions of transistors making up a modern cpu, even with parity on the L1/L2/L3 caches).
  Hell, I don't even trust IP's stupid simple 1's complement checksum in HAMMER's mirroring protocols. Once during my BEST Internet days we had a T3 which bugged out certain bit patterns in a way that actually got past the IP checksum... we only tracked it down because SSH caught it in its stream and screamed bloody murder.
  If you de-duplicate trusting the meta-data hash, even a big one, what you can end up doing is turning 9 good and 1 corrupted copies of a file into 10 de-duped corrupted copies of the file.
  I'm sure there are many data stores that just won't care if that happens every once in a while. Google's crawlers probably wouldn't care at all, so there is definitely a use for unverified checks like this. I don't plan on using a cryptographic hash as large as the one ZFS uses any time soon, but being able to optimally de-dup with 99.9999999999% accuracy is a reasonable argument for having one that big.
  -Matt
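
One well-known weakness of the 16-bit one's-complement Internet checksum is that it is a plain sum, so it cannot see 16-bit words arriving in a different order. The sketch below shows that class of undetected error; it is not a reconstruction of the T3 fault described above, whose exact bit patterns the comment does not give. `internet_checksum` is a hypothetical helper written in the style of RFC 1071.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


good = b"\x12\x34\xab\xcd"
bad = b"\xab\xcd\x12\x34"  # the same two words, swapped

print(good != bad)                                         # True
print(internet_checksum(good) == internet_checksum(bad))   # True
```

Because addition is commutative, any reordering of whole 16-bit words leaves the checksum unchanged, which is one reason a stronger end-to-end check (such as the one SSH applies to its stream) can catch corruption the IP layer misses.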
4. Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 11:47 · Score: 3, Informative
  
  For our production systems it depends 100% on the actual amount of duplicated data, since bulk data reads are needed to verify the duplication. The number of passes is almost irrelevant because they primarily scan meta-data N times, not bulk data (duplicated bulk data only has to be verified once).
  The meta-data can be scanned much more quickly than the verification of duplicated bulk data because the meta-data is laid out on the physical disk fairly optimally for the B-Tree scan the de-dup code issues. So meta-data can be read from the hard disk at 40 MBytes/sec even without the use of a SSD to cache it. Of course, with DFly's swapcache and the meta-data cached on the SSD that scan runs at 200-300 MBytes/sec.
  But in contrast, the bulk reads used to validate the duplicate data just aren't going to be laid out linearly on the disk. There's a lot of skipping around... so the more actual duplicate data we have the larger the percentage of the disk's surface we have to read to verify it.
  This is an area which I could further optimize in HAMMER's dedup code. Currently I do not sort the bulk data block numbers when running the data verification pass. Not only that but I am scanning a sorted CRC list, so the bulk data offsets are going to be seriously unsorted. Sorting them would definitely improve performance, probably quite a bit, but it would still not be anywhere near the 40 MBytes/sec the meta-data scan can achieve off the platter. It would not be a whole lot of programming, probably a day's work. It currently isn't at the top of my list, though.
  What this means, in summary (and even with semi-sorting of the bulk data blocks), is that one can use a bounded amount of ram without really affecting the efficiency of the off-line de-duplication.
  -Matt
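
To illustrate why reading verification blocks in CRC order scatters the disk head, here is a toy model that measures total head travel as the sum of distances between consecutive offsets. The `candidates` list and `head_travel` function are made up for the example; real seek cost also depends on rotational latency and drive caching, which this ignores.

```python
def head_travel(offsets):
    """Total distance covered when blocks are read in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


# Hypothetical (crc, disk offset) pairs for blocks awaiting verification.
candidates = [(0x1A, 900), (0x1A, 20), (0x2B, 610),
              (0x2B, 35), (0x3C, 880), (0x3C, 5)]

crc_order = [off for _, off in sorted(candidates)]     # how a sorted CRC list is walked
offset_order = sorted(off for _, off in candidates)    # sorted by disk position

print(head_travel(crc_order))     # 3800
print(head_travel(offset_order))  # 895
```

Reading in offset order covers the span from the lowest to the highest offset exactly once, which is the least travel any ordering can achieve in this model.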
What is deduplication? by jdavidb · 2012-01-04 09:26 · Score: 5, Informative

I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

--
Secession is the right of all sentient beings.
Re:I've wanted deduplication for a long time! by Anonymous Coward · 2012-01-04 09:28 · Score: 2, Informative

Interesting link, but it doesn't look like Microsoft has actually released this yet and it is only slated to be released with Win 8 server, and it will come with some caveats.
FTFA:
"It is a server-only feature, like so many of Microsoft’s storage developments. But perhaps we might see it deployed in low-end or home servers in the future.
It is not supported on boot or system volumes.
Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. And of course the connected system must be running a server edition of Windows 8.
Although deduplication does not work with clustered shared volumes, it is supported in Hyper-V configurations that do not use CSV.
Finally, deduplication does not function on encrypted files, files with extended attributes, tiny (less than 64 kB) files, or re-parse points."
Re:Acronis by arth1 · 2012-01-04 09:52 · Score: 3, Informative

In Linux, I would avoid any backup system that doesn't support hard links, long paths, file attributes, file access control lists and SELinux contexts.
Some of the "offerings" out there are so Windows-centric that they can't even handle "Makefile" and "makefile" being in the same directory.
In Windows, I would require that it backs up and restores both the long and the short name of files, if short name support is enabled for the file system (default in Windows). Why? If "Microsoft Office" has a short name of MICROS~3 and it gets restored with MICROS~2, the apps won't work, because of registry entries using the short name.
I'd also look for one that can back up NTFS streams. Some apps store their registration keys in NTFS streams.
In all cases, Acronis does not measure up to what I require of a backup program. Also because the restore environment doesn't even work unless you have hardware compatible with its drivers. You may be able to back up, and even boot the restore environment, but not do an actual restore from it.
ArcServe is better - for Linux, it still lacks support for file attributes and the hardlink handling is rather peculiar during restore, but it at least handles SELinux and dedup of the backup.
An option for dedup on Linux file systems would be nice - the easiest implementation would be COW hardlinks. But like for Microsoft's new NTFS, you'd need something that scans the file system for duplicates. And it better have an attrib for do-not-dedup too, because of how expensive COW can be for large files, or to avoid file fragmentation for specific files.
Re:I've wanted deduplication for a long time! by anomaly256 · 2012-01-04 10:07 · Score: 4, Informative

Considering Linux does have this capability in a few FS drivers now (ok.. some more stable than others, sure) I think the GP should be modded troll rather than the post pointing out it's likely a shill... too bad I'm out of mod points
BackupPC by danpritts · 2012-01-04 10:14 · Score: 3, Informative

BackupPC is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably Windows) PCs, it's great with *nix.

http://backuppc.sourceforge.net/

Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.

It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've dd-copied, hook them up, and start to run restores.

The hard links mean lots and lots of disk head seeks, so you are doing random i/o on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSDs to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
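
File-level deduplication via hard links, as described above, can be sketched in a few lines. This is not BackupPC's actual pool layout or hashing scheme: `store_file`, the `pool` directory named after a SHA-256 digest, and the example paths are all assumptions for illustration. It also skips the byte-for-byte comparison a careful implementation would do when a digest already exists in the pool, and it needs a filesystem that supports hard links.

```python
import hashlib
import os
import tempfile


def store_file(pool_dir, dest_path, data):
    """Write data once into the pool, then hard-link it at dest_path."""
    digest = hashlib.sha256(data).hexdigest()
    pooled = os.path.join(pool_dir, digest)
    if not os.path.exists(pooled):
        with open(pooled, "wb") as f:
            f.write(data)
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    os.link(pooled, dest_path)  # a second name for the same inode
    return pooled


def demo():
    """Back up the same file from two hosts and report how it is stored."""
    with tempfile.TemporaryDirectory() as root:
        pool = os.path.join(root, "pool")
        os.makedirs(pool)
        a = os.path.join(root, "host1", "etc", "motd")
        b = os.path.join(root, "host2", "etc", "motd")
        store_file(pool, a, b"welcome\n")
        store_file(pool, b, b"welcome\n")
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        return same_inode, os.stat(a).st_nlink, len(os.listdir(pool))


print(demo())
```

Every backed-up path is just another directory entry for a pooled inode, which is why a file-by-file copy of the store has to track a very large number of links, and why copying the whole filesystem image is the approach the comment suggests.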
Re:Acronis by Anonymous Coward · 2012-01-04 10:41 · Score: 3, Informative

We check backups and run a test restore on each and every server every month (we had this rule before we started with Acronis).
Acronis is awful. Frankly, someone should open a fraud investigation. Acronis products have no business being sold as enterprise backup solutions. Fucking ntbackup is far more reliable.
Right NOW I am wrestling with Acronis Backup & Recovery 11's retarded cousin, Acronis Recovery for Microsoft Exchange. I seriously want to sue the sadistic and incompetent assholes at Acronis for all the pain and suffering they are causing.
Easy, use OpenIndiana or NexentaStor by Zemplar · 2012-01-04 10:42 · Score: 5, Informative

Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Re:Why don't you just hire a competent sysadmin? by Jappus · 2012-01-04 10:49 · Score: 3, Informative

Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
Yes, but in that case, two multi-GB files that share all of their data except one bit will not be deduplicated. The difference between your approach and Microsoft's is grounded in the same thought process that makes modern compression algorithms better than older ones:
First you treat all files separately, which is really simple but has the drawback of not cross-linking chances for compression/dedup across files. This is what deflate (ZIP/GZIP) and your approach to dedup do. The same data simply gets recompressed twice -- or in your case not deduplicated if the data is even marginally different. You will never reach the maximum space-saving that way, even though you can at least be sure to be reasonably fast.
Then you notice that files sharing most data should only be compressed/deduped once and then just linked together. The easiest way to do that is to cut up the files into blocks and compress them. If two blocks are the same, you don't recompress them but just put in a link to the previous compression. This is what (roughly speaking) BZIP2, RAR, ACE and some other formats do. In deduping terms this means creating multi-level hashes for each file. It works much better, but has the price of being more complex and time consuming.
Finally, you notice that cutting up files at fixed boundaries is also wasteful. If two blocks are the same, but one has all bytes shifted left one position, you needlessly waste space. Thus, you try to identify if you can dynamically cut up the files/stream into chunks that you have already compressed, plus a handful of spare bytes here or there or with a very simple substitution/transposition function applied. This is (extremely roughly speaking) what LZMA of the 7-Zip fame does and what Microsoft tries to do differently in their dedup approach.
Of course, going that way is even MORE complex and time consuming, but may be well worth it, if space-saving is what you're interested in. After all, there is no such thing as a free lunch -- you either pay with time or with space (or with general applicability in some corner cases).
So, all in all, the approach itself is not new -- neither yours nor Microsoft's -- but the magic lies in actually creating a working product out of the theoretical approach outlined above.
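
The difference between cutting at fixed boundaries and cutting at content-dependent boundaries can be seen in a small sketch. It makes no claim about how LZMA or Microsoft's dedup actually work: the function names are hypothetical, and the rolling sum over a 16-byte window is a deliberately simple stand-in for the stronger rolling hashes (Rabin fingerprints and the like) that real systems tend to use.

```python
import random


def fixed_chunks(data, size=64):
    """Cut data at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut wherever a rolling sum of the last `window` bytes matches `mask`."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        if i >= window - 1 and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


def shared(chunker, a, b):
    """Number of distinct chunks of `a` that also occur in `b`."""
    return len(set(chunker(a)) & set(chunker(b)))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + original  # same content, moved by one byte

print("fixed:", shared(fixed_chunks, original, shifted))
print("content-defined:", shared(content_defined_chunks, original, shifted),
      "of", len(set(content_defined_chunks(original))))
```

Because each cut point depends only on the preceding `window` bytes, inserting a byte at the front can disturb at most the first chunk; every later chunk of the original reappears unchanged in the shifted copy. With fixed offsets, every boundary moves and the chunks no longer line up.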