Ask Slashdot: Free/Open Deduplication Software?

Posted by timothy on Wednesday January 4, 2012 @08:54AM from the the-dept-dept-the-from-dept-from-from dept.

First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though there's some active discussion.

8 of 306 comments

Re:FreeBSD by TheRaven64 · 2012-01-04 09:21 · Score: 5, Informative

As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.

--
I am TheRaven on Soylent News
No dedup in FreeNAS by Svenne · 2012-01-04 09:23 · Score: 5, Informative

However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.

--

Slagborr
What is deduplication? by jdavidb · 2012-01-04 09:26 · Score: 5, Informative

I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

--
Secession is the right of all sentient beings.
Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 09:46 · Score: 5, Informative

All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vice versa.
Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data at least once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32 bit CRC for each block. You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.
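A minimal sketch of the scan-then-verify idea above, assuming the blocks are held as in-memory byte strings and using zlib.crc32 as the 32 bit CRC. The function name find_duplicate_blocks and the sample data are illustrative only, not taken from any real filesystem.

```python
import zlib
from collections import defaultdict


def find_duplicate_blocks(blocks):
    """Map each duplicate block index to the index of an identical earlier block."""
    # Pass 1: group block indices by their 32-bit CRC (candidate matches only).
    by_crc = defaultdict(list)
    for index, data in enumerate(blocks):
        by_crc[zlib.crc32(data)].append(index)

    # Pass 2: re-read candidates and compare bytes, since equal CRCs
    # may be a collision rather than a true duplicate.
    duplicates = {}
    for indices in by_crc.values():
        distinct = []  # verified-distinct blocks that share this CRC
        for index in indices:
            for keeper in distinct:
                if blocks[index] == blocks[keeper]:
                    duplicates[index] = keeper
                    break
            else:
                distinct.append(index)
    return duplicates


blocks = [b"aaaa", b"bbbb", b"aaaa", b"cccc", b"bbbb"]
result = find_duplicate_blocks(blocks)
print(result)  # {2: 0, 4: 1}
```

Note that the by_crc table grows with the number of blocks, which is exactly the unbounded memory use the next paragraphs are about.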
The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.
To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.
The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest recorded CRC (and no longer records CRCs greater than or equal to that value) to make room. Once the pass is over, another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.
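A rough sketch of the bounded-memory, multi-pass scan described above. This is not HAMMER's actual code: the names bounded_dedup_passes and max_entries are hypothetical, the "meta-data" is just a Python list of per-block CRCs, and the bound applies to the number of distinct CRC values tracked rather than to bytes of memory.

```python
CRC_SPACE_END = 1 << 32  # one past the largest 32-bit CRC


def bounded_dedup_passes(crcs, max_entries):
    """Find CRC-matching block groups while tracking at most max_entries CRCs.

    Returns (groups, passes): groups maps a CRC to the block indices sharing it
    (candidates that still need byte-for-byte verification), passes is the
    number of scans over the meta-data that were needed.
    """
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    groups = {}
    passes = 0
    low = 0
    while low < CRC_SPACE_END:
        passes += 1
        high = CRC_SPACE_END  # exclusive upper bound; shrinks when the table fills
        table = {}
        for index, crc in enumerate(crcs):
            if not (low <= crc < high):
                continue
            table.setdefault(crc, []).append(index)
            if len(table) > max_entries:
                # Table overflowed: drop the highest CRC recorded so far and
                # stop recording anything at or above it during this pass.
                high = max(table)
                del table[high]
        for crc, indices in table.items():
            if len(indices) > 1:
                groups[crc] = indices
        low = high  # the next pass covers the range this pass gave up on
    return groups, passes


crcs = [5, 1, 9, 5, 3, 9, 7, 1]
groups, passes = bounded_dedup_passes(crcs, max_entries=2)
```

With a large enough max_entries the loop finishes in a single pass; shrinking it trades memory for extra scans, which is the I/O versus memory trade-off from the start of the comment.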
Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.
This can be done with any sized CRC but what you cannot do is avoid the verification pass... no matter how big your CRC is or your SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exact duplicates, before you de-dup their block references. A larger CRC is preferable to reduce collisions but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.
In any case, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.
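To put rough numbers on the CRC-size trade-off, here is a back-of-the-envelope estimate of how many false candidate pairs a check value of a given width produces. It assumes all blocks are distinct and that check values are uniformly distributed, which real CRCs only approximate; expected_false_matches is a made-up helper name.

```python
def expected_false_matches(num_blocks: int, hash_bits: int) -> float:
    """Expected number of block pairs that share a check value by chance."""
    pairs = num_blocks * (num_blocks - 1) / 2
    return pairs / 2 ** hash_bits


# One billion distinct blocks: a 32-bit CRC yields on the order of 10^8
# colliding pairs to re-read and reject, a 64-bit CRC well under one.
for bits in (32, 64):
    print(bits, expected_false_matches(10**9, bits))
```

Either way the collisions only cost extra verification reads; they do not remove the need for the verification pass itself.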
-Matt
Re:OpenSolaris but not FreeBSD? by Anonymous Coward · 2012-01-04 09:57 · Score: 4, Informative

People considering either dedup or compression on FreeBSD should be made blatantly aware of one of the issues which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters, etc.). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb is a big improvement, but the stalls are still easily noticeable.
My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.
Here's confirmation and reading material for those who think my statements are bogus. The problem:
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html
And how OpenIndiana/Illumos solved it:
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html
Re:I've wanted deduplication for a long time! by anomaly256 · 2012-01-04 10:07 · Score: 4, Informative

Considering Linux does have this capability in a few FS drivers now (ok.. some more stable than others, sure) I think the GP should be modded troll rather than the post pointing out it's likely a shill... too bad I'm out of mod points
Easy, use OpenIndiana or NexentaStor by Zemplar · 2012-01-04 10:42 · Score: 5, Informative

Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 11:15 · Score: 4, Informative

Well, I can tell you why the option is there... it's not because of collisions, it's there to handle the case where there is a huge amount of actual duplication where the blocks would verify as perfect matches. In this case the de-duplication pass winds up having to read a lot of bulk-data to validate that the matches are, in fact, perfect, which can take a lot of time versus only having to read the meta-data.
Just on principle I think it's a bad idea to just trust a checksum, cryptographic hash, CRC, or whatever. Corruption is always an issue... even if the filesystem code itself is perfect and even if the disk subsystem is perfect there is so much code running in a single address space (i.e. the KERNEL itself) that it is possible to corrupt a filesystem just from hitting unrelated bugs in the kernel.
Not to mention radiation flipping a bit somewhere in the cpu or memory (even for ECC memory it is possible to get corruption, but the more likely case is in the billions of transistors making up a modern cpu, even with parity on the L1/L2/L3 caches).
Hell, I don't even trust IP's stupid simple 1's complement checksum in HAMMER's mirroring protocols. Once during my BEST Internet days we had a T3 which bugged out certain bit patterns in a way that actually got past the IP checksum... we only tracked it down because SSH caught it in its stream and screamed bloody murder.
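As an illustration of how weak that checksum is, the sketch below implements the 16-bit one's complement sum in the style of RFC 1071 and shows one well-known blind spot: swapping two 16-bit words leaves the checksum unchanged. This is only an example of the class of error it misses; the actual fault on that T3 is not described here, and the sample byte strings are made up.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd\x00\x01"
reordered = b"\xab\xcd\x12\x34\x00\x01"  # first two 16-bit words swapped

# Addition is commutative, so the one's complement sum cannot see the swap.
same_sum = internet_checksum(original) == internet_checksum(reordered)
# A CRC depends on bit positions and does catch this particular change.
same_crc = zlib.crc32(original) == zlib.crc32(reordered)
print(same_sum, same_crc)  # True False
```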
If you de-duplicate trusting the meta-data hash, even a big one, what you can end up doing is turning 9 good and 1 corrupted copies of a file into 10 de-duped corrupted copies of the file.
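A toy model of that failure mode, assuming the check value was recorded in meta-data at write time and one copy was corrupted afterwards, so its stored CRC still matches the good copies. The dedup function, its verify flag, and the sample data are all hypothetical.

```python
import zlib


def dedup(blocks, stored_crcs, verify):
    """Return, for each block, the index of the block it will reference."""
    keepers = {}  # stored CRC -> indices of blocks kept for that CRC
    refs = []
    for index, (data, crc) in enumerate(zip(blocks, stored_crcs)):
        target = index
        for keeper in keepers.setdefault(crc, []):
            # Without verification, a matching stored CRC is taken on trust.
            if not verify or blocks[keeper] == data:
                target = keeper
                break
        if target == index:
            keepers[crc].append(index)
        refs.append(target)
    return refs


good = b"important data"
corrupted = b"important dat\x00"          # damaged after its CRC was recorded
blocks = [corrupted] + [good] * 9         # 1 corrupted and 9 good copies
stored_crcs = [zlib.crc32(good)] * 10     # meta-data still says all ten match

trusting = dedup(blocks, stored_crcs, verify=False)
verified = dedup(blocks, stored_crcs, verify=True)
print(trusting)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(verified)  # [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

In the unverified run every reference ends up pointing at the corrupted block; with verification the nine good copies collapse onto one good block and the corrupted one is left alone.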
I'm sure there are many data stores that just won't care if that happens every once in a while. Google's crawlers probably wouldn't care at all, so there is definitely a use for unverified checks like this. I don't plan on using a cryptographic hash as large as the one ZFS uses any time soon, but if the goal is to de-dup optimally with 99.9999999999% accuracy, there is a reasonable argument for having one that big.
-Matt