Ask Slashdot: Free/Open Deduplication Software?

← Back to Stories (view on slashdot.org)

Ask Slashdot: Free/Open Deduplication Software?

Posted by timothy on Wednesday January 4, 2012 @08:54AM from the the-dept-dept-the-from-dept-from-from dept.

First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come across cheap — whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was there first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What are the slashdotters favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though, although there's some active discussion.

10 of 306 comments (clear)

Min score:

Reason:

Sort:

Dragonfly BSD's HAMMER... by Anonymous Coward · 2012-01-04 08:57 · Score: 5, Interesting

...includes dedupe.
There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
Re:I've wanted deduplication for a long time! by lucm · 2012-01-04 09:08 · Score: 5, Insightful

And now, even the next version of Windows Server will contain integrated data deduplication technology! [...] The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.
Well ask anyone who lost documents on DoubleSpace volumes or got corrupted media files on Windows Home Server and they will tell you that even if Microsoft Research says so, it's not something I would put on my production servers any time soon.

--
lucm, indeed.
Re:I've wanted deduplication for a long time! by Ironchew · 2012-01-04 09:09 · Score: 5, Funny

When will Adblock Plus block these Microsoft ads on Slashdot?
BackupPC by Anonymous Coward · 2012-01-04 09:11 · Score: 5, Interesting

Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.
Re:FreeBSD by TheRaven64 · 2012-01-04 09:21 · Score: 5, Informative

As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.

--
I am TheRaven on Soylent News
No dedup in FreeNAS by Svenne · 2012-01-04 09:23 · Score: 5, Informative

However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.

--

Slagborr
What is deduplication? by jdavidb · 2012-01-04 09:26 · Score: 5, Informative

I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

--
Secession is the right of all sentient beings.
Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 09:46 · Score: 5, Informative

All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vise-versa.
Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data at least once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32 bit CRC for each block. You then [re]read the blocks whos CRCs match to determine if the CRC found a matching block or simply had a collision.
The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.
To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.
The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= to that value) to make room. Once the pass is over another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.
Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.
This can be done with any sized CRC but what you cannot do is avoid the verification pass.. no matter how big your CRC is or your SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exactl duplicates, before you de-dup their block references. A larger CRC is preferable to reduce collisions but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.
In anycase, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.
-Matt
Easy, use OpenIndiana or NexentaStor by Zemplar · 2012-01-04 10:42 · Score: 5, Informative

Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Re:I've wanted deduplication for a long time! by jimicus · 2012-01-04 10:44 · Score: 5, Interesting

Also, perhaps the reason that Linux does not support file-system compression on the fly is because it's a horrid idea, and should never actually be used?
Ah, the "Terrible idea" objection.
This is a common objection to implementing ideas on Linux - so common, in fact, that it's successfully held Linux back for at least ten years.
Multi-master LDAP replication? Terrible idea. Remained terrible for several years after literally every commercial LDAP server on the planet supported multi-master replication, only became non-harmful when OpenLDAP started to support it in version 2.4.
Active Directory support? Such a terrible idea that it's held Samba development back by at least five years. Even now, where Windows Vista deprecates NT4-style policies and 7 deprecates NT 4 domain support altogether (which is about all you get from Samba 3); Samba 4 is considered alpha software.
Some sort of centralised work-together system that integrates email, address book, calendars, task-list? Terrible idea. So much so that Exchange (despite being way too complicated for its own good) is still an extremely popular email solution and the closest you can get to a viable F/OSS alternative either requires your users to completely re-think how they collaborate (yuck) or buy the commercial version simply because the free version lacks vital features.
Free clue to all naysayers who work on F/OSS projects: If you spent as long trying to think of ways to make something work as you do thinking of objections to existing implementations and explaining how you're right and everyone else is wrong, you wouldn't be ten years behind the times.