Ask Slashdot: Free/Open Deduplication Software?

← Back to Stories (view on slashdot.org)

Ask Slashdot: Free/Open Deduplication Software?

Posted by timothy on Wednesday January 4, 2012 @08:54AM from the the-dept-dept-the-from-dept-from-from dept.

First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come across cheap — whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was there first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What are the slashdotters favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though, although there's some active discussion.

19 of 306 comments (clear)

Min score:

Reason:

Sort:

I've wanted deduplication for a long time! by GPLJonas · 2012-01-04 08:54 · Score: 4, Interesting

And now, even the next version of Windows Server will contain integrated data deduplication technology! So Linux devs better get working on similar features. I still cannot figure out how NTFS can support compressing files and folders but Linux cannot.

That deduplication for NTFS is really interesting, actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.

Some technical details about the deduplication process:

Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.

After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self describing, so a deduplication NTFS volume can be read by another server without any external data.

There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.

Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.

New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all. So when are we going to see Linux equivalents? Because Linux is getting behind on the new technologies.
1. Re:I've wanted deduplication for a long time! by nanoflower · 2012-01-04 08:59 · Score: 4, Insightful
  
  The most likely answer is when some one is willing to pay for it. What you have described above isn't a trivial effort and it's unlikely someone is going to do the work for free so it will have to wait until someone is willing to pay for the development. Even then it's likely that the developer may keep it closed source in order to recoup the investment.
2. Re:I've wanted deduplication for a long time! by lucm · 2012-01-04 09:08 · Score: 5, Insightful
  
  And now, even the next version of Windows Server will contain integrated data deduplication technology! [...] The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.
  Well ask anyone who lost documents on DoubleSpace volumes or got corrupted media files on Windows Home Server and they will tell you that even if Microsoft Research says so, it's not something I would put on my production servers any time soon.
  
  --
  lucm, indeed.
3. Re:I've wanted deduplication for a long time! by Ironchew · 2012-01-04 09:09 · Score: 5, Funny
  
  When will Adblock Plus block these Microsoft ads on Slashdot?
4. Re:I've wanted deduplication for a long time! by Anonymous Coward · 2012-01-04 09:34 · Score: 4, Interesting
  
  I have often wondered why someone doesn't use the rsync algorithm as a basis for this kind of chunking and deduplication. I imagine a FUSE-based filesystem that breaks the application-level files into checksummed pieces and stores both the file fragments and file descriptions into an underlying filesystem. Then it could reconstruct the application-level files on demand, using the description to draw out the right fragments.
  From an academic point of view, it already solves the same problem and just needs some repackaging. It breaks arbitrary data into phrases to be identified by checksum and located in another existing corpus of data. It just needs a metadata model to record the structure of the file as composed of these canonical phrases, rather than performing the actual file reconstruction immediately as rsync does now.
  From my cynical point of view, I realize someone may have patented the repackaging, in the same way that Apple seems to think they can re-patent every idea "on a smartphone".
5. Re:I've wanted deduplication for a long time! by anomaly256 · 2012-01-04 10:07 · Score: 4, Informative
  
  Considering Linux does have this capability in a few FS drivers now (ok.. some more stable than others, sure) I think the GP should be modded troll rather than the post pointing out it's likely a shill... too bad i'm out of mod points
6. Re:I've wanted deduplication for a long time! by jimicus · 2012-01-04 10:44 · Score: 5, Interesting
  
  Also, perhaps the reason that Linux does not support file-system compression on the fly is because it's a horrid idea, and should never actually be used?
  Ah, the "Terrible idea" objection.
  This is a common objection to implementing ideas on Linux - so common, in fact, that it's successfully held Linux back for at least ten years.
  Multi-master LDAP replication? Terrible idea. Remained terrible for several years after literally every commercial LDAP server on the planet supported multi-master replication, only became non-harmful when OpenLDAP started to support it in version 2.4.
  Active Directory support? Such a terrible idea that it's held Samba development back by at least five years. Even now, where Windows Vista deprecates NT4-style policies and 7 deprecates NT 4 domain support altogether (which is about all you get from Samba 3); Samba 4 is considered alpha software.
  Some sort of centralised work-together system that integrates email, address book, calendars, task-list? Terrible idea. So much so that Exchange (despite being way too complicated for its own good) is still an extremely popular email solution and the closest you can get to a viable F/OSS alternative either requires your users to completely re-think how they collaborate (yuck) or buy the commercial version simply because the free version lacks vital features.
  Free clue to all naysayers who work on F/OSS projects: If you spent as long trying to think of ways to make something work as you do thinking of objections to existing implementations and explaining how you're right and everyone else is wrong, you wouldn't be ten years behind the times.
Dragonfly BSD's HAMMER... by Anonymous Coward · 2012-01-04 08:57 · Score: 5, Interesting

...includes dedupe.
There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
1. Re:Dragonfly BSD's HAMMER... by TheRaven64 · 2012-01-04 10:28 · Score: 4, Funny
  
  No, that's the opposite of deduplication...
  
  --
  I am TheRaven on Soylent News
BackupPC by Anonymous Coward · 2012-01-04 09:11 · Score: 5, Interesting

Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.
Re:FreeBSD by TheRaven64 · 2012-01-04 09:21 · Score: 5, Informative

As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.

--
I am TheRaven on Soylent News
No dedup in FreeNAS by Svenne · 2012-01-04 09:23 · Score: 5, Informative

However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.

--

Slagborr
What is deduplication? by jdavidb · 2012-01-04 09:26 · Score: 5, Informative

I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

--
Secession is the right of all sentient beings.
Re:Just store everything in /dev/null by shish · 2012-01-04 09:34 · Score: 4, Funny

$ cat todo.txt > /dev/null $ md5sum /dev/null d41d8cd98f00b204e9800998ecf8427e /dev/null $ cat aaah.png > /dev/null $ md5sum /dev/null d41d8cd98f00b204e9800998ecf8427e /dev/null -
Duplicates!

--
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 09:46 · Score: 5, Informative

All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vise-versa.
Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data at least once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32 bit CRC for each block. You then [re]read the blocks whos CRCs match to determine if the CRC found a matching block or simply had a collision.
The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.
To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.
The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= to that value) to make room. Once the pass is over another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.
Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.
This can be done with any sized CRC but what you cannot do is avoid the verification pass.. no matter how big your CRC is or your SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exactl duplicates, before you de-dup their block references. A larger CRC is preferable to reduce collisions but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.
In anycase, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.
-Matt
Re:OpenSolaris but not FreeBSD? by Anonymous Coward · 2012-01-04 09:57 · Score: 4, Informative

People considering either dedup or compression on FreeBSD should be made blatantly aware of one of the issues which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters, etc.). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb has a big improvement, but it's still easily noticeable.
My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.
Here's confirmation and reading material for those who think my statements are bogus. The problem:
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html
And how OpenIndiana/Illumos solved it:
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html
Re:OpenSolaris but not FreeBSD? by TheRaven64 · 2012-01-04 10:06 · Score: 4, Interesting

I've got 8GB of RAM in the machine - RAM is so cheap now that it didn't seem worth skimping. It's a 1.6GHz AMD Fusion system. Over GigE, I was getting 40MB/s writes to the deduplicated filesystem, with the load on one core about 100%.
ZFS definitely likes RAM. I'm not sure what the minimum requirements are, but the general recommendation is 'as much as you can afford'. I think 8GB of SO-DIMMS for the mini-ITX board cost about £40, and maxed out its memory, so that was a pretty obvious choice. I'm not sure what happens when the deduplication tables don't fit into RAM, whether it degrades performance or degrades deduplication efficiency. Having 8GB means that a lot of the time it can satisfy reads from RAM.
I'm using it over WiFi 99% of the time, so I'm not too bothered about the performance: it can easily saturate the WiFi link without any problems. . The compression ratio is 1.11x. ZFS only shows the deduplication ratio for the entire pool, not for individual filesystems. That's currently 1.06x for my system, but that's with 1.43TB of data in total, only 266GB on the deduplicated filesystem, so that means that it's saving about a third. Roughly speaking, the extra space used by RAID-Z and the space saved by dedup seem to balance each other out, so (on my backup filesystem) I am using 1GB of hard disk space for every 1GB of data, and still have redundancy so one disk out of the three can fail without losing any data.
Time Machine on OS X does clever things like make a new copy of a 10GB file if 1MB of it has changed, and the deduplication on the NAS translates to a huge space saving for that. For things like DV footage, I don't bother with the dedup.

--
I am TheRaven on Soylent News
Easy, use OpenIndiana or NexentaStor by Zemplar · 2012-01-04 10:42 · Score: 5, Informative

Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Re:Dedup is just a marketing word.... by m.dillon · 2012-01-04 11:15 · Score: 4, Informative

Well, I can tell you why the option is there... it's not because of collisions, it's there to handle the case where there is a huge amount of actual duplication where the blocks would verify as perfect matches. In this case the de-duplication pass winds up having to read a lot of bulk-data to validate that the matches are, in fact, perfect, which can take a lot of time verses only having to read the meta-data.
Just on principle I think it's a bad idea to just trust a checksum, cryptographic hash, CRC, or whatever. Corruption is always an issue... even if the filesystem code itself is perfect and even if the disk subsystem is perfect there is so much code running in a single address space (i.e. the KERNEL itself) that it is possible to corrupt a filesystem just from hitting unrelated bugs in the kernel.
Not to mention radiation flipping a bit somewhere in the cpu or memory (even for ECC memory it is possible to get corruption, but the more likely case is in the billions of transistors making up a modern cpu, even with parity on the L1/L2/L3 caches).
Hell, I don't even trust IP's stupid simple 1's complement checksum in HAMMER's mirroring protocols. Once during my BEST Internet days we had a T3 which bugged out certain bit patterns in a way that actually got past the IP checksum... we only tracked it down because SSH caught it in its stream and screamed bloody murder.
If you de-duplicate trusting the meta-data hash, even a big one, what you can end up doing is turning 9 good and 1 corrupted copies of a file into 10 de-duped corrupted copies of the file.
I'm sure there are many data stores that just won't care if that happens every once in a while. Google's crawlers probably wouldn't care at all, so there is definitely a use for unverified checks like this. I don't plan on using a cryptographic hash as large as the one ZFS uses any time soon but being able to optimally de-dup with 99.9999999999% accuracy it's a reasonable argument to have one that big.
-Matt