Ask Slashdot: Free/Open Deduplication Software?
First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though there's some active discussion.
That deduplication for NTFS is really interesting, actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.
Some technical details about the deduplication process:
Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.
After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self describing, so a deduplicated NTFS volume can be read by another server without any external data.
There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.
Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.
New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all. So when are we going to see Linux equivalents? Because Linux is getting behind on the new technologies.
...includes dedupe.
There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
Combines a bright future and ZFS
changelog == stagnated
...ever heard of it?
Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.
These posts express my own personal views, not those of my employer
FreeBSD (ZFS) or DragonFly BSD's HAMMER FS perhaps?
I do believe FreeBSD also supports ZFS.
ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems, who sell expensive NAS and SAN systems so they have an incentive to keep it improving.
I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.
I am TheRaven on Soylent News
Why don't you look at Nexenta? That's all the goodness of OpenIndiana, ZFS and dedupe with the backing of a strong commercial organization.
You could run a nightly script to find duplicates and then deduplicate them.
An example of this would be to find all new files since the last run, checksum them, and compare the results to the checksums of all previously examined files. Once you find the likely duplicates you can decide how careful you want to be about verifying identity of both data and metadata. For example, do you want to preserve the attribute dates if the data is identical? For some programs the creation dates matter. Likewise file permissions might be different.
Once you find a duplicate, just erase one copy and create a hard link to the other if it's on the same filesystem or, if you dare, a soft link to the file on another filesystem.
This is not hard, and it very closely approximates what NetApp and Apple Time Machine do.
Alternatively, if what you are really trying to do is not eliminate duplicates in an active file system but merely keep snapshots for backup, then the problem is much simpler.
on BSD unix:
mv snapshot oldsnapshot
mkdir snapshot
cd snapshot
find -d ../oldsnapshot | cpio -dpl - ./
rsync -aE source/
where ../oldsnapshot is the old backup of your data
snapshot is the new backup
source is the thing you want to backup.
voila you have an endless set of snapshots in which no file is ever duplicated. However, metadata like file ownership and unix flags are not preserved in the old snapshots.
Some drink at the fountain of knowledge. Others just gargle.
FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.
Trolling is a art,
I've used LessFS. On my "server" powered by an Intel Atom, it is very very slow. It writes at about 5MB/sec, even when everything is inside a ram disk.
You can't use a block size of 4KB, otherwise write speeds are around 256KB/sec, need to use at least 64KB.
Look, no duplicates!
Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.
Active developer and Open Source:
http://www.lessfs.com/wordpress/
Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.
If you're looking for a free backup product that supports deduplication, look at BackupPC. Powerful and complex, but free. I've used it for years with good results.
http://backuppc.sourceforge.net/info.html
Isn't that going to eventually delete every file?
So you want dragonfly BSD with a hammer filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.
You should be aware that ZFS and most block based deduplication products don't handle streams from NetBackup very well. This is due to block size variability, which invalidates the block by block deduplication algorithms. Information on this issue is readily available if you search for it.
For ZFS options, you can try the following.
-Sun/Oracle sells a ZFS storage array called the Amber Road.
-Check out Nexenta. They provide a supported version of OpenSolaris complete with high availability options if you so desire.
-You can also check out Open Indiana, a fork of OpenSolaris.
Why not just do the following every once in a while:
1. go through all your files,
2. for each file, compute a checksum (e.g. using the unix tools md5sum or sha1sum),
3. for pairs of files giving the same checksum, compare them (optionally) and if equal remove one of them and make it a hard-link to the other.
It would surprise me if there was no free open-source script doing exactly this.
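The three steps above can be sketched in a few lines of Python. This is only an illustration, not any existing tool: the function names are made up for the example, SHA-256 stands in for md5sum/sha1sum, and it assumes everything under the root lives on one filesystem that supports hard links.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def same_content(a, b, chunk_size=1 << 20):
    """Byte-by-byte comparison, the optional verification step."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(chunk_size), fb.read(chunk_size)
            if ba != bb:
                return False
            if not ba:
                return True


def dedup_tree(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced.
    """
    seen = {}  # digest -> path of the first file seen with that content
    replaced = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            keeper = seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first sighting, or already linked
            if not same_content(keeper, path):
                continue  # hash collision: leave both files alone
            # Link under a temporary name, then rename over the duplicate,
            # so the duplicate never goes missing if something fails midway.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Keep in mind that hard-linked files share one inode, so ownership, permissions and timestamps are merged too, and editing one name in place changes them all.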
If Pandora's box is destined to be opened, *I* want to be the one to open it.
However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.
Slagborr
It needs an incredible amount of memory to operate effectively.
from my university notes:
5TB data, average blocksize 64K = 78125000 blocks
for each block the dedup needs 320 bytes so
78125000 x 320 byte = 25 GB dedup table
use compression instead. (eg zfs compression)
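The arithmetic in those notes is easy to replay. A small sketch, taking the 320 bytes per entry from the notes at face value and using decimal units (1 TB = 10**12 bytes); the function name is just for illustration:

```python
def dedup_table_bytes(pool_bytes, avg_block_bytes, entry_bytes=320):
    """Estimate dedup table size: one table entry per stored block."""
    blocks = pool_bytes // avg_block_bytes
    return blocks, blocks * entry_bytes


# The figures from the notes: 5 TB of data, 64K average block size.
blocks, table = dedup_table_bytes(5 * 10**12, 64 * 10**3)
print(blocks)         # 78125000
print(table / 10**9)  # 25.0
```

The estimate scales inversely with block size: the same 5 TB with a 4K average block would come out at 1.25 billion blocks and roughly 400 GB of table under the same per-entry assumption.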
Patrick Zevo: Are you taking my [de]duplication investigation seriously or are you disrespecting my [de]duplication investigation?
(LL Cool J to Robin Wright in the 1992 movie Toys)
I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.
Secession is the right of all sentient beings.
The change log shows an update two months ago, how's that stagnant?
Because the original date on the post is from April 2010?
They modified a previous post to put all of the changes in one place, dufus.
Duplicity. It's written in Python with various back-end storage types (Disk, FTP, SFTP, Amazon S3, etc) It's especially useful for backups.
I have recently written some personal software that finds duplicate files/directories. I had no idea there was such a demand for something like this (and price for such software).
There are a few hurdles with deduplication that a piece of software will likely need your input for:
- Does something like the directory/file names matter? If so, does case matter? What about comparing names from ASCII Unicode? Do you compare the DOS 8.3 names also?
- What do you want to do when you find duplicates? Delete the duplicates and put links in their place? Which one do you want the links to point to? Maybe remove the duplicates... so which one do you keep?
- Does meta data need compared too? What about the data itself? Different file systems have different gotchas. NTFS can have separate streams, which do not necessarily follow a file. If your file system supports snapshots, what are you really wanting to compare?
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
find . -type f -exec md5sum {} + | sort
...and so on
http://michaelsmith.id.au
OpenSolaris is Dead. Long live http://openindiana.org//
...includes dedupe. There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
Oracle has more or less killed OpenSolaris.
ALL Solaris, ZFS, DTrace and Zones engineering talent *and* Solaris community energy is now focused on Illumos.
More from highly entertaining ex-Sun engineer Bryan Cantrill in this talk:
http://www.youtube.com/watch?v=-zRN7XLCRhc
(funny and informative for those of you looking for some Oracle-bashing and Solaris history)
Look into the ZFS On Linux project (NOT fuse). It has native performance, and it supports dedup. I've been running it on a small office server for 4 months now without issue, though I have not used the dedup feature due to its RAM requirements. See http://zfsonlinux.org/
The original post doesn't quite describe the application well enough to make any absolute judgements, but I doubt Data Deduplication is going to help much.
I've had experience using ZFS with its data deduplication. If you're using ZFS across your operation, it might make sense to have an alternate ZFS server be a "backup" of your primary store, but if ZFS is not your thing, you're headed down the wrong path. Deduplication is somewhat computationally expensive to implement and locks your data into the target format. Changing your application to fit ZFS as the target is an expense you'll have to factor in. (An expense likely as big as the disk space you believe you will be saving.)
OTOH - disks keep getting bigger and cheaper. You may not be able to count on OpenSolaris or even FreeBSD, but you can certainly count on WesternDigital/Seagate to continue to offer solutions at reasonable cost/byte.
Yes, but the OP never said anything about keeping files that weren't duplicates.
,fine. That one actually works, believe it or not...
O.K
ZFS dedup has a serious impact on write performance if you enable it.
Just buy more disk.
FreeBSD is certainly alive and well with an actively developed, real ZFS implementation.
Keep in mind however, ZFS with deduplication is not going to be a 'high performance' file system. You aren't going to use it for anything where latency is of any importance at all. It makes a pretty shitty VMware storage pool for instance, even with SSDs for log devices.
It's silly to consider anything else. You never actually wanted to run the current bastard child previously known as Solaris. It's actually proof that open sourcing a project can make it suck more than beforehand. Linux is out since its silly license is so paranoid about someone using their precious code that they can't use anyone else's without arguing with each other about legality rather than actually doing something useful. All you're left with is FreeBSD.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
http://www.synctus.com/ddar/
SuperREP: huge-dictionary LZ77 preprocessor.
It de-dupes and preprocesses the data, which helps create zip/gz/bz archives that are tighter, and faster to produce, than without it.
http://freearc.org/research/SREP.aspx
I'm utilizing Ubuntu 11.10 Server with ZFS right now and have compression and dedup turned on.
Works great... had to download ZFS as it didn't come preloaded.
Both deduplication and conventional compression are a questionable idea on a file server which has many clients.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.
Yeah right. I'll wait for real usage numbers rather than "the vendor selling this stuff to me says it's fucking awesome".
I just use rsync from the command line to do deduplication. Been working like a charm for years.
First I sync from the remote directory to a local base directory:
rsync --partial -z -vlhprtogH --delete root@www.mydomain.net:/etc/ /backup/server/www/etc/base/
Then I sync that to the daily backup. Files that have not changed are hard-linked between all the days that share them. It's very efficient and simple, and retrieving files is as simple as doing a directory search.
rsync -vlhprtogH --delete --link-dest=/backup/server/www/etc/base/ /backup/server/www/etc/base/ /backup/server/www/etc/2012-01-04
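For anyone wondering what the hard-linking step amounts to, here is a rough Python imitation. It is not how rsync is implemented; the function name and directory layout are invented for the example, and "unchanged" is judged only by size and modification time, with no content comparison.

```python
import os
import shutil


def snapshot(source, previous, target):
    """Copy source into target, hard-linking files unchanged since previous.

    A file counts as unchanged when its size and modification time match
    the copy in the previous snapshot. Returns (linked, copied) counts.
    """
    linked = copied = 0
    for dirpath, _dirs, files in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            old = os.path.join(previous, rel, name)
            new = os.path.join(target, rel, name)
            src_stat = os.stat(src)
            if os.path.isfile(old):
                old_stat = os.stat(old)
                if (old_stat.st_size == src_stat.st_size
                        and int(old_stat.st_mtime) == int(src_stat.st_mtime)):
                    os.link(old, new)  # share the inode with the old snapshot
                    linked += 1
                    continue
            shutil.copy2(src, new)  # copy2 keeps the modification time
            copied += 1
    return linked, copied
```

Each dated directory then looks like a full copy, while unchanged files occupy disk space only once across all the days that share them.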
-Dave
Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
I mean, off the top of my head, use "find" to get today's files, "md5sum" to MD5 them, "grep" or "gawk" to check your flat file of existing sums, "ln" to link dups to a single copy, and a couple dozen lines of shell to glue it all together and append new sums.
You could probably code it all up in busybox in an afternoon, and run it from cron every hour.
But you'd have to have a clue. If you don't have a clue, you can always buy a package. Yay freedom of choice!
If you're exceptionally clueful you can do this with inotify hooks and existing file checksums.... and "say it doesn't affect performance almost at all" if that sort of linguistic double-dutch appeals to you.
if you want to have backups using deduplication (i.e. 100 identical boxes only get saved once ) you can use bacula.
more about that here : http://www.bacula.org/en/dev-manual/main/main/File_Deduplication_using_Ba.html
Not sure how much development it gets these days, but BackupPC looked reasonable, though it wasn't bomb-proof enough for the systems I looked at it for. It primarily uses rsync to copy files (changes only) but then "deduplicates" the rsyncs. It was actually "deduplication" via non-lossy compression, but in reality they are similar technologies: they both look for repeated data, make a record of the repetition and what is repeated, and remove the repetition. It doesn't shrink your primary storage at all; it's purely about duplicated sets of data.
Nexenta plus Napp-IT and lots of RAM will do the trick for 16 TB or less. Been using it for almost two years and never had a problem.
backuppc is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably windows) PCs, it's great with *nix.
http://backuppc.sourceforge.net/
Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.
It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've dd-copied, hook them up, and start to run restores.
The hard links mean lots and lots of disk head seeks, so you are doing random i/o on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSD's to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
I've been seeing things like this a lot lately, and it's secret communication, much like how secret agents get orders from codes in advertisements, etc. I've nearly cracked the code. I can't say much now, but I can tell you that these are very nefarious people up to no good. They are very powerful and not your typical slashdot commenter.
With each copy, you add to the security.
If you want to deduplicate, simply stop making backups.
Because that’s what it results in in the end anyway.
Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Try Squashfs which creates deduplicated and compressed filesystem archives (http://www.linux-mag.com/id/7357/ for a good journal article).
If you're using Ubuntu, Debian, or Fedora, Squashfs will already be built into your distro kernel, and the squashfs-tools will also already be available in your distro repository.
http://sourceforge.net/projects/staticfiledups/
If you have a large collection of files, particularly files that do not change often, this set of scripts can _help_ you manage them. It stores a database of files (currently as a flat text file), so that they're only hashed once, and will provide you a list of duplicate files on asking. There's a script to check each file in a group of duplicates and link them together, and a script to remove any number of duplicates from a group, ensuring that there is at _least_ one copy remaining somewhere. There's another script to help you determine which file to keep, based on a simple (complex?) rule-set of what directories are more important than which other directories.
Should you wish, it'll also tell you which directories have the greatest number of duplicated bytes, ordered descending.
Changes have been made since the last update of this project, and I'm currently trying to rewrite it into a C project with sqlite backend and better detection of file changes, but I lack the time for this to happen any time soon.
There's a form of deduplication supported by the Linux kernel, if you use the logical volume manager. If you create a base LVM device, and then create a snapshot of that device, the snapshot only requires sufficient real estate on the host physical volume to store the diffs between the snapshot and the base. You can use this for "freezing" a file system to do back-ups, or for incremental back-ups, or whatever.
My rather limited experience with this is that, if you have more than a few snapshots on a base device, your write performance degrades very rapidly. There's also a hard limit of 255 snapshots per device.
You can also do file-based deduplication with the "rsnapshot" tool, which has been available for many years.
Also also, I haven't kept up, but I seem to recall that ZFS for linux was promising this as a major selling point.
2*3*3*3*3*11*251
Have a look at bup (https://github.com/apenwarr/bup). Though still very new and missing some features, it's pretty stable, fast (at deduplication, restore is another story) and very effective. Better ratio than lessfs, sdfs and zfs. I'm rather impressed how it sucks in full partition images, several hundred gigabytes, each day at my place. It's meant for virtual machine images. File based backup is also included, but still missing meta data AFAIK.
Also, have a look at Btrfs. Btrfs does not include deduplication per se yet, but you can use volume snapshots and rsync --inplace so it only does block level changes within updated files, if filenames don't change (watch /var/log).
Do I think it's totally optimized and stable? No. But ZFS is now on Linux. Please read on my blog for a setup which I have put together using the smarts of many good people.
Enjoy the reading
http://solution-wanfuse.blogspot.com/
No Warranties are intended or implied. Use at your own risk!
Try out Cyphertite ( https://www.cyphertite.com/versions.php ); it does encrypted backup with deduplication across hosts that share crypto keys.
The software is open source, runs on *nix, and compresses, encrypts and dedups data for backups.
OpenBSD has had the Epitome deduplication framework for some time. I believe version 2 is considered production-ready.
Hands down, Cyphertite. Been using the beta version for quite a while without issues. It is developed by some OpenBSD developers and runs on my FreeBSD and Linux machines. I guess they have a Windows version coming out soon too. Fully recommended.
Consider that subscribers can see stories early (which time goes with a story? live for subscribers or live for all?)
and they are also available on the firehose-
You can add comments on the firehose too. If it's converted to a story, would the comment time match?
every day http://en.wikipedia.org/wiki/Special:Random
The future of ZFS and the product that was OpenSolaris has really started to take shape over the last few months and there is a lot of good work going on around it. Illumos has set up a proper foundation that will be shepherding the development of their OS and ZFS fork. They've got some good commercial backing (Nexenta, Joyent, and others), and many of the original ZFS engineers from Sun are actively involved in the development. A lot of work is going on right now in terms of revamping the versioning scheme to ensure some level of feature interoperability between "open" ZFS and "Oracle" ZFS (assuming Oracle chooses not to play ball in the long run).
If you're looking for an inbetween solution, Nexenta is at an interesting place in the market. They are an order of magnitude cheaper than the tier 1 providers, but you're not completely on your own if you still have interest in some sort of commercial support contract. For the record I'm not affiliated with them in any way other than being a satisfied customer.
I'll also echo the previous comments about ZFS dedup and RAM - you need enough memory for the entire dedup table to fit in RAM (or a fast L2ARC SSD) or performance will tank. There is a formula buried in the documentation somewhere for determining requirements based on the size of your pool.
Oracle is aggressively marketing its ZFS-SA (Storage Appliance) as a competitor to the likes of NetApp and EMC. http://www.oracle.com/us/products/servers-storage/storage/nas/overview/index.html
OpenSolaris has a brighter future than you think but it is now under the name Illumos. All the SUN developers went to work for either Joyent or Nexenta, both of which maintain open source OpenSolaris distros.
ZFS is also available as a native filesystem on FreeBSD, and as a FUSE filesystem on Linux found here http://zfs-fuse.net/ Don't let FUSE scare you. For a backup system or a NAS used primarily for backups, you won't notice any performance issues.
ZFS is where the action is when it comes to a forward looking full-featured open source filesystem. However do be careful with dedup. Ever heard of those 1K .zip files which expand to fill your entire hard drive? All dedup technology is a cousin of .ZIP and if you don't plan things out you can easily create a system on which you don't have enough free disk space to do certain maintenance activities. But if you do your research you will learn how to configure your zpools, etc. to avoid these problems. Open source based NAS is more likely to have a problem because people tend to cut corners and skimp on the hardware. Nothing wrong with saving money, just make sure that you factor in the fact that you will occasionally need to recover filesystems or transition filesystems off a bad disk onto a new good one.
The article says that commercial de-dupe solutions are too expensive. It's not cheap technology, that's for sure.
But it's unreasonable to expect a solution for free to such a complex problem. Rather than try to find a solution that "is ready for customer deployment", they should be seeking out a solution whose design meets their goals, and FUND that project. Too many companies think open source means free. And it does, technically.
But if you want quality open source solutions, you're either going to have to pay to help speed up development, or you're going to have to wait for years while people work on it in their spare (and unpaid!) time and hope that the key developers don't abandon the project completely. And make no mistake, with any open source project of any size, there are key developers who produce the majority of the code. Software with as many contributors as the Linux kernel are very, very rare.
Most are more like Eclipse, where a HUGE chunk of the funding comes from one company (IBM in the case of Eclipse.)
It's high time companies started to realize that open source is a way of sharing technology, not a free as in beer provider for your every whim and want.
I do not fail; I succeed at finding out what does not work.
And one that knows gets modded down as an MS Shill for talking about how NTFS now supports DeDuplication.
Most of the posters seem to be confusing copy-on-write with DeDup. Rsync cannot dedup. Time Machine is not dedup. Dedup means different files (not just different *versions* of files) share links to the same block, written just once (or twice) to conserve space. Rsync with hard links and Time Machine are just copy-on-write mechanisms. Similar but different!
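To make that definition concrete, here is a toy block store in which two different files (not two versions of one file) end up pointing at the same stored blocks. It is only a sketch: the class, its tiny fixed block size and the in-memory dictionaries are all invented for illustration and do not reflect how any real filesystem lays things out.

```python
import hashlib


class BlockStore:
    """Toy fixed-size-block store: each unique block is kept only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes, stored once
        self.files = {}   # file name -> list of block digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # no-op if already stored
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = BlockStore(block_size=4)
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAAXXXXCCCC")
print(store.stored_bytes())  # 16, although the two files total 24 bytes
```

Whole-file hard linking would save nothing here, because the two files differ; only the shared first and last blocks are stored once.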
But dedupe is 80% snake oil.
First reason, redundancy. If your backup policy specifies that you're making copies of the data on a regular basis, and you then proceed to delete all the copies but one, why are you making the copies in the first place? Maybe instead of dedupe you need better backup software that can detect the redundancy and simply choose not to back up what hasn't changed.
Second, performance. By its very nature dedupe degrades over time; the dedupe vendors battle this in a number of ways (secondary caches, inverted dedupe streams, etc). Eventually, though, streams of data become sequences of tokens scattered all over the storage system in nearly random patterns. It's the same as having a file system where every 4k of data in a file requires head seeks.
Scale, as the amount of physical capacity increases the need to maintain hash lookups for that physical capacity increases. These hash lookups pretty much must reside in physical RAM. This leads to expensive dedupe nodes or massive dedupe inefficiency if the data is split across multiple standalone dedupe appliances.
Price, the above limits tend to drive the price per deduped MB up beyond the price of RAID arrays from 2nd tier vendors like nexan, acnc, etc.
Now all that said, there are places where dedup provides an advantage, but they are few and far between. That is because not everyone gets 30:1 (especially if their backup software isn't doing full backups). Generally the dedupe systems are at the bottom of the performance curve, and not everyone is willing to grow their backup window or slow down their apps either. That tends to leave them in fairly low-end environments often better served by inexpensive RAID boxes.
I've put a deduplication technology comparison table here.
Most deduplication technologies are SIS (deduplicates only entire file content at a time - requires O(n^2) storage for large, slow growing files) or fixed-width blocks (has problems if you change a byte at the beginning or middle of a file).
The strongest offerings do sliding block (or what I've been calling variable-length, content-based blocking). These don't have problems with inserting or changing or deleting a byte somewhere in a file.
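A minimal sketch of the idea, for readers who have not seen it: cut a chunk wherever a rolling hash of the last few bytes hits a fixed bit pattern, so boundaries follow the content instead of absolute offsets. This is a generic illustration and not Backshift's code; the window size, mask and hash constants are arbitrary choices for the example.

```python
import random


def chunk(data, window=16, mask=0x3F):
    """Split bytes where a rolling hash of the last `window` bytes matches.

    Boundaries depend only on nearby content, not on absolute offsets.
    """
    base, mod = 257, (1 << 61) - 1
    drop = pow(base, window, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod
        if i >= window:
            h = (h - data[i - window] * drop) % mod
        # Only a full window may end a chunk.
        if i >= window - 1 and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_blocks(data, size=64):
    """Fixed-width blocking, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(20000))
new = b"!" + old  # one byte inserted at the very beginning

for split in (chunk, fixed_blocks):
    before, after = split(old), split(new)
    print(split.__name__, len(before), len(set(before) & set(after)))
```

With content-defined chunks, everything after the first chunk of the old data reappears unchanged in the new data, so only a chunk or two needs to be stored again. With fixed-width blocks the single inserted byte shifts every block boundary, and essentially nothing matches.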
I've designed and coded a backup system that does variable-length, content-based blocking for deduplication, called Backshift. It's very nearly ready to hit 1.0, mostly just needing a few more users to try it out. It's got a comprehensive automated test suite, lots of documentation, runs on all the major Pythons except IronPython (fastest on Pypy - Pypy's even faster than CPython combined with Cython for this application), and has been ported to a variety of Linuxes, DragonflyBSD, FreeBSD, OS/X, Cygwin, Haiku, Solaris and Open Indiana.
In addition to the deduplication, it compresses the deduplicated chunks with xz (with a bzip2 fallback if none of the 3 methods of doing xz work). Also, if it notices that a chunk grows when compressed, that chunk gets stored uncompressed - so backing up a file that was already compressed pretty hard doesn't require increased storage in the backup repository.
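The "store it raw if compression makes it bigger" rule is simple to sketch with Python's standard lzma module. The one-byte tags below are a made-up convention for this example and say nothing about the actual on-disk format Backshift uses.

```python
import lzma

RAW, XZ = b"\x00", b"\x01"  # hypothetical one-byte storage tags


def pack_chunk(chunk):
    """Compress a chunk with xz, but keep it raw if that would not be smaller."""
    packed = lzma.compress(chunk)
    if len(packed) < len(chunk):
        return XZ + packed
    return RAW + chunk


def unpack_chunk(stored):
    """Reverse pack_chunk by looking at the leading tag byte."""
    tag, body = stored[:1], stored[1:]
    return lzma.decompress(body) if tag == XZ else body
```

A run of zeros packs down to a small xz stream, while already-compressed or random-looking data comes back tagged as raw, costing one extra byte instead of the xz container overhead.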
Use:
The subdirectory specifications are accelerated quite a bit over what plain tar would give.
It sometimes acts so much like tar (especially for restores and listing contents) that you might be tempted to think that it's storing things as tar archives behind the scenes - but it's not.
You can still use OpenIndiana which is based on Illumos, the fork of OpenSolaris. A lot of the dev guys at Sun have left the company and now contribute code to the project.
In fact one killer feature is that they ported KVM to Illumos. Now you can run VMs and take advantage of the many features of zfs including deduplication.
BackupPC is a free disk-based backup system for Linux. It's based on rsync, but written in perl. It supports file-level dedupe, which is most of what you want for backups, and installation on RHEL5 is just a yum install (if you have RPMForge/EPEL configured) and a couple service starts.
It handles scheduling backups of multiple systems concurrently, up to the number you decide, within a timeframe you decide. Will send emails and alert on the web interface if any systems failed, or are over-due and haven't been backed-up. You can assign a (non-admin) user to a given system, so they can login to the web interface and manually launch backups and view and restore deleted or old revisions of files to the live system at-will, as well as getting email notiifications for their systems.
It supports backups over ssh, rsh, smb, nfs, native rsyncd, and probably others. It has a very polished web interface to configure absolutely everything about the backup process (except the initial ssh key setup), and even allows you to coonfigure pre commands to have it tell the OS to make a snapshot (typically LVM on Linux, and VSS on Windows) and backup from the snapshot.
Above and beyond rsync (or rsnapshot for people that can't write a 2-line shell script with rsync) BackupPC has a few features like doing a full checksum on X% of unchanged files, which makes for a very good sanity check. It also runs as a non-root user, while still preserving devices, permissions, etc. The downsides versus scripting rsync are that BackupPC is still locked into the full/incremental model, and after dozens of incrementals, performance really suffers until a "full" backup is performed. It doesn't properly handle the copyright banners some systems will print, whereas native rsync behaves properly. And it has its own filesystem format that mangles filenames and such, but there's a FUSE driver (also in Perl) that allows you to virtually mount it, which is very important... when I last looked, said FUSE script needed a one-line fix for either raw or block device nodes, I forget which.
In general, BackupPC is the way to go. It's pretty flexible, polished, capable, and provides reliability, which you want in a backup system more than anywhere else. The issues could be fixed by an interested, halfway decent Perl programmer with some time. It's really very, very close to being the ideal backup solution and surpassing the proprietary solutions, with the above issues being the big stumbling blocks.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
If you don't think OpenSolaris has a future, fair enough. FreeBSD does. FreeBSD currently supports ZFS v28, which has dedup. Be aware you need plenty of RAM.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
For things like DV footage, I don't bother with the dedup.
Even with M. Night Shyamalan movies?
BTRFS is far from stable, but its COW-based design allows simple file-based de-duping using cp --reflink. Some information can be found here.
We've been using deduplication products, for backup purposes
Here is your problem. Backups should not be copied in myriads of instances and deduplicated on top of that insanity. They should be incremental.
Contrary to the popular belief, there indeed is no God.
High-rely.com has a free deduping utility for Windows called HRDeDupe. A bit simplistic, but designed for use in scripting.
http://high-rely.com/HR3/includes/downloads.php
I wrote a program some time ago that does block-level dedupe within single files. It's not actually any good on a live filesystem at all - I made it as a compressor for drive backups and VM images - but it can be used to at least estimate space savings. If you run a drive (It'll take a block device under linux) through this, the resulting file size will tell you potentially how much capacity deduplication could save under ideal circumstances. If this software can't shrink the drive, don't bother even looking into deduplication: It isn't going to work.
It isn't really good enough for general use (things like hard-coded references to /tmp/ precluding a Windows compressor), just a little project I made as a proof-of-concept. Takes a ridiculous time to run too, but that's the cost of dedup.
http://birds-are-nice.me/programming/BLDD.shtml
https://github.com/dsrbecky/DeltaZip/downloads
"Creates zip-like files which can reference data from other zip files and thus avoids storing it twice. It is intended for incremental backups. It works on sub-file level using rolling hashes and is therefore very fast and guarantees to find any data duplication. Parallel computation makes it faster then normal zip. License: MIT"
Post processing is the cheap way out, as it takes otherwise idle CPU time to get the job done. But it has a number of drawbacks and one is restorability.
Think about it..
You have a 1TB disk. You fill it up with data. Windows post-processes this and brings it down to 500GB. Cool! So you add another 500GB and a bit more, and eventually you end up with, say, 2TB worth of data on a 1TB disk. Obviously you are a prudent admin, so you make backups of your data. 2TB backups.
And then it happens. The drive dies. Luckily you have a spare disk, so you quickly put a fresh 1TB disk in. And you start restoring..
Guess what happens at 50% of your restore?? *DISK FULL*...
So either you have to buy double the amount of disk, or you have to pause the restore each time just before the disk fills up, let Windows redo its post-processing dedupe and continue. Hmmm. Wonder if your boss will like that.
Think twice before you adopt post-processing dedupe
To Terminate, or not to Terminate, that's the question - SCSIROB
Let's take the marketing claim that dedup will save you 10x on storage.
And let's assume that your tier-1 vendor charges you $2000/TB (raw) for their dedup appliance.
That's an effective cost of their solution of $200/TB of deduplicated data, which is close enough to the cost of a hard drive to ignore, especially when you talk about corporate budgets.
Now, from experience, $2000/TB is a low price for a dedup appliance. $5000 is closer to the mark. That means you get to spend $500/TB on your whitebox non-dedup storage array, which means you can easily mirror (RAID-10) your collection of SATA drives, and still hit your target price.
--Joe
It is trivial. I wrote a Python script to do it. It is licensed under GPL 3. It uses hard links, and I run it every time I back up a significant amount of data to my 10 TB fileserver. As it uses hard links, it is designed for storing backup data, not for general usage. It can be found at http://jdeifik.com/ under 'Disk deduplicator'
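For readers who want the general shape of hard-link dedup, here is a rough sketch in Python. It is not the script linked above: the names `file_digest` and `dedupe_tree` are invented for this example, it hashes every file with SHA-256 rather than doing anything clever, and it assumes everything under `root` lives on one filesystem (hard links cannot cross filesystems).

```python
import hashlib
import os

def file_digest(path, bufsize=1 << 20):
    """SHA-256 of a file's contents, read in 1 MiB pieces."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(bufsize):
            h.update(block)
    return h.hexdigest()

def dedupe_tree(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were relinked.
    """
    seen = {}      # digest -> path of the first file seen with that content
    relinked = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, devices, sockets, etc.
            keeper = seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already hard-linked together
            # Link under a temporary name, then rename over the duplicate,
            # so the original is never missing if something goes wrong.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            relinked += 1
    return relinked
```

Hard-linked names share one inode, so editing one "copy" changes them all and ownership, permissions, and timestamps are shared too. That is fine for write-once backup trees and is the reason this approach is not for general use.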
I was doing similar research a few days ago.
Some of these are already mentioned...
Other stuff:
By fixed block I mean that the file system does not search out shared data when the blocks are not on block boundaries. So if you add one byte to the beginning of a 10 GB file, and that has the unfortunate consequence of rippling up through all the blocks that make the file, then there will be no block level sharing with the original file. Of course that's a pathological case, but you get the idea.
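The one-byte-shift case is easy to try. This is a toy sketch in Python, not how ZFS or any other filesystem implements dedup: `block_hashes` is a name made up here, and the 4096-byte block size and the random test data are arbitrary choices for the demonstration.

```python
import hashlib
import os

def block_hashes(data: bytes, block_size: int = 4096) -> set:
    """Hash every fixed-size block of data, cut at multiples of block_size."""
    return {
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }

original = os.urandom(64 * 4096)   # 64 distinct blocks of random data
shifted = b"\x00" + original       # one byte added at the beginning
appended = original + b"\x00"      # one byte added at the end

# Prepending moves every block boundary, so no block matches any more.
shared_after_prepend = len(block_hashes(original) & block_hashes(shifted))
# Appending leaves the first 64 blocks untouched, so all of them match.
shared_after_append = len(block_hashes(original) & block_hashes(appended))

print("shared blocks after prepend:", shared_after_prepend)
print("shared blocks after append: ", shared_after_append)
```

Variable-block (content-defined) chunking avoids the prepend problem by choosing split points from the data itself, so the boundaries move along with the content.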
Original poster, perhaps you could keep us informed of your findings? There's at least me who is also interested.
Check http://www.exdupe.com/ which is open source and uses sliding window deduplication like rsync, but multithreaded and a lot faster. It's a nice simple command-line tool, not a huge client/server setup or file system.
Do you need dedup with dynamic block sizes? Or is fixed block size enough? Comparing Data Domain (or similar products) with ZFS's dedup (or most other primary storage "dedup" filesystems) is comparing apples and oranges.
Yoiks, you're right. It's been over a year since I wrote an inotify interface, and I forgot that! In real use - well, OK, in my real use anyway - if I try to set up a completely recursive structure that adds new watches as new folders are created, the number of inotify events due to normal user operations becomes so high that events start getting dropped with IN_Q_OVERFLOW yadda yadda yadda. This turned out not to be a problem for me specifically but that was only because I wasn't de-duping, I was just triggering events on client file transfers, which were restricted to specific folders anyway.
I like being corrected, because I don't like being wrong. Thanks!! And thanks for the Stearns link, too - I may use some of that code (with soft links instead of hard links, though, for a particular use case I have in mind).
Haven't they already been shipping their "Storage Server" platform with integrated de-dupe? Seems to be a feature in 08R2 from their ads.
And Linux has ZFS with de-dupe, if you care for FUSE anyway.
CRC32 is easy to forge, and due to the birthday paradox, it's likely to even happen accidentally if you have a few thousand files of the same size. How is SHA-1 slow in a disk- or network-bound application?
This second check should also bring the chance of a file being incorrectly deduplicated due to a collision as close to zero as possible
Which means you end up having to fully rescan any file that has been involved in a CRC32 collision.
so the script will give even more than the certainty of SHA1 with nearly the speed and low CPU usage of CRC32.
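To put rough numbers on the birthday-paradox point, here is a back-of-the-envelope sketch in Python. It assumes hash values are uniformly distributed, which is an idealization and not a guarantee about CRC32 on real files, and it uses the standard approximation 1 - exp(-n(n-1)/2N). The function name `collision_probability` is made up for this example.

```python
import math

def collision_probability(n_files: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n_files share a hash value.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**hash_bits)),
    assuming uniformly distributed hash values.
    """
    space = 2 ** hash_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_files * (n_files - 1) / (2 * space))

if __name__ == "__main__":
    for n in (5_000, 20_000, 77_163, 200_000):
        print(f"{n:>7} files: 32-bit {collision_probability(n):.4f}, "
              f"160-bit {collision_probability(n, 160):.3e}")
```

Under that assumption a 32-bit checksum stays well under 1% for a few thousand files and reaches roughly even odds at about 77,000, while a 160-bit hash stays negligible throughout. This only covers accidental collisions, and deliberate forgery is a separate matter.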
Answers to this question on SO imply that on an x86-64 CPU, both Skein and SHA-1 can process 300 MB/s. So unless you're already using SATA 6G, you're still disk-bound.
The application isn't sure to be disk-bound or network-bound especially in the age of SSDs
If you're backing up an SSD to an HDD, you're still disk-bound.
Time Machine, Apple OS.
Time Machine, Apple OS.
Doesn't do much good for all of the Windows, UNIX and Linux servers I'm backing up...
Any insufficiently advanced magic is indistinguishable from technology.
No problems, I'll spec up a brand new XServe and make that happen.