Ask Slashdot: Free/Open Deduplication Software?
First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though there's some active discussion.
That deduplication for NTFS is really interesting, actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.
Some technical details about the deduplication process:
Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.
After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self describing, so a deduplication NTFS volume can be read by another server without any external data.
There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.
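To make the chunk-store idea concrete, here is a minimal in-memory sketch. It is not Microsoft's implementation; the class name `ChunkStore`, the SHA-256 index, and zlib compression are all assumptions made for illustration. The only detail taken from the description above is the rule that a chunk referenced more than a threshold number of times (100 by default) gets a second copy.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy chunk store: each unique chunk is compressed and stored once."""

    def __init__(self, hot_threshold=100):
        self.hot_threshold = hot_threshold  # 100 mirrors the default quoted above
        self.chunks = {}    # digest -> compressed chunk
        self.refcount = {}  # digest -> number of references to the chunk
        self.mirror = {}    # digest -> extra copy of heavily referenced chunks

    def put(self, chunk):
        """Store a chunk (or just count another reference) and return its digest."""
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.chunks:
            self.chunks[digest] = zlib.compress(chunk)
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        if self.refcount[digest] > self.hot_threshold:
            # A real system would write this copy to a different location on disk.
            self.mirror.setdefault(digest, self.chunks[digest])
        return digest

    def get(self, digest):
        """Return the original bytes of a stored chunk."""
        return zlib.decompress(self.chunks[digest])
```

A file then becomes a list of digests, and reading it back is a matter of calling `get` on each digest in order.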
Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.
New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all. So when are we going to see Linux equivalents? Because Linux is getting behind on the new technologies.
...includes dedupe.
There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
changelog == stagnated
Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.
These posts express my own personal views, not those of my employer
ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems, who sell expensive NAS and SAN systems so they have an incentive to keep it improving.
I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.
I am TheRaven on Soylent News
you could run a nightly script to find duplicates and then deduplicate them.
an example of this would be to find all new files since the last run, checksum them and compare the result to the checksums of all previously examined files. Once you find the likely duplicates you can decide how careful you want to be about verifying identity of both data and metadata. For example, do you want to preserve the attribute dates if the data is identical? for some programs the creation dates matter. Likewise file permissions might be different.
once you find the duplicate then just erase one of them and create a hard link to the other if it's on the same filesystem or, if you dare, a softlink to the file on another filesystem.
This is not hard, and it very closely approximates what Netapp and Apple TimeMachine do.
Alternatively, if what you are really trying to do is not eliminate duplicates in an active file system but merely keep snapshots for backup then the problem is much simpler.
on BSD unix:
mv snapshot oldsnapshot
mkdir snapshot
cd snapshot
(cd ../oldsnapshot && find -d . | cpio -pdl ../snapshot)
rsync -aE source/ ./
where ../oldsnapshot is the old backup of your data
snapshot is the new backup
source is the thing you want to backup.
voila you have an endless set of snapshots in which no file is ever duplicated. However, metadata like file ownership and unix flags are not preserved in the old snapshots.
Some drink at the fountain of knowledge. Others just gargle.
FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.
Trolling is a art,
I've used LessFS. On my "server" powered by an Intel Atom, it is very very slow. It writes at about 5MB/sec, even when everything is inside a ram disk.
You can't use a block size of 4KB, otherwise write speeds are around 256KB/sec, need to use at least 64KB.
Look, no duplicates!
Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.
Active developer and Open Source:
http://www.lessfs.com/wordpress/
Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.
If you're looking for a free backup product that supports deduplication, look at backuppc. Powerful and complex, but free. I've used it for years with good results.
So you want dragonfly BSD with a hammer filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.
Question is how bright the future will be with Oracle effectively going their own way with ZFS.
You should be aware that ZFS and most block based deduplication products don't handle streams from NetBackup very well. This is due to block size variability, which invalidates the block by block deduplication algorithms. Information on this issue is readily available if you search for it.
For ZFS options, you can try the following.
-Sun/Oracle sells a ZFS storage array called the Amber Road.
-Check out Nexenta. They provide a supported version of OpenSolaris complete with high availability options if you so desire.
-You can also check out Open Indiana, a fork of OpenSolaris.
As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.
I am TheRaven on Soylent News
Why not just do the following every once in a while:
1. go through all your files,
2. for each file, compute a checksum (e.g. using the unix tools md5sum or sha1sum),
3. for pairs of files giving the same checksum, compare them (optionally) and if equal remove one of them and make it a hard-link to the other.
It would surprise me if there was no free open-source script doing exactly this.
If Pandora's box is destined to be opened, *I* want to be the one to open it.
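The three steps above can be sketched in a few lines of Python. This is a rough illustration, not an existing tool: the function names `file_digest` and `hardlink_duplicates` are made up here, SHA-1 stands in for the checksum of step 2, and the byte-for-byte comparison is the optional check of step 3. It assumes everything under `root` lives on one filesystem (hard links cannot cross filesystems), and the replaced file takes on the permissions and timestamps of the copy that is kept.

```python
import filecmp
import hashlib
import os


def file_digest(path, bufsize=1 << 20):
    """SHA-1 of a file's contents, read in blocks to keep memory use flat."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical regular files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    seen = {}  # digest -> path of the first file seen with that content
    linked = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            keeper = seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first occurrence, or already linked
            if not filecmp.cmp(keeper, path, shallow=False):
                continue  # same checksum but different bytes: leave alone
            tmp = path + ".dedup-tmp"  # hypothetical temporary name
            os.link(keeper, tmp)       # create the link first...
            os.replace(tmp, path)      # ...then swap it in atomically
            linked += 1
    return linked
```

Note that once two paths share an inode, a program that rewrites the file in place changes it under both names, so this only suits data that is not modified afterwards.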
However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.
Slagborr
it needs incredible amount of memory to operate effectively.
from my university notes:
5TB data, average blocksize 64K = 78125000 blocks
for each block the dedup needs 320 bytes so
78125000 x 320 byte = 25 GB dedup table
use compression instead. (eg zfs compression)
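The arithmetic in those notes is easy to redo for other pool sizes. A small sketch, assuming decimal units (1 TB = 10^12 bytes, 64K = 64,000 bytes) as the figures above imply, and taking the 320 bytes per block straight from the notes; the function name `dedup_table_bytes` is made up for this example.

```python
def dedup_table_bytes(data_bytes, avg_block_bytes, entry_bytes=320):
    """Return (number of blocks, dedup table size in bytes)."""
    blocks = data_bytes // avg_block_bytes
    return blocks, blocks * entry_bytes


blocks, table = dedup_table_bytes(5 * 10**12, 64 * 10**3)
print(blocks, table / 10**9)  # 78125000 25.0
```

Halving the average block size doubles the table, which is why small-block workloads are the ones that hurt.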
I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication
Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.
Secession is the right of all sentient beings.
The change log shows an update two months ago, how's that stagnant?
Because the original date on the post is from April 2010?
They modified a previous post to put all of the changes in one place, dufus.
Yes and currently using the CE version. It is native ZFS version 15 due to the fact that it is running on OpenSolaris. I chose Nexentastor CE for this reason. Excellent performance, great GUI, available CLI and access to the underlying file system with a few commands. Run the 64-bit version, as it chokes in 32-bit. The HCL is small, you will have to spend time reading the OpenSolaris HCL. It took me a few tries to get the hardware right, MB, CPU, memory, JBOD controllers. I am glad I went through the trouble, I learned a lot and most importantly got away from HW RAID which is inferior in performance, has no dedupe, and does not detect bitrot.
I have recently written some personal software that finds duplicate files/directories. I had no idea there was such a demand for something like this (and price for such software).
There are a few hurdles with deduplication that a piece of software will likely need your input for:
- Does something like the directory/file names matter? If so, does case matter? What about comparing ASCII names with Unicode ones? Do you compare the DOS 8.3 names also?
- What do you want to do when you find duplicates? Delete the duplicates and put links in their place? Which one do you want the links to point to? Maybe remove the duplicates... so which one do you keep?
- Does metadata need to be compared too? What about the data itself? Different file systems have different gotchas. NTFS can have separate streams, which do not necessarily follow a file. If your file system supports snapshots, what are you really wanting to compare?
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
md5sum `find . -type f` | sort
...and so on
http://michaelsmith.id.au
OpenSolaris is Dead. Long live http://openindiana.org//
It means, however, that ZFS is now forked. ZFS volumes from future Solaris releases may be incompatible with future versions of FreeBSD (or IllumOS or whatnot). And the new approach that they're taking to define which new features are supported is flexible, but opens the door to situations where you can't mount a filesystem because your OS is missing some individual feature that the devs chose not to implement (ZFS currently goes by backwards compatible versions of the filesystem, the new forked version works based on feature flags).
To be honest, I don't see the latter being much of a problem, but the former (lack of inter compatibility with "official" ZFS) is annoying. Perhaps the forked version should change the name to something other than ZFS to avoid confusion.
Both deduplication and conventional compression are a questionable idea on a file server which has many clients.
The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.
Yeah right. I'll wait for real usage numbers rather than "the vendor selling this stuff to me says it's fucking awesome".
I just use rsync from the command line to do deduplication. Been working like a charm for years.
First I sync from the remote directory to a local base directory:
rsync --partial -z -vlhprtogH --delete root@www.mydomain.net:/etc/ /backup/server/www/etc/base/
Then I sync that to the daily backup. Files that have not changed are hard-linked between all the days that share them. It is very efficient and simple, and retrieving files is as simple as doing a directory search.
rsync -vlhprtogH --delete --link-dest=/backup/server/www/etc/base/ /backup/server/www/etc/base/ /backup/server/www/etc/2012-01-04
-Dave
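For anyone wondering what --link-dest buys you, the idea can be sketched in plain Python. This is an illustration only, not how rsync is implemented: the function name `snapshot` is made up, the sketch compares file contents (rsync by default looks at size and modification time), and it ignores symlinks, deletions and special files.

```python
import filecmp
import os
import shutil


def snapshot(source, previous, dest):
    """Copy source into dest, hard-linking files unchanged since previous."""
    for dirpath, _dirs, files in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(dest, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            old = os.path.normpath(os.path.join(previous, rel, name))
            new = os.path.normpath(os.path.join(dest, rel, name))
            if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: share the old copy's blocks
            else:
                shutil.copy2(src, new)  # new or changed: store a fresh copy
```

Each dated directory then looks like a full backup, but an unchanged file costs only a directory entry.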
I'm not sure how much development it gets these days, but BackupPC looked reasonable; it just wasn't bombproof enough for the systems I looked at it for. It primarily uses rsync to copy files (changes only) and then "deduplicates" the rsynced copies. It was actually "deduplication" via non-lossy compression, but in reality they are similar technologies: both look for repeated data, make a record of the repetition and what is repeated, and remove the repetition. It doesn't shrink your primary storage at all; it's purely about duplicated sets of data.
It appears you aren't aware of the fact that very key members of the ZFS developers community (read: kernel code hackers) on FreeBSD have regular and direct interaction with folks doing IllumOS. Martin Matuska (mm@freebsd.org) is currently the front-man for this task.
It's all presented and documented publicly on the zfs-devel mailing list on freebsd.org. Read it if you wish. Check out November 2011, for example.
So, there is a sharing of ideas, code, and implementations/models on a regular basis. Therefore, the only "beast" you need to worry about is interoperability between ZFS on Solaris (meaning Oracle's commercial product) and ZFS on IllumOS/OpenIndiana or ZFS on FreeBSD. Got it? Good. :-)
backuppc is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably windows) PCs, it's great with *nix.
http://backuppc.sourceforge.net/
Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.
It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've dd' copied, hook them up, and start to run restores.
The hard links mean lots and lots of disk head seeks, so you are doing random i/o on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSD's to copy my backup onto. Since there are no seeks on SSDs it will be much faster.
I think file-level de-dupe is usually a lot less effective because it can't accommodate files that differ only slightly but are otherwise the same, whereas block-level de-dupe works with everything.
I also don't know what happens in your scheme when you have "de-duped" a file that's the same in 4 different directories but then one application wants to change "its" version of the file. It sounds like it trashes the file for the three other uses of it since there's no way to automate copy-on-write with your shell script but maybe my clue isn't working.
Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).
Try Squashfs which creates deduplicated and compressed filesystem archives (http://www.linux-mag.com/id/7357/ for a good journal article).
If you're using Ubuntu, Debian, or Fedora, Squashfs will already be built into your distro kernel, and squashfs-tools will already be available in your distro repository.
Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
Yes, but in that case, two multi-GB files that share all of their data except one bit will not be deduplicated. The difference between your approach and Microsoft's is grounded in the same thought process that makes modern compression algorithms better than older ones:
First you treat all files separately, which is really simple but has the drawback of missing chances for compression/dedup across files. This is what deflate (ZIP/GZIP) and your approach to dedup do. The same data simply gets compressed twice -- or in your case not deduplicated if the data is even marginally different. You will never reach the maximum space-saving that way, even though you can at least be sure to be reasonably fast.
Then you notice that files sharing most data should only be compressed/deduped once and then just linked together. The easiest way to do that is to cut up the files into blocks and compress them. If two blocks are the same, you don't recompress them but just put in a link to the previous compression. This is what (roughly speaking) BZIP2, RAR, ACE and some other formats do. In deduping terms this means creating multi-level hashes for each file. It works much better, but has the price of being more complex and time consuming.
Finally, you notice that cutting up files at fixed boundaries is also wasteful. If two blocks are the same, but one has all bytes shifted left one position, you needlessly waste space. Thus, you try to dynamically cut up the files/stream into chunks that you have already compressed, plus a handful of spare bytes here or there or with a very simple substitution/transposition function applied. This is (extremely roughly speaking) what LZMA of 7-Zip fame does and what Microsoft tries to do differently in their dedup approach.
Of course, going that way is even MORE complex and time consuming, but may be well worth it, if space-saving is what you're interested in. After all, there is no such thing as a free lunch -- you either pay with time or with space (or with general applicability in some corner cases).
So, all in all, the approach itself is not new -- neither yours nor Microsoft's -- but the magic lies in actually creating a working product out of the theoretical approach outlined above.
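The weakness of fixed boundaries is easy to demonstrate. The following sketch hashes fixed 4 KB blocks of two buffers and reports what fraction of the second buffer's blocks already exist in the first; the helper names `fixed_blocks` and `shared_fraction` and the block size are assumptions chosen for the demonstration, not taken from any product.

```python
import hashlib
import os


def fixed_blocks(data, size=4096):
    """Cut data into consecutive blocks of at most size bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(a, b, size=4096):
    """Fraction of b's blocks whose hash already occurs among a's blocks."""
    known = {hashlib.sha256(blk).digest() for blk in fixed_blocks(a, size)}
    blocks = fixed_blocks(b, size)
    hits = sum(hashlib.sha256(blk).digest() in known for blk in blocks)
    return hits / len(blocks)


original = os.urandom(64 * 4096)
# One byte changed in place: only the block containing it differs.
patched = original[:5000] + bytes([original[5000] ^ 0xFF]) + original[5001:]
# One byte inserted at the front: every block boundary moves.
shifted = b"X" + original

print(shared_fraction(original, patched))  # 0.984375 (63 of 64 blocks match)
print(shared_fraction(original, shifted))  # 0.0 for random data like this
```

An in-place change costs one block, while a single inserted byte defeats fixed-block dedup for the whole file.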
http://sourceforge.net/projects/staticfiledups/
If you have a large collection of files, particularly files that do not change often, this set of scripts can _help_ you manage them. It stores a database of files (currently as a flat text file), so that they're only hashed once, and will provide you a list of duplicate files on asking. There's a script to check each file in a group of duplicates and link them together, and a script to remove any number of duplicates from a group, ensuring that there is at _least_ one copy remaining somewhere. There's another script to help you determine which file to keep, based on a simple (complex?) rule-set of what directories are more important than which other directories.
Should you wish, it'll also tell you which directories have the greatest number of duplicated bytes, ordered descending.
Changes have been made since the last update of this project, and I'm currently trying to rewrite it into a C project with sqlite backend and better detection of file changes, but I lack the time for this to happen any time soon.
There's a form of deduplication supported by the Linux kernel, if you use the logical volume manager. If you create a base LVM device, and then create a snapshot of that device, the snapshot only requires sufficient real estate on the host physical volume to store the diffs between the snapshot and the base. You can use this for "freezing" a file system to do back-ups, or for incremental back-ups, or whatever.
My rather limited experience with this is that, if you have more than a few snapshots on a base device, your write performance degrades very rapidly. There's also a hard limit of 255 snapshots per device.
You can also do file-based deduplication with the "rsnapshot" tool, which has been available for many years.
Also also, I haven't kept up, but I seem to recall that ZFS for linux was promising this as a major selling point.
2*3*3*3*3*11*251
It means, however, that ZFS is now forked.
True, but this is only a problem if you were planning on ever getting Oracle gear... since this is about free solutions, that shouldn't be an issue :)
Besides, the current ZFS implementation in FreeBSD is compatible with Oracle's version - so it's not currently a practical concern.
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Have a look at bup (https://github.com/apenwarr/bup). Though still very new and missing some features, it's pretty stable, fast (at deduplication, restore is another story) and very effective. Better ratio than lessfs, sdfs and zfs. I'm rather impressed how it sucks in full partition images, several hundred gigabytes, each day at my place. It's meant for virtual machine images. File based backup is also included, but still missing meta data AFAIK.
Also, have a look at Btrfs. Btrfs does not include deduplication per se yet, but you can use volume snapshots and rsync --inplace so it only does block level changes within updated files, if filenames don't change (watch /var/log).
I'm aware of all this, and I think that generally it can work out well, but it doesn't guarantee future compatibility if somebody wants to go and do their own thing and then that thing becomes popular. And the interoperability between ZFS and Solaris was my primary concern, although to be honest I'll probably never run "real" Solaris again since it's not free anymore, so I probably shouldn't care about that either.
OpenBSD has had the Epitome deduplication framework for some time. I believe version 2 is considered production-ready.
Consider that subscribers can see stories early (which time goes with a story? live for subscribers or live for all?)
and they are also available on the firehose-
you can add comments on the firehose too.. if converted to a story- would the comment time match?...
every day http://en.wikipedia.org/wiki/Special:Random
The future of ZFS and the product that was OpenSolaris has really started to take shape over the last few months and there is a lot of good work going on around it. Illumos has set up a proper foundation that will be shepherding the development of their OS and ZFS fork. They've got some good commercial backing (Nexenta, Joyent, and others), and many of the original ZFS engineers from Sun are actively involved in the development. A lot of work is going on right now in terms of revamping the versioning scheme to ensure some level of feature interoperability between "open" ZFS and "Oracle" ZFS (assuming Oracle chooses not to play ball in the long run).
If you're looking for an inbetween solution, Nexenta is at an interesting place in the market. They are an order of magnitude cheaper than the tier 1 providers, but you're not completely on your own if you still have interest in some sort of commercial support contract. For the record I'm not affiliated with them in any way other than being a satisfied customer.
I'll also echo the previous comments about ZFS dedup and RAM - you need enough memory for the entire dedup table to fit in RAM (or a fast L2ARC SSD) or performance will tank. There is a formula buried in the documentation somewhere for determining requirements based on the size of your pool.
Oracle is aggressively marketing its ZFS-SA (Storage Appliance) as a competitor to the likes of Netapp and EMC. http://www.oracle.com/us/products/servers-storage/storage/nas/overview/index.html
The article says that commercial de-dupe solutions are too expensive. It's not cheap technology, that's for sure.
But it's unreasonable to expect a solution for free to such a complex problem. Rather than try to find a solution that "is ready for customer deployment", they should be seeking out a solution whose design meets their goals, and FUND that project. Too many companies think open source means free. And it does, technically.
But if you want quality open source solutions, you're either going to have to pay to help speed up development, or you're going to have to wait for years while people work on it in their spare (and unpaid!) time and hope that the key developers don't abandon the project completely. And make no mistake, with any open source project of any size, there are key developers who produce the majority of the code. Software with as many contributors as the Linux kernel is very, very rare.
Most are more like Eclipse, where a HUGE chunk of the funding comes from one company (IBM in the case of Eclipse.)
It's high time companies started to realize that open source is a way of sharing technology, not a free as in beer provider for your every whim and want.
I do not fail; I succeed at finding out what does not work.
And one that knows gets modded down as an MS Shill for talking about how NTFS now supports DeDuplication.
Most of the posters seem to be confusing copy-on-write with DeDup. Rsync cannot dedup. Time Machine is not dedup. Dedup means different files (not just different *versions* of files) share links to the same block, written just once (or twice) to conserve space. Rsync with hard links and Time Machine are just copy-on-write mechanisms. Similar but different!
But dedupe is 80% snake oil.
First reason, redundancy. If your backup policy specifies that you're making copies of the data on a regular basis, and you then proceed to delete all the copies but one, why are you making the copies in the first place? Maybe instead of dedupe you need better backup software that can detect the redundancy and simply choose not to back up what hasn't changed.
Second, performance. By its very nature dedupe degrades over time; the dedupe vendors battle this in a number of ways (secondary caches, inverted dedupe streams, etc.). Eventually, though, streams of data become sequences of tokens scattered all over the storage system in nearly random patterns. It's the same as having a file system where every 4k of data in a file requires head seeks.
Scale: as the amount of physical capacity increases, the need to maintain hash lookups for that physical capacity increases. These hash lookups pretty much must reside in physical RAM. This leads to expensive dedupe nodes or massive dedupe inefficiency if the data is split across multiple standalone dedupe appliances.
Price: the above limits tend to drive the price per deduped MB up beyond the price of RAID arrays from 2nd tier vendors like nexan, acnc, etc.
Now all that said, there are places where dedup provides an advantage, but they are few and far between. That is because not everyone gets 30:1 (especially if their backup software isn't doing full backups). Generally the dedupe systems are at the bottom of the performance curve and not everyone is willing to grow their backup window or slow down their apps either. That tends to leave them in fairly low end environments often better served by inexpensive raid boxes.
I've put a deduplication technology comparison table here.
Most deduplication technologies are SIS (deduplicates only entire file content at a time - requires O(n^2) storage for large, slow growing files) or fixed-width blocks (has problems if you change a byte at the beginning or middle of a file).
The strongest offerings do sliding block (or what I've been calling variable-length, content-based blocking). These don't have problems with inserting or changing or deleting a byte somewhere in a file.
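Here is a minimal sketch of variable-length, content-based blocking, to show why an inserted byte does not disturb it. It uses a Gear-style rolling hash (the kind FastCDC builds on) purely as an example; the gear table, the size limits and the names `cdc_chunks` and `shared_fraction` are assumptions for this illustration and are not taken from Backshift or any other product mentioned here.

```python
import hashlib
import os

# One fixed pseudo-random 64-bit value per possible byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data, avg_bits=12, min_size=1024, max_size=32768):
    """Cut data wherever a rolling hash of the most recent bytes hits a pattern."""
    boundary_mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The low avg_bits bits of h depend only on the last avg_bits bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & boundary_mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(a, b):
    """Fraction of b's chunks whose hash already occurs among a's chunks."""
    known = {hashlib.sha256(c).digest() for c in cdc_chunks(a)}
    new = cdc_chunks(b)
    return sum(hashlib.sha256(c).digest() in known for c in new) / len(new)


original = os.urandom(1 << 18)
shifted = b"X" + original  # one byte inserted at the very front
print(shared_fraction(original, shifted))  # close to 1.0: only the first chunk(s) differ
```

Because a cut point depends only on the bytes just before it, the chunking falls back into step shortly after the inserted byte and the remaining chunks hash identically.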
I've designed and coded a backup system that does variable-length, content-based blocking for deduplication, called Backshift. It's very nearly ready to hit 1.0, mostly just needing a few more users to try it out. It's got a comprehensive automated test suite, lots of documentation, runs on all the major Pythons except IronPython (fastest on Pypy - Pypy's even faster than CPython combined with Cython for this application), and has been ported to a variety of Linuxes, DragonflyBSD, FreeBSD, OS/X, Cygwin, Haiku, Solaris and Open Indiana.
In addition to the deduplication, it compresses the deduplicated chunks with xz (with a bzip2 fallback if none of the 3 methods of doing xz work). Also, if it notices that a chunk grows when compressed, that chunk gets stored uncompressed - so backing up a file that was already compressed pretty hard doesn't require increased storage in the backup repository.
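The "store it uncompressed if compression makes it bigger" rule is simple to sketch with Python's standard lzma module. This is only an illustration of the rule as described above, not Backshift's actual code; the tags and the function names `pack_chunk` and `unpack_chunk` are made up, and the bzip2 fallback is left out.

```python
import lzma
import os


def pack_chunk(chunk):
    """Return (tag, payload): xz-compressed unless that would not shrink the chunk."""
    packed = lzma.compress(chunk)
    if len(packed) < len(chunk):
        return "xz", packed
    return "raw", chunk


def unpack_chunk(tag, payload):
    """Invert pack_chunk."""
    return lzma.decompress(payload) if tag == "xz" else payload


print(pack_chunk(b"a" * 10000)[0])        # xz
print(pack_chunk(os.urandom(10000))[0])   # raw: random data does not compress
```

The one-word tag is what lets already-compressed input (zip files, video) pass through at its original size.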
Use:
The subdirectory specifications are accelerated quite a bit over what plain tar would give.
It sometimes acts so much like tar (especially for restores and listing contents) that you might be tempted to think that it's storing things as tar archives behind the scenes - but it's not.
You can still use OpenIndiana which is based on Illumos, the fork of OpenSolaris. A lot of the dev guys at Sun have left the company and now contribute code to the project.
In fact one killer feature is that they ported KVM to Illumos. Now you can run VMs and take advantage of the many features of zfs including deduplication.
BackupPC is a free disk-based backup system for Linux. It's based on rsync, but written in perl. It supports file-level dedupe, which is most of what you want for backups, and installation on RHEL5 is just a yum install (if you have RPMForge/EPEL configured) and a couple service starts.
It handles scheduling backups of multiple systems concurrently, up to the number you decide, within a timeframe you decide. It will send emails and alert on the web interface if any systems failed, or are overdue and haven't been backed up. You can assign a (non-admin) user to a given system, so they can log in to the web interface and manually launch backups and view and restore deleted or old revisions of files to the live system at will, as well as getting email notifications for their systems.
It supports backups over ssh, rsh, smb, nfs, native rsyncd, and probably others. It has a very polished web interface to configure absolutely everything about the backup process (except the initial ssh key setup), and even allows you to configure pre commands to have it tell the OS to make a snapshot (typically LVM on Linux, and VSS on Windows) and back up from the snapshot.
Above and beyond rsync (or rsnapshot for people that can't write a 2-line shell script with rsync) BackupPC has a few features like doing a full checksum on X% of unchanged files which makes for a very good sanity check. It also runs as a non-root user, while still preserving devices, permissions, etc. The downsides versus scripting rsync are that BackupPC is still locked into the full/incremental model, and after dozens of incrementals, performance really suffers until a "full" backup is performed. It doesn't properly handle the copyright banners some systems will print, whereas native rsync will behave properly. And it has its own filesystem format that mangles filenames and such, but there's a FUSE driver (also in perl) that allows you to virtually mount it, which is very important... when I last looked, said FUSE script needed a one-line fix for either raw or block device nodes, I forget which.
In general, BackupPC is the way to go. It's pretty flexible, polished, capable, and provides the reliability you want in a backup system more than anywhere else. The issues could be fixed by an interested halfway decent perl programmer with some time. It's really very, very close to being the ideal backup solution and surpassing the proprietary solutions, with the above issues being the big stumbling blocks.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
If you don't think OpenSolaris has a future, fair enough. FreeBSD does. FreeBSD currently supports ZFS v28, which has dedup. Be aware you need plenty of RAM.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
For things like DV footage, I don't bother with the dedup.
Even with M. Night Shyamalan movies?
We've been using deduplication products, for backup purposes
Here is your problem. Backups should not be copied in myriads of instances and deduplicated on top of that insanity. They should be incremental.
Contrary to the popular belief, there indeed is no God.
I wrote a program some time ago that does block-level dedupe within single files. It's not actually any good on a live filesystem at all - I made it as a compressor for drive backups and VM images - but it can be used to at least estimate space savings. If you run a drive (It'll take a block device under linux) through this, the resulting file size will tell you potentially how much capacity deduplication could save under ideal circumstances. If this software can't shrink the drive, don't bother even looking into deduplication: It isn't going to work.
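For anyone who just wants the "how much could dedup save" number, here is a minimal sketch of that kind of estimate. It is not the program described above (its internals aren't given here); it assumes fixed 4096-byte blocks and SHA-256 fingerprints, and the names `estimate_dedup` and `savings_ratio` are made up for illustration.

```python
import hashlib
import io


def estimate_dedup(stream, block_size=4096):
    """Return (total_blocks, unique_blocks) for a binary stream.

    Assumption: fixed-size blocks, fingerprinted with SHA-256.
    """
    seen = set()
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total += 1
        seen.add(hashlib.sha256(block).digest())
    return total, len(seen)


def savings_ratio(total, unique):
    """Fraction of blocks that an ideal block-level dedup could drop."""
    return 0.0 if total == 0 else 1.0 - unique / total


# Synthetic "drive image": 8 blocks, only 3 distinct ones.
image = (b"A" * 4096) * 4 + (b"B" * 4096) * 3 + (b"C" * 4096)
total, unique = estimate_dedup(io.BytesIO(image))
print(total, unique, savings_ratio(total, unique))  # 8 3 0.625
```

The same function should work on a file or block device opened with `open(path, "rb")`. Memory grows with the number of unique blocks, so a large drive needs a lot of RAM for the fingerprint set.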
It isn't really good enough for general use (things like hard-coded references to /tmp/ precluding a Windows compressor), just a little project I made as a proof-of-concept. Takes a ridiculous time to run too, but that's the cost of dedup.
http://birds-are-nice.me/programming/BLDD.shtml
Post processing is the cheap way out, as it takes otherwise idle CPU time to get the job done. But it has a number of drawbacks and one is restorability.
Think about it..
You have a 1TB disk. You fill it up with data. Windows post-processes this and brings it down to 500GB. Cool! So you add another 500GB and a bit more, and eventually you end up with, say, 2TB worth of data on a 1TB disk. Obviously you are a prudent admin, so you make backups of your data. 2TB backups.
And then it happens. The drive dies. Luckily you have a spare disk, so you quickly put a fresh 1TB disk in. And you start restoring..
Guess what happens at 50% of your restore?? *DISK FULL*...
So either you have to buy double the amount of disk, or you have to pause the restore each time just before the disk fills up, let Windows redo its post-processing dedupe and continue. Hmmm. Wonder if your boss will like that.
Think twice before you adopt post-processing dedupe
To Terminate, or not to Terminate, that's the question - SCSIROB
Let's take the marketing claim that dedup will save you 10x on storage.
And let's assume that your tier-1 vendor charges you $2000/TB (raw) for their dedup appliance.
That's an effective cost of their solution of $200/TB of deduplicated data, which is close enough to the cost of a hard drive to ignore, especially when you talk about corporate budgets.
Now, from experience, $2000/TB is a low price for a dedup appliance. $5000 is closer to the mark. That means you get to spend $500/TB on your whitebox non-dedup storage array, which means you can easily mirror (RAID-10) your collection of SATA drives, and still hit your target price.
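The arithmetic above, spelled out as a quick sketch. The 10x ratio and the $2000 and $5000 per raw TB prices are the figures from this post; halving the budget for a mirror is my own reading of the RAID-10 remark, and the function name is made up.

```python
def effective_cost_per_tb(raw_cost_per_tb, dedup_ratio):
    """Cost per TB of logical (pre-dedup) data stored on the appliance."""
    return raw_cost_per_tb / dedup_ratio


for raw_price in (2000, 5000):
    logical_price = effective_cost_per_tb(raw_price, 10)
    # With a two-way mirror, every logical TB needs two raw TB,
    # so the per-raw-TB disk budget is half the logical price.
    mirrored_raw_budget = logical_price / 2
    print(raw_price, logical_price, mirrored_raw_budget)
# 2000 200.0 100.0
# 5000 500.0 250.0
```

If the real dedup ratio turns out lower than the marketing 10x, the appliance's effective price per TB goes up in proportion.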
--Joe
It is trivial. I wrote a Python script to do it. It is licensed under GPL 3. It uses hard links, and I run it every time I back up a significant amount of data to my 10 TB fileserver. As it uses hard links, it is designed for storing backup data, not for general usage. It can be found at http://jdeifik.com/ under 'Disk deduplicator'
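For readers wondering what file-level dedup with hard links looks like, here is a rough sketch. It is not the script linked above; it assumes everything under `root` lives on one filesystem, identifies duplicates by size plus SHA-256, and ignores ownership and permission differences. The name `hardlink_duplicates` is made up.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    first_seen = {}  # (size, digest) -> first path seen with that content
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            key = (os.path.getsize(path), file_digest(path))
            original = first_seen.setdefault(key, path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked
            tmp = path + ".dedup-tmp"
            os.link(original, tmp)   # create the link under a temporary name
            os.replace(tmp, path)    # then rename it over the duplicate
            linked += 1
    return linked


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    for name, content in (("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"other")):
        with open(os.path.join(root, name), "wb") as f:
            f.write(content)
    print(hardlink_duplicates(root))
```

The caveat is the same one given above: once two names share one inode, writing through either name in place changes both, so this only suits data that is never modified in place, such as backups.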
I was doing similar research a few days ago.
Some of these are already mentioned...
Other stuff:
By fixed block I mean that the file system does not search out shared data when the blocks are not on block boundaries. So if you add one byte to the beginning of a 10 GB file, and that has the unfortunate consequence of rippling up through all the blocks that make the file, then there will be no block level sharing with the original file. Of course that's a pathological case, but you get the idea.
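The one-byte-shift case is easy to reproduce. The sketch below compares fixed 4096-byte blocks against a toy content-defined chunker that cuts wherever a CRC32 of the previous 16 bytes is 0 mod 256. That chunker is my own illustration, not what any product mentioned here does, and it recomputes the checksum at every position instead of using a proper rolling hash.

```python
import hashlib
import random
import zlib


def fixed_blocks(data, block_size=4096):
    """Split data on fixed block boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def content_defined_chunks(data, window=16, divisor=256):
    """Cut after position i when crc32 of the preceding window is 0 mod divisor.

    Boundaries depend only on nearby content, not on absolute offsets.
    """
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(chunks_a, chunks_b):
    """Number of distinct chunks (by SHA-256) present in both lists."""
    def digests(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}
    return len(digests(chunks_a) & digests(chunks_b))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
shifted = b"x" + original  # one byte added at the front

print("fixed blocks shared:", shared(fixed_blocks(original), fixed_blocks(shifted)))
cdc = content_defined_chunks(original)
print("content-defined chunks shared:",
      shared(cdc, content_defined_chunks(shifted)), "of", len(set(cdc)))
```

With fixed blocks nothing lines up after the shift, while the content-defined chunker loses at most its first chunk. Real implementations also enforce minimum and maximum chunk sizes, which this sketch leaves out.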
Original poster, perhaps you could keep us informed of your findings? There's at least me who is also interested.
"Just" is such a dangerous word when deleting data.
You can download William Stearns's fully-debugged version, though:
http://www.stearns.org/freedups/
inotify can't watch an entire filesystem. No current *notify kernel hooks can offer this, unfortunately.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Do you need dedup with dynamic block sizes? Or is fixed block size enough? Comparing data domain (or similar products) with ZFS's dedup (or most other primary storage "dedup" filesystems) is comparing apples and oranges.
Not really. There are real-world advantages to both, and you'll have to examine your specific environment to see what's best. Generally, block-level will work better for giant database files where there are very localized changes, file level will work better for systems with zillions of small files (mailservers etc.), and nothing really works well with giant compressed image files because very small changes can cause every single block to change after compression (the same is true of large databases that are frequently re-orged for performance).
It's not necessarily a problem, depends on the filesystem and use case. In snapshot systems based on rsync --link-dest the toolset handles that issue transparently to the user. And if the files in question absolutely shouldn't be forked (payroll files, perhaps?) then this might be desired behavior.
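For reference, a sketch of how an rsync --link-dest snapshot command is usually put together. The paths and snapshot names are hypothetical and the helper `rsync_snapshot_command` is made up; it only builds the argument list and does not run anything.

```python
import shlex


def rsync_snapshot_command(source, snapshot_root, new_name, previous_name=None):
    """Build an rsync argv for a hard-linked snapshot.

    Files unchanged since `previous_name` become hard links into that
    snapshot instead of new copies.
    """
    root = snapshot_root.rstrip("/")
    cmd = ["rsync", "-a", "--delete"]
    if previous_name is not None:
        # Absolute path, so it does not depend on the destination directory.
        cmd.append(f"--link-dest={root}/{previous_name}")
    cmd += [source.rstrip("/") + "/", f"{root}/{new_name}/"]
    return cmd


cmd = rsync_snapshot_command("/home", "/backup/snapshots", "2024-01-02", "2024-01-01")
print(shlex.join(cmd))
```

By default rsync writes a changed file to a temporary name and renames it into place, so the copy in the older snapshot keeps its own inode and is left untouched. That is the "transparent" handling mentioned above; it no longer holds if you add --inplace.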
You really do need to know what you are doing, though - that's why I started this thread by recommending a competent linux sysadmin. Somebody who knows what you are trying to accomplish, what the tools can do, how to write an inotify interface, etc... if you don't want to pay for highly skilled staff this is not the road to take. Rather often it's worthwhile for a business to invest in high quality people, though.
Well, I back up 12 terabytes a night, so I am amazed at your awesomeness, that you consider this trivial.
Yoiks, you're right. It's been over a year since I wrote an inotify interface, and I forgot that! In real use - well, OK, in my real use anyway - if I try to set up a completely recursive structure that adds new watches as new folders are created, the number of inotify events due to normal user operations becomes so high that events start getting dropped with IN_Q_OVERFLOW yadda yadda yadda. This turned out not to be a problem for me specifically but that was only because I wasn't de-duping, I was just triggering events on client file transfers, which were restricted to specific folders anyway.
I like being corrected, because I don't like being wrong. Thanks!! And thanks for the Stearns link, too - I may use some of that code (with soft links instead of hard links, though, for a particular use case I have in mind).
You can implement de-duplication at whatever level is appropriate to your needs. Most people do it at file level; I use it to reduce my nightly rsync snapshot load. Gets me more than 90% savings in disk space because of my specific use case. Understanding what you need and how you can efficiently satisfy those needs is the key to good systems architecting and management.
Several people have pointed out that block-level de-dup is inherently best suited to being implemented in the filesystem, but if your toolset (such as compression utilities, for example) wasn't written to suit such filesystems, you can still get screwed. Again, it depends on your use case - does your database backup software change every bit in every block if one byte in the first 512 changes? If so, any form of de-dup may get you nothing - especially if you keep those backups on a dedicated partition on your hot backup site - you'll just be wasting processor time.
There's no substitute for knowing what you're doing, unfortunately. One-size-fits-all solutions usually don't.
I agree on the competent, but I think you're stretching "competent sysadmin" to "skilled systems developer" if you're including the ability to write I/O interfaces to enable copy-on-write file dedupe.
As for dedupe, I see the situational advantage to file dedupe but it seems most products I run across with dedupe are based on block level dedupe, I guess because it can effectively dedupe files without the overhead of copy-on-write for a whole file.
That's very interesting! Thank you - I will look into BZIP2 more deeply as time permits.
My experience in the field has been that premature compression can be the bane of efficient business continuity planning. Real life example: your client wants to make nightly offsite backups of a live, highly active email system. This can be done using a combination of LVM snapshotting and rsync --link-dest (and you could multicast that backup to multiple sites if rsync batch mode actually worked, which I'm sure it will someday). But if there are re-organization and compression jobs already running on the source system, you'll run out of bandwidth, because there will be too many daily changes and the client can't afford more than a couple T1s. If you stop doing disk space optimization on the source system, instead just adding more hard drives (use cheap multi-terabyte AoE arrays if necessary), you may be able to bring down the number of changes found by rsync's block checksumming to where you can fit easily within the site-to-site WAN bandwidth constraints. Now you've got to worry about db performance, so you shove the mail into maildirs and use the filesystem as your db and you're good to go.
As you point out, the magic lies in creating a working system from all the abstract theory. Typically you have to make compromises in one area to suit another... which is why I think a competent sysadmin is the key to getting any systems job done right. Buying fancy products or worshipping the approach of a particular vendor just doesn't cut it once you've passed a certain level of complexity.
CRC32 is easy to forge, and due to the birthday paradox, it's likely to even happen accidentally once you have tens of thousands of files of the same size. How is SHA-1 slow in a disk- or network-bound application?
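To put numbers on the birthday paradox, here is a quick sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/2^(b+1)) for n items and a b-bit hash. It assumes the hash values are uniformly distributed, which is an idealization for CRC32.

```python
import math


def collision_probability(n_items, hash_bits):
    """Approximate chance that at least two of n_items share a hash value."""
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)


for n in (5_000, 20_000, 77_163, 200_000):
    print(n, round(collision_probability(n, 32), 4))

# Same question for a 160-bit hash such as SHA-1, with a billion files.
print(collision_probability(10 ** 9, 160))
```

Under that assumption a 32-bit checksum reaches roughly even odds of some accidental collision at about 77,000 same-size files, while a 160-bit hash stays astronomically far from that at any realistic file count. Deliberate forgery is a separate matter.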
You've got a very good point about my high expectations for competence... I plead guilty due to advanced age!
When I got started you didn't call yourself a sysadmin if you couldn't parse a core dump. You had to have already been a systems programmer, and you usually didn't get to do that until after you'd been a successful apps programmer. Nowadays most systems are so much simpler to administrate (and core dumps so much less useful, too) that you don't necessarily need programming experience to achieve some minimal level of competence.
I personally still wouldn't hire an admin who couldn't code to a standardized syscall, though. If you ignore accreditation (college degrees etc.) and just focus on actual ability, you can find some really smart, capable people out there looking for IT jobs.
That's very interesting! Thank you - I will look into BZIP2 more deeply as time permits.
Do note, though, that BZIP2 only follows my description very roughly and indirectly. While it does split up the input stream into what its documentation calls "blocks", it will not actually cross-link them; it still processes the different blocks separately. The cross-linking compression is only done inside the blocks and indirectly through the properties of the so-called "Burrows-Wheeler Transform" -- so in a way BZIP2 forms "sub-blocks" inside its blocks. This is a necessary compromise due to its stream-based nature.
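The block-independence point can be checked with Python's standard bz2 module. This is a small experiment, not a benchmark: it uses random (incompressible) data so that the only possible saving is the repeated copy, and it assumes the default compression level with its 900 kB block size.

```python
import bz2
import os

# Two copies 1,000,000 bytes apart: further apart than one bzip2 block.
far = os.urandom(1_000_000)
far_ratio = len(bz2.compress(far + far)) / len(far)

# Two copies 100,000 bytes apart: both fit inside a single block.
near = os.urandom(100_000)
near_ratio = len(bz2.compress(near + near)) / len(near)

# A ratio near 2.0 means the second copy cost as much as the first.
print("copies in different blocks:", round(far_ratio, 2))
print("copies in the same block:  ", round(near_ratio, 2))
```

The far-apart copies should come out at about twice the size of one copy, i.e. no sharing across blocks, while the copies inside one block compress noticeably better. That gap is what separates a stream compressor from a dedup store, which can match chunks at any distance.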
So, in other words: While dedup and compression do follow the same general approaches as I outlined above, their specific implementations of course do differ in some big, but also sometimes deviously little ways. So when I say "roughly speaking", I really do mean roughly speaking. :)
This second check should also bring the chance of a file being incorrectly deduplicated due to a collision as close to zero as possible
Which means you end up having to fully rescan any file that has been involved in a CRC32 collision.
so the script will give even more than the certainty of SHA1 with nearly the speed and low CPU usage of CRC32.
Answers to this question on SO imply that on an x86-64 CPU, both Skein and SHA-1 can process 300 MB/s. So unless you're already using SATA 6G, you're still disk-bound.
The application isn't sure to be disk-bound or network-bound especially in the age of SSDs
If you're backing up an SSD to an HDD, you're still disk-bound.
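The two-stage check being argued about can be sketched like this: a cheap key (size plus CRC32) picks candidates, and a second, exact comparison confirms them. This is not the script under discussion; it works on in-memory byte strings for brevity, uses full content equality as the second stage (an assumption on my part), and the name `find_duplicates` is made up.

```python
import zlib
from collections import defaultdict


def find_duplicates(blobs):
    """blobs: dict of name -> bytes. Returns sorted groups of identical blobs."""
    # Stage 1: cheap candidate key. Different content can still collide here.
    candidates = defaultdict(list)
    for name, data in blobs.items():
        candidates[(len(data), zlib.crc32(data))].append(name)

    groups = []
    for names in candidates.values():
        if len(names) < 2:
            continue  # unique cheap key, so no second pass needed
        # Stage 2: group by exact content, which settles any CRC32 collision.
        confirmed = defaultdict(list)
        for name in names:
            confirmed[blobs[name]].append(name)
        groups.extend(sorted(g) for g in confirmed.values() if len(g) > 1)
    return sorted(groups)


blobs = {"a": b"hello", "b": b"hello", "c": b"world", "d": b"hello!"}
print(find_duplicates(blobs))  # [['a', 'b']]
```

As noted above, every candidate from stage 1 has to be read a second time in full, so the saving only materializes when most files have a unique size and CRC32.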
Time Machine, Apple OS.
Time Machine, Apple OS.
Doesn't do much good for all of the Windows, UNIX and Linux servers I'm backing up...
Any insufficiently advanced magic is indistinguishable from technology.
No problems, I'll spec up a brand new XServe and make that happen.