Slashdot Mirror


Ask Slashdot: Free/Open Deduplication Software?

First time accepted submitter ltjohhed writes "We've been using deduplication products, for backup purposes, at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is the Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; the changelog appears stagnant, though there's some active discussion.

64 of 306 comments

  1. I've wanted deduplication for a long time! by GPLJonas · · Score: 4, Interesting
    And now, even the next version of Windows Server will contain integrated data deduplication technology! So Linux devs better get working on similar features. I still cannot figure out how NTFS can support compressing files and folders but Linux cannot.

    That deduplication for NTFS is really interesting, actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.

    Some technical details about the deduplication process:

    Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.
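
    The split-point idea can be illustrated with a minimal sketch of content-defined chunking in general. This is not Microsoft's algorithm, whose details the description above does not give; the toy rolling hash, the function names and the tiny chunk sizes (bytes rather than the 32 KB to 128 KB quoted) are all assumptions chosen to keep the example short.

```python
import hashlib


def split_points(data, mask=0x3F, min_size=16, max_size=256):
    """Yield (start, end) chunk boundaries chosen from the data itself."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Toy rolling hash (hypothetical): bytes older than 32 positions shift out.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        # Cut where the low hash bits match the mask, or when the chunk gets too big.
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)


def chunk_digests(data):
    """Return the SHA-256 digest of every content-defined chunk, in order."""
    return [hashlib.sha256(data[s:e]).hexdigest() for s, e in split_points(data)]


def fixed_digests(data, size=64):
    """Same, but with fixed-size blocks, for comparison."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]
```

    Inserting a single byte at the front of a file shifts every fixed-size block, so none of them match any more, whereas content-defined boundaries tend to realign after the first chunk or two and most chunk digests stay the same.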

    After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self describing, so a deduplication NTFS volume can be read by another server without any external data.

    There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.

    Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Compressed files are not re-compressed based on filetype, so zip files, Office 2007+ files, etc will be skipped and just deduped.

    New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.

    The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all. So when are we going to see Linux equivalents? Because Linux is getting behind on the new technologies.

    1. Re:I've wanted deduplication for a long time! by nanoflower · · Score: 4, Insightful

      The most likely answer is when someone is willing to pay for it. What you have described above isn't a trivial effort and it's unlikely someone is going to do the work for free so it will have to wait until someone is willing to pay for the development. Even then it's likely that the developer may keep it closed source in order to recoup the investment.

    2. Re:I've wanted deduplication for a long time! by lucm · · Score: 5, Insightful

      And now, even the next version of Windows Server will contain integrated data deduplication technology! [...] The most interesting thing is that Microsoft Research says it doesn't affect performance almost at all.

      Well ask anyone who lost documents on DoubleSpace volumes or got corrupted media files on Windows Home Server and they will tell you that even if Microsoft Research says so, it's not something I would put on my production servers any time soon.

      --
      lucm, indeed.
    3. Re:I've wanted deduplication for a long time! by Ironchew · · Score: 5, Funny

      When will Adblock Plus block these Microsoft ads on Slashdot?

    4. Re:I've wanted deduplication for a long time! by 19thNervousBreakdown · · Score: 2, Insightful

      This has got to be some sort of smear campaign against Microsoft, because I cannot believe that they would think that bludgeoning people with astro-turf is going to get them sales. The first two articles with MS shilling I saw (today!) I wrote off as just people sharing interesting stuff that happened to come from MS, but thanks to your over-heavy hand, the pattern is clear as a bell now.

      So tell me MS marketing people, are you seriously this incompetent, or did a new astro-turf campaign incentive get out of hand? I'm honestly curious how this happened.

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
    5. Re:I've wanted deduplication for a long time! by Anonymous Coward · · Score: 2, Informative

      Interesting link, but it doesn't look like Microsoft has actually released this yet and it is only slated to be released with Win 8 server, and it will come with some caveats.

      FTFA:
      "It is a server-only feature, like so many of Microsoft’s storage developments. But perhaps we might see it deployed in low-end or home servers in the future.
      It is not supported on boot or system volumes.
      Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. And of course the connected system must be running a server edition of Windows 8.
      Although deduplication does not work with clustered shared volumes, it is supported in Hyper-V configurations that do not use CSV.
      Finally, deduplication does not function on encrypted files, files with extended attributes, tiny (less than 64 kB) files, or re-parse points."

    6. Re:I've wanted deduplication for a long time! by Richard_at_work · · Score: 2, Insightful

      Heh, the comment is completely on topic, interesting and factual and yet we still have fucktards insisting that any mention of MS must have been paid for.

      Get a life.

    7. Re:I've wanted deduplication for a long time! by Anonymous Coward · · Score: 4, Interesting

      I have often wondered why someone doesn't use the rsync algorithm as a basis for this kind of chunking and deduplication. I imagine a FUSE-based filesystem that breaks the application-level files into checksummed pieces and stores both the file fragments and file descriptions into an underlying filesystem. Then it could reconstruct the application-level files on demand, using the description to draw out the right fragments.

      From an academic point of view, it already solves the same problem and just needs some repackaging. It breaks arbitrary data into phrases to be identified by checksum and located in another existing corpus of data. It just needs a metadata model to record the structure of the file as composed of these canonical phrases, rather than performing the actual file reconstruction immediately as rsync does now.
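
      A minimal sketch of the metadata model described above: file contents go into a pool of checksummed fragments, and each file is kept only as a list of fragment checksums from which it is rebuilt on demand. The class and method names are hypothetical, and fixed-size fragments stand in for rsync's rolling-checksum matching to keep the example short; the FUSE layer is not shown.

```python
import hashlib


class FragmentStore:
    """Hypothetical in-memory model: fragments stored once, files stored as recipes."""

    def __init__(self, fragment_size=4096):
        self.fragment_size = fragment_size
        self.fragments = {}  # checksum -> fragment bytes, stored once
        self.recipes = {}    # file name -> ordered list of checksums

    def put(self, name, data):
        """Break data into fragments and record the file as a list of checksums."""
        recipe = []
        for offset in range(0, len(data), self.fragment_size):
            piece = data[offset:offset + self.fragment_size]
            checksum = hashlib.sha256(piece).hexdigest()
            self.fragments.setdefault(checksum, piece)  # keep only the first copy
            recipe.append(checksum)
        self.recipes[name] = recipe

    def get(self, name):
        """Reconstruct the file on demand from its recipe."""
        return b"".join(self.fragments[c] for c in self.recipes[name])

    def stored_bytes(self):
        """Bytes actually held in the fragment pool."""
        return sum(len(piece) for piece in self.fragments.values())
```

      Storing two 12-byte files that differ only in their middle 4-byte fragment would hold 16 bytes of fragments rather than 24, and both files still read back intact.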

      From my cynical point of view, I realize someone may have patented the repackaging, in the same way that Apple seems to think they can re-patent every idea "on a smartphone".

    8. Re:I've wanted deduplication for a long time! by SuricouRaven · · Score: 3, Interesting

      NTFS's file compression actually rather sucks. Space saving is minimal under all but ideal conditions. It's a common problem in filesystem-level compression - the need to be able to read a file without seeking very far, or reconstructing the entire stream. Compression ratio is seriously compromised to achieve that.

    9. Re:I've wanted deduplication for a long time! by dltaylor · · Score: 2

      One instance by default is brain-dead. Lose that and they're ALL gone, if it happens between dedup and backup. You should have at least two copies, on different media, if possible, always, of any data with more than one reference, as well as backing up your data, of course.

    10. Re:I've wanted deduplication for a long time! by the_B0fh · · Score: 2

      Even Richard Stallman agreed that there are cases where GPL is not necessary. So stop being an ass.

    11. Re:I've wanted deduplication for a long time! by anomaly256 · · Score: 4, Informative

      Considering Linux does have this capability in a few FS drivers now (ok.. some more stable than others, sure) I think the GP should be modded troll rather than the post pointing out it's likely a shill... too bad I'm out of mod points

    12. Re:I've wanted deduplication for a long time! by PedXing · · Score: 2

      If you're referring to GPLJonas' comment as containing "perfect grammar, good use of white, buzz words and directness" then I'll take that as a compliment! Everything but the first three and last paragraphs were copied from my blog post! (http://blog.fosketts.net/2012/01/03/microsoft-adds-data-deduplication-ntfs-windows-8/)

      FWIW, I'm not a Microsoft shill or astroturfer. I'm a blogger who happened to write about MS' dedupe yesterday and it got quoted here. I only noticed this thread thanks to all the Slashdotters coming over to have a read. Hi!

    13. Re:I've wanted deduplication for a long time! by jimicus · · Score: 5, Interesting

      Also, perhaps the reason that Linux does not support file-system compression on the fly is because it's a horrid idea, and should never actually be used?

      Ah, the "Terrible idea" objection.

      This is a common objection to implementing ideas on Linux - so common, in fact, that it's successfully held Linux back for at least ten years.

      Multi-master LDAP replication? Terrible idea. Remained terrible for several years after literally every commercial LDAP server on the planet supported multi-master replication, only became non-harmful when OpenLDAP started to support it in version 2.4.

      Active Directory support? Such a terrible idea that it's held Samba development back by at least five years. Even now, where Windows Vista deprecates NT4-style policies and 7 deprecates NT 4 domain support altogether (which is about all you get from Samba 3); Samba 4 is considered alpha software.

      Some sort of centralised work-together system that integrates email, address book, calendars, task-list? Terrible idea. So much so that Exchange (despite being way too complicated for its own good) is still an extremely popular email solution and the closest you can get to a viable F/OSS alternative either requires your users to completely re-think how they collaborate (yuck) or buy the commercial version simply because the free version lacks vital features.

      Free clue to all naysayers who work on F/OSS projects: If you spent as long trying to think of ways to make something work as you do thinking of objections to existing implementations and explaining how you're right and everyone else is wrong, you wouldn't be ten years behind the times.

    14. Re:I've wanted deduplication for a long time! by jamesh · · Score: 2

      NTFS's file compression actually rather sucks.

      It's fine as long as you use it properly. I use it for IIS logfiles. I want to keep the logfiles but rarely actually access them, and they are append only, and they are plain text. Very high compression at a very small loss of performance.

      Compressing binary data in your working set is, as you point out, probably a bad idea, but as long as you don't do anything stupid you shouldn't have any problems.

    15. Re:I've wanted deduplication for a long time! by smash · · Score: 2

      This is why you use Zfs, which can detect and correct corruption.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
  2. Dragonfly BSD's HAMMER... by Anonymous Coward · · Score: 5, Interesting

    ...includes dedupe.

    There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.

    1. Re:Dragonfly BSD's HAMMER... by TheRaven64 · · Score: 4, Funny

      No, that's the opposite of deduplication...

      --
      I am TheRaven on Soylent News
  3. Acronis by syousef · · Score: 2

    Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.

    --
    These posts express my own personal views, not those of my employer
    1. Re:Acronis by Anonymous Coward · · Score: 2, Informative

      Acronis is a bloody NIGHTMARE to deal with. We have a mixed shop here, and after seeing what Acronis does on Windows I vetoed the idea of having it on our mission critical linux servers.

      I have never seen such a useless backup product before I started working with Acronis. Most backup systems let you set it up once and they WORK. Acronis is always getting itself wedged (dur, a metadata file I miswrote yesterday is corrupt, I will just hang), and when wedged it hangs ALL backup jobs, not just the one that is stuck. And the only "fix" is to redo all the jobs from scratch. No other backup system needs as much handholding as Acronis.

      Acronis claims to have an excellent recovery environment. I haven't used it, but I am sure it is fantastic when you finally dig up a month-old backup to restore from because Acronis had stopped working.

    2. Re:Acronis by arth1 · · Score: 3, Informative

      In Linux, I would avoid any backup system that doesn't support hard links, long paths, file attributes, file access control lists and SElinux contexts.
      Some of the "offerings" out there are so Windows-centric that they can't even handle "Makefile" and "makefile" being in the same directory.

      In Windows, I would require that it backs up and restores both the long and the short name of files, if short name support is enabled for the file system (default in Windows). Why? If "Microsoft Office" has a short name of MICROS~3 and it gets restored with MICROS~2, the apps won't work, because of registry entries using the short name.
      I'd also look for one that can back up NTFS streams. Some apps store their registration keys in NTFS streams.

      In all cases, Acronis does not measure up to what I require of a backup program. Also because the restore environment doesn't even work unless you have hardware compatible with its drivers. You may be able to back up, and even boot the restore environment, but not do an actual restore from it.

      ArcServe is better - for Linux, it still lacks support for file attributes and the hardlink handling is rather peculiar during restore, but at least handles SElinux and dedup of the backup.

      An option for dedup on Linux file systems would be nice - the easiest implementation would be COW hardlinks. But like for Microsoft's new NTFS, you'd need something that scans the file system for duplicates. And it better have an attrib for do-not-dedup too, because of how expensive COW can be for large files, or to avoid file fragmentation for specific files.
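
      The scanning step could look something like the sketch below, which groups files by size and then by SHA-256 digest. It only reports candidate groups; replacing them with COW links depends on the file system and is not shown, and the function name is made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Return lists of paths under root whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB pieces so large files are not loaded whole.
                for piece in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(piece)
            by_hash[digest.hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

      Comparing sizes first means most files are never read at all; a do-not-dedup attribute, as suggested above, would simply be one more filter in the first loop.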

    3. Re:Acronis by Anonymous Coward · · Score: 3, Informative

      We check backups and run a test restore on each and every server every month (we had this rule before we started with Acronis).

      Acronis is awful. Frankly, someone should open a fraud investigation. Acronis products have no business being sold as enterprise backup solutions. Fucking ntbackup is far more reliable.

      Right NOW I am wrestling with Acronis Backup & Recovery 11's retarded cousin, Acronis Recovery for Microsoft Exchange. I seriously want to sue the sadistic and incompetent assholes at Acronis for all the pain and suffering they are causing.

    4. Re:Acronis by jd2112 · · Score: 3

      I've never used any backup solution on any platform that wasn't a complete piece of crap. Backing up to /dev/null is much faster and only slightly less reliable for restores.

      --
      Any insufficiently advanced magic is indistinguishable from technology.
    5. Re:Acronis by geoffaus · · Score: 2

      If I had points I would mod you up. Acronis is barely usable when backing up to a NAS, and forget about backing up to tape. At my work we replaced it at all sites with various Symantec products and I can now sleep much better. People may knock Backup Exec, but I've recently restored a SAN full of VMs after HP wiped the metadata from the disks, and it performed perfectly. I believe all backups need constant monitoring and testing; however, BE has saved my bacon, so I'm pretty happy with it.

      --
      As an online discussion grows longer, the probability of a reference to Godwin's Law approaches 1
  4. OpenSolaris but not FreeBSD? by TheRaven64 · · Score: 3, Informative

    ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems, who sell expensive NAS and SAN systems so they have an incentive to keep it improving.

    I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.

    --
    I am TheRaven on Soylent News
    1. Re:OpenSolaris but not FreeBSD? by Anonymous Coward · · Score: 4, Informative

      People considering either dedup or compression on FreeBSD should be made blatantly aware of one of the issues which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters, etc.). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb has a big improvement, but it's still easily noticeable.

      My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.

      Here's confirmation and reading material for those who think my statements are bogus. The problem:

      http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html
      http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html

      And how OpenIndiana/Illumos solved it:

      http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html

    2. Re:OpenSolaris but not FreeBSD? by Guspaz · · Score: 2

      ZFS deduplication works on a block level, so it's not only storing the differences. That would require byte-level deduplication.

      For example, imagine I had a block size of two bytes, and had the following two files:

      02468A
      012468A

      These two pieces of data are identical except for one inserted byte, but deduplication will be forced to store both files in their entirety. Why? Because even though only one byte changed, none of the blocks are identical:

      02 46 8A
      01 24 68 A-

      This is a fairly common case in many kinds of modified files. Imagine trying to deduplicate ISOs where one file in the middle of the ISO changed, or a word processor document where I fixed a typo by adding an extra letter to a word...

      Don't get me wrong, there is still a big savings to be had in the average case with block-level deduplication, but there are limitations.
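
      The two-byte-block example above can be reproduced in a few lines. This is only an illustration of fixed-size block matching in general, not of how ZFS implements it; the helper names are made up.

```python
def split_blocks(data, size=2):
    """Cut data into fixed-size blocks; the last block may be short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_blocks(a, b, size=2):
    """Blocks that appear in both inputs and could therefore be stored once."""
    return set(split_blocks(a, size)) & set(split_blocks(b, size))


original = "02468A"
inserted = "012468A"   # one character inserted near the front
appended = "02468AXY"  # data added at the end instead

print(split_blocks(original))                      # ['02', '46', '8A']
print(split_blocks(inserted))                      # ['01', '24', '68', 'A']
print(shared_blocks(original, inserted))           # set()
print(sorted(shared_blocks(original, appended)))   # ['02', '46', '8A']
```

      An insertion shifts every later block boundary, so nothing matches, while an append leaves the existing blocks aligned and all three are shared.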

    3. Re:OpenSolaris but not FreeBSD? by TheRaven64 · · Score: 4, Interesting

      I've got 8GB of RAM in the machine - RAM is so cheap now that it didn't seem worth skimping. It's a 1.6GHz AMD Fusion system. Over GigE, I was getting 40MB/s writes to the deduplicated filesystem, with the load on one core about 100%.

      ZFS definitely likes RAM. I'm not sure what the minimum requirements are, but the general recommendation is 'as much as you can afford'. I think 8GB of SO-DIMMS for the mini-ITX board cost about £40, and maxed out its memory, so that was a pretty obvious choice. I'm not sure what happens when the deduplication tables don't fit into RAM, whether it degrades performance or degrades deduplication efficiency. Having 8GB means that a lot of the time it can satisfy reads from RAM.

      I'm using it over WiFi 99% of the time, so I'm not too bothered about the performance: it can easily saturate the WiFi link without any problems. The compression ratio is 1.11x. ZFS only shows the deduplication ratio for the entire pool, not for individual filesystems. That's currently 1.06x for my system, but that's with 1.43TB of data in total, only 266GB on the deduplicated filesystem, so that means that it's saving about a third. Roughly speaking, the extra space used by RAID-Z and the space saved by dedup seem to balance each other out, so (on my backup filesystem) I am using 1GB of hard disk space for every 1GB of data, and still have redundancy so one disk out of the three can fail without losing any data.
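
      The "about a third" figure can be checked with a little arithmetic, under two assumptions that the post does not state outright: that 1.43TB is the logical (pre-dedup) size of the pool, and that all of the pool-wide savings come from the 266GB deduplicated filesystem. The function name is hypothetical.

```python
def dedup_savings(pool_logical_gb, pool_ratio, dedup_fs_gb):
    """Return (GB saved, fraction of the deduplicated filesystem that represents)."""
    physical_gb = pool_logical_gb / pool_ratio   # what the pool actually occupies
    saved_gb = pool_logical_gb - physical_gb
    return saved_gb, saved_gb / dedup_fs_gb


saved_gb, fraction = dedup_savings(1430, 1.06, 266)
print(f"saved about {saved_gb:.0f} GB, {fraction:.0%} of the deduplicated filesystem")
```

      Under those assumptions the saving works out to roughly 80GB, or about 30% of the deduplicated filesystem, which is consistent with the estimate above.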

      Time Machine on OS X does clever things like make a new copy of a 10GB file if 1MB of it has changed, and the deduplication on the NAS translates to a huge space saving for that. For things like DV footage, I don't bother with the dedup.

      --
      I am TheRaven on Soylent News
    4. Re:OpenSolaris but not FreeBSD? by TheRaven64 · · Score: 2

      In my case, the biggest files are VM images. Here it is only storing the differences, because changing a few files in the VM only modifies a few blocks of the image but causes Time Machine to create a new backup copy of the entire image. This means that it's backing up 10GB files quite regularly, but I'm only storing a few MBs for each one.

      --
      I am TheRaven on Soylent News
    5. Re:OpenSolaris but not FreeBSD? by Doogie5526 · · Score: 2

      That's why Apple split their monolithic data blobs (for iPhoto, Aperture, etc) into smaller files when Time Machine was released--like Sparse Bundle Disk Images, for example. Since the data is banded across multiple files, when some data is changed, you only need to back up the affected bands. I believe they marketed this as "Time Machine Aware" and advised third-party apps to adopt this approach. I thought Time Machine's approach was clever given the constraints of a traditional file system (in lieu of using something like zfs).

    6. Re:OpenSolaris but not FreeBSD? by ThorGod · · Score: 2

      So for now, let's not use compression on write-heavy volumes, where it adds next to no benefit anyway. Problem solved or at least circumvented, no?

      Yeah, and ZFS is still a major new feature to FreeBSD, regardless. I know people in the FreeBSD forums talk about running ZFS on root, but if you don't have the hardware you shouldn't run ZFS. We're in Unix land here, the OS isn't going to keep you from doing anything...but that doesn't mean you should!

      --
      PS: I don't reply to ACs.
  5. FreeBSD has ZFS by grub · · Score: 3, Informative


    FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
    I use both at home and am happy as a clam.

    --
    Trolling is a art,
    1. Re:FreeBSD has ZFS by grub · · Score: 2, Funny


      Best get your eyes checked, I think that was our cat's anus you were in.
      My wife and I love to sit around and de-dupe all night.

      .

      --
      Trolling is a art,
  6. Lessfs is slow on Atom by Dwedit · · Score: 3, Informative

    I've used LessFS. On my "server" powered by an Intel Atom, it is very very slow. It writes at about 5MB/sec, even when everything is inside a ram disk.
    You can't use a block size of 4KB, otherwise write speeds are around 256KB/sec, need to use at least 64KB.

    1. Re:Lessfs is slow on Atom by TheRaven64 · · Score: 2

      I have a NAS with a 1.6GHz AMD Fusion thingy, which should be in the same speed ballpark as an Atom. It happily got 40MB/s writing to the deduplicated filesystem with FreeBSD ZFS (with a kernel with all of the debug knobs turned on) over GigE. With a release kernel, I'd expect it to be a bit faster, but since I mainly use it over WiFi the bottleneck is generally somewhere else...

      --
      I am TheRaven on Soylent News
    2. Re:Lessfs is slow on Atom by BitZtream · · Score: 3, Informative

      No you didn't.

      You got 40MB writing to memory cache possibly, not the ZFS store.

      I have a quad-disk, 8-core, 8 GB machine that ONLY does ZFS. Sustaining 40MB/s doesn't happen without special tuning, turning off write cache flushing and a whole bunch of other stuff ... unless I stay in memory buffers. Once that 8 GB fills, or the 30-second default timeout for ZFS to flush expires, the machine comes to a standstill while the disks are flushed, and at that point the throughput drops well below 40MB/s, since it is finally putting that data on disk.

      Without compression and dedup, with possibly low-end checksumming, you may be able to write that fast. With compression or checksumming, there's absolutely no way your processor is moving the data fast enough.

      This is a well known and well documented set of issues. If you haven't experienced it, it's only because you really aren't using your NAS under any sort of real work load.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    3. Re:Lessfs is slow on Atom by TheRaven64 · · Score: 3, Interesting

      You got 40MB writing to memory cache possibly, not the ZFS store.

      I did? That's interesting. I copied 500GB of data from an external FireWire disk attached to my laptop to the NAS via a GigE connection, yet the NAS only has 8GB of RAM. That's one hell of a compression algorithm they're using for the RAM cache...

      --
      I am TheRaven on Soylent News
  7. BackupPC by Anonymous Coward · · Score: 5, Interesting

    Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.

  8. Backup or live FS? by Anonymous Coward · · Score: 3, Interesting

    Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.

    If you're looking for a free backup product that supports deduplication, look at backuppc . Powerful and complex, but free. I've used it for years with good results.

  9. dragonflybsd by Anonymous Coward · · Score: 2, Informative

    So you want dragonfly BSD with a hammer filesystem.
    An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.

  10. Re:FreeBSD by TheRaven64 · · Score: 5, Informative

    As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.

    --
    I am TheRaven on Soylent News
  11. No dedup in FreeNAS by Svenne · · Score: 5, Informative

    However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.

    --

    Slagborr
  12. Dedup is just a marketing word.... by Anonymous Coward · · Score: 3, Informative

    it needs incredible amount of memory to operate effectively.
    from my university notes:
    5TB data, average blocksize 64K = 78125000 blocks
    for each block the dedup needs 320 bytes so
    78125000 x 320 byte = 25 GB dedup table

    use compression instead. (eg zfs compression)
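
    The arithmetic in those notes checks out and can be reproduced directly. The 320 bytes per block is the figure quoted above and is taken at face value here; the function name is hypothetical.

```python
def dedup_table_size(data_bytes, block_bytes, entry_bytes=320):
    """Return (number of blocks, bytes of dedup table) for one entry per block."""
    blocks = data_bytes // block_bytes
    return blocks, blocks * entry_bytes


# Decimal units, as in the notes: 5 TB of data in 64 kB blocks.
blocks, table = dedup_table_size(5 * 10**12, 64 * 10**3)
print(blocks, table / 10**9)           # 78125000 25.0

# Binary units give the same headline figure: 5 TiB in 64 KiB blocks.
blocks_bin, table_bin = dedup_table_size(5 * 2**40, 64 * 2**10)
print(blocks_bin, table_bin / 2**30)   # 83886080 25.0
```

    The table grows linearly with the number of blocks, so halving the average block size doubles the memory needed for the same amount of data.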

    1. Re:Dedup is just a marketing word.... by m.dillon · · Score: 5, Informative

      All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vice versa.

      Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data at least once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32 bit CRC for each block. You then [re]read the blocks whose CRCs match to determine if the CRC found a matching block or simply had a collision.

      The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.

      To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.

      The method DragonFly's HAMMER uses is to allocate a fixed-sized memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= to that value) to make room. Once the pass is over another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.

      Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.

      This can be done with any sized CRC but what you cannot do is avoid the verification pass: no matter how big your CRC is or your SHA-256 or whatever, you still have to physically verify that the duplicate blocks are, in fact, exact duplicates, before you de-dup their block references. A larger CRC is preferable to reduce collisions but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.

      In any case, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.
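
      The multi-pass scheme described above can be sketched as follows. This is a simplified in-memory model written for illustration, not HAMMER's code: the function name is made up, blocks are plain byte strings, zlib.crc32 stands in for the CRCs already stored in meta-data, and the memory bound counts distinct CRCs only.

```python
import zlib
from collections import defaultdict


def find_duplicates(blocks, max_entries):
    """Return ({dup_index: first_identical_index}, number_of_passes)."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    crcs = [zlib.crc32(b) for b in blocks]  # stand-in for CRCs kept in meta-data
    dups = {}
    passes = 0
    lo = 0
    while lo < 0x100000000:
        passes += 1
        hi = 0x100000000             # exclusive upper bound of the range being recorded
        table = defaultdict(list)    # crc -> indices of blocks with that crc
        for idx, crc in enumerate(crcs):
            if not (lo <= crc < hi):
                continue
            table[crc].append(idx)
            while len(table) > max_entries:
                hi = max(table)      # table full: drop the highest CRC and
                del table[hi]        # stop recording anything >= it
        for indices in table.values():
            distinct = []            # verified-distinct blocks sharing this CRC
            for i in indices:
                for first in distinct:
                    if blocks[i] == blocks[first]:  # verification: compare real data
                        dups[i] = first
                        break
                else:
                    distinct.append(i)
        lo = hi                      # the next pass starts where this one stopped
    return dups, passes
```

      A small table finds exactly the same duplicates as a large one; it just takes more passes over the CRC list, which is the memory versus I/O trade-off described at the top of this comment.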

      -Matt

    2. Re:Dedup is just a marketing word.... by Guspaz · · Score: 2

      HAMMER's online deduplication doesn't need an incredible amount of memory because it will only match duplicate blocks whose checksums it has cached. If the checksum is not cached, HAMMER won't deduplicate the block even if a duplicate exists on disk, because it won't know the duplicate exists. That's my interpretation of the coder's explanation, anyhow ( http://leaf.dragonflybsd.org/mailarchive/users/2011-04/msg00044.html ).

      HAMMER's offline deduplication doesn't have this limitation because it can re-iterate as many times as is required to compare each checksum, but that's not helpful if you want (or require) online deduplication.

    3. Re:Dedup is just a marketing word.... by m.dillon · · Score: 3, Informative

      Yes, this is correct.

      For on-line de-duplication the most optimal case in my view is to only de-dup data which may already be present in the buffer cache from prior recent operations, so the on-line dedup only maintains a small in-kernel-memory table of recent CRCs. This catches common operations such as file and directory tree copying fairly nicely.
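
      The small table of recent CRCs can be sketched like this in Python. It is only an illustration under assumptions: the class name OnlineDedupStore, the cache_entries parameter and the list standing in for the disk are invented for the example and say nothing about how the kernel code is actually structured.

```python
import zlib
from collections import OrderedDict


class OnlineDedupStore:
    """Write path that de-dups only against recently seen blocks."""

    def __init__(self, cache_entries):
        self.cache_entries = cache_entries
        self.blocks = []             # stand-in for blocks written to disk
        self.recent = OrderedDict()  # crc -> block index, oldest first

    def write(self, data):
        """Store data and return the index of the block that holds it."""
        crc = zlib.crc32(data)
        idx = self.recent.get(crc)
        # A CRC hit is only a hint; compare the bytes before sharing.
        if idx is not None and self.blocks[idx] == data:
            self.recent.move_to_end(crc)
            return idx
        self.blocks.append(data)
        idx = len(self.blocks) - 1
        self.recent[crc] = idx
        self.recent.move_to_end(crc)
        if len(self.recent) > self.cache_entries:
            self.recent.popitem(last=False)  # forget the oldest CRC
        return idx
```

      A block written twice in a row is stored once. Once its CRC has aged out of the table, a later identical write is stored again, and that is the case the off-line pass is left to catch.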

      The off-line dedup catches everything using a fixed amount of memory and multiple passes (if necessary) on the meta-data, then bulk data reads only for those blocks which appear to be duplicates to verify that they are exact copies.

      I've run dedup on a 2TB backup from a VM with as little as 192MB of ram and it works. A preferable setup would be to have a bit more memory, like a gigabyte, but more importantly to have a SSD large enough to cache the filesystem meta-data. A 40G SSD is usually enough for a 2TB filesystem. That makes the off-line dedup quite optimal and also speeds up other maintenance and administrative operations on the large filesystem, such as du, find, ls -lR, cpdup, even a smart diff... let alone rsync or other things one might want to run. It makes all of that go screaming fast without having to waste money buying a bigger system or waste money on excessive energy use.

      -Matt

    4. Re:Dedup is just a marketing word.... by m.dillon · · Score: 4, Informative

      Well, I can tell you why the option is there... it's not because of collisions, it's there to handle the case where there is a huge amount of actual duplication where the blocks would verify as perfect matches. In this case the de-duplication pass winds up having to read a lot of bulk-data to validate that the matches are, in fact, perfect, which can take a lot of time versus only having to read the meta-data.

      Just on principle I think it's a bad idea to just trust a checksum, cryptographic hash, CRC, or whatever. Corruption is always an issue... even if the filesystem code itself is perfect and even if the disk subsystem is perfect there is so much code running in a single address space (i.e. the KERNEL itself) that it is possible to corrupt a filesystem just from hitting unrelated bugs in the kernel.

      Not to mention radiation flipping a bit somewhere in the cpu or memory (even for ECC memory it is possible to get corruption, but the more likely case is in the billions of transistors making up a modern cpu, even with parity on the L1/L2/L3 caches).

      Hell, I don't even trust IP's stupid simple 1's complement checksum in HAMMER's mirroring protocols. Once during my BEST Internet days we had a T3 which bugged out certain bit patterns in a way that actually got past the IP checksum... we only tracked it down because SSH caught it in its stream and screamed bloody murder.

      If you de-duplicate trusting the meta-data hash, even a big one, what you can end up doing is turning 9 good and 1 corrupted copies of a file into 10 de-duped corrupted copies of the file.

      I'm sure there are many data stores that just won't care if that happens every once in a while. Google's crawlers probably wouldn't care at all, so there is definitely a use for unverified checks like this. I don't plan on using a cryptographic hash as large as the one ZFS uses any time soon, but being able to optimally de-dup with 99.9999999999% accuracy is a reasonable argument to have one that big.

      -Matt

    5. Re:Dedup is just a marketing word.... by Guspaz · · Score: 2

      It does sound like a decent tradeoff; ZFS deduplication, I can say from personal experience with my home file server that has a deduplicated filesystem used for incremental backups, is painfully slow when you don't have enough RAM. So the idea that you do a best-effort deduplication on write and a best-quality deduplication offline each night would be a big improvement.

      What sort of timespans are involved in doing offline deduplications when multiple passes are required? Is it the kind of thing you can do every night on a large deduplicated data set?

    6. Re:Dedup is just a marketing word.... by m.dillon · · Score: 3, Informative

      For our production systems it depends 100% on the actual amount of duplicated data, since bulk data reads are needed to verify the duplication. The number of passes is almost irrelevant because they primarily scan meta-data N times, not bulk data (duplicated bulk data only has to be verified once).

      The meta-data can be scanned much more quickly than the verification of duplicated bulk data because the meta-data is laid out on the physical disk fairly optimally for the B-Tree scan the de-dup code issues. So meta-data can be read from the hard disk at 40 MBytes/sec even without the use of a SSD to cache it. Of course, with DFly's swapcache and the meta-data cached on the SSD that scan runs at 200-300 MBytes/sec.

      But in contrast, the bulk reads used to validate the duplicate data just aren't going to be laid out linearly on the disk. There's a lot of skipping around... so the more actual duplicate data we have the larger the percentage of the disk's surface we have to read to verify it.

      This is an area which I could further optimize in HAMMER's dedup code. Currently I do not sort the bulk data block numbers when running the data verification pass. Not only that but I am scanning a sorted CRC list, so the bulk data offsets are going to be seriously unsorted. Sorting them would definitely improve performance, probably quite a bit, but it would still not be anywhere near the 40 MBytes/sec the meta-data scan can achieve off the platter. It would not be a whole lot of programming, probably a day to do that. It currently isn't at the top of my list though.
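
      To illustrate why the read order matters, here is a toy Python comparison of head travel when candidate blocks are read in CRC order versus offset order. The offsets and CRC values are made up, and seek_distance is a hypothetical helper that treats head travel as the sum of jumps between consecutive offsets, which is a crude model of a real disk.

```python
def seek_distance(offsets):
    """Total head travel if blocks are read in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


# (crc, disk_offset) pairs for blocks whose CRCs matched; values are made up.
candidates = [(0x1A2B, 9_000), (0x1A2B, 120), (0x3C4D, 7_500),
              (0x3C4D, 300), (0x5E6F, 8_200), (0x5E6F, 40)]

# Reading in CRC order bounces between the two ends of the disk.
crc_order = [off for _crc, off in sorted(candidates)]

# Reading in offset order sweeps the disk once.
offset_order = sorted(off for _crc, off in candidates)
```

      In this toy data the CRC-ordered reads travel several times farther than the offset-ordered ones. The offset-ordered sweep is the shortest possible path over the same blocks, though the blocks are still scattered rather than contiguous.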

      What this means, in summary (and even with semi-sorting of the bulk data blocks), is that one can use a bounded amount of ram without really affecting the efficiency of the off-line de-duplication.

      -Matt

  13. What is deduplication? by jdavidb · · Score: 5, Informative

    I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication

    Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

    1. Re:What is deduplication? by BitZtream · · Score: 2, Interesting

      Seriously, at this point on slashdot it's been talked about enough that unless you bought your UID from someone, you should be fully aware of what it is from here alone.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  14. write a script by MichaelSmith · · Score: 3, Interesting

    find . -type f -exec md5sum {} + | sort
    ...and so on
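
    One way the rest of such a script might go, sketched in Python rather than shell. This is an assumption about what "and so on" could mean, not the poster's script: find_duplicate_files is a hypothetical name, and SHA-256 is swapped in for MD5. It only reports groups of identical files and does not link or delete anything.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Return sorted groups of paths under root with identical content."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB pieces so large files do not fill memory.
                for piece in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(piece)
            by_digest[digest.hexdigest()].append(path)
    return sorted(sorted(paths) for paths in by_digest.values()
                  if len(paths) > 1)
```

    Calling find_duplicate_files(".") returns a list of path groups. What to do with each group (hard-link, delete, report) is left open, since hard-linking changes what happens when one copy is later modified.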

  15. Re:Just store everything in /dev/null by shish · · Score: 4, Funny
    $ cat todo.txt > /dev/null
    $ md5sum /dev/null
    d41d8cd98f00b204e9800998ecf8427e /dev/null
    $ cat aaah.png > /dev/null
    $ md5sum /dev/null
    d41d8cd98f00b204e9800998ecf8427e /dev/null
    -

    Duplicates!

    --
    I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
  16. Use rsync! by TheTempest · · Score: 2

    I just use rsync from the command line to do deduplication. Been working like a charm for years.

    First I sync from the remote directory to a local base directory:

    rsync --partial -z -vlhprtogH --delete root@www.mydomain.net:/etc/ /backup/server/www/etc/base/

    Then I sync that to the daily backup. Files that have not changed are hard-linked between all the days that share them. It is very efficient and simple, and retrieving files is as simple as doing a directory search.

    rsync -vlhprtogH --delete --link-dest=/backup/server/www/etc/base/ /backup/server/www/etc/base/ /backup/server/www/etc/2012-01-04
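
    The hard-linking idea behind that second command can be sketched in Python. This only loosely mimics what --link-dest does and is not how rsync is implemented: snapshot and unchanged are hypothetical helpers, "unchanged" is approximated by size and whole-second modification time, and ownership, permissions and deletions are ignored.

```python
import os
import shutil


def snapshot(src, dest, prev=None):
    """Copy src into dest, hard-linking files unchanged since prev."""
    for dirpath, _dirs, files in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        out_dir = os.path.normpath(os.path.join(dest, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in files:
            source = os.path.join(dirpath, name)
            target = os.path.join(out_dir, name)
            old = None
            if prev is not None:
                old = os.path.normpath(os.path.join(prev, rel, name))
            if old and os.path.isfile(old) and unchanged(source, old):
                os.link(old, target)          # share the existing inode
            else:
                shutil.copy2(source, target)  # new or changed: real copy


def unchanged(a, b):
    """Cheap check on size and modification time, not on content."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)
```

    Each dated directory then looks like a full copy, but a file that did not change between runs occupies disk space only once. Hard links require the snapshots to live on the same filesystem.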

    --
    -Dave
  17. BackupPC by danpritts · · Score: 3, Informative

    backuppc is backup software that does file-level deduplication via hard links on its backup store. Despite the name's suggestion that it is for backing up (presumably windows) PCs, it's great with *nix.

    http://backuppc.sourceforge.net/

    Its primary disadvantage is the logical consequence of all those hard links. Duplicating the backup store, so you can send it offsite, is basically impossible with filesystem-level tools. You have to copy the entire filesystem to the offsite media, typically with dd.

    It also can make your life difficult if you're trying to restore a lot of data all at once, like after a disaster. You take your offsite disks that you've copied with dd, hook them up, and start to run restores.

    The hard links mean lots and lots of disk head seeks, so you are doing random i/o on your restore. This is really slow. If I ever have to do this, my plan is to buy a bunch of SSDs to copy my backup onto. Since there are no seeks on SSDs it will be much faster.

  18. Re:Why don't you just hire a competent sysadmin? by swb · · Score: 2

    I think file-level de-dupe is usually a lot less effective because it can't accommodate files that differ only slightly but are otherwise the same, whereas block-level de-dupe works with everything.

    I also don't know what happens in your scheme when you have "de-duped" a file that's the same in 4 different directories but then one application wants to change "its" version of the file. It sounds like it trashes the file for the three other uses of it since there's no way to automate copy-on-write with your shell script but maybe my clue isn't working.

  19. Easy, use OpenIndiana or NexentaStor by Zemplar · · Score: 5, Informative

    Yep, put a nail in OpenSolaris' coffin. Instead, I use and recommend OpenIndiana and NexentaStor (or Nexenta's community edition if you prefer).

  20. Squashfs creates deduped and compressed archives by ploppy · · Score: 2

    Try Squashfs which creates deduplicated and compressed filesystem archives (http://www.linux-mag.com/id/7357/ for a good journal article).

    If you're using Ubuntu, Debian, or Fedora, Squashfs will already be built into your distro kernel, and the squashfs-tools will also already be available in your distro repository.

  21. Re:Why don't you just hire a competent sysadmin? by Jappus · · Score: 3, Informative

    Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?

    Yes, but in that case, two multi-GB files that share all of their data except one bit will not be deduplicated. The difference between your approach and Microsoft's is grounded in the same thought process that makes modern compression algorithms better than older ones:

    First you treat all files separately, which is really simple but has the drawback of not cross-linking chances for compression/dedup across files. This is what deflate (ZIP/GZIP) and your approach to dedup do. The same data simply gets recompressed twice -- or in your case not deduplicated if the data is even marginally different. You will never reach the maximum space-saving that way, even though you can at least be sure to be reasonably fast.

    Then you notice that files sharing most data should only be compressed/deduped once and then just linked together. The easiest way to do that is to cut up the files into blocks and compress them. If two blocks are the same, you don't recompress them but just put in a link to the previous compression. This is what (roughly speaking) BZIP2, RAR, ACE and some other formats do. In deduping terms this means creating multi-level hashes for each file. It works much better, but has the price of being more complex and time consuming.

    Finally, you notice that cutting up files at fixed boundaries is also wasteful. If two blocks are the same, but one has all bytes shifted left one position, you needlessly waste space. Thus, you try to identify if you can dynamically cut up the files/stream into chunks that you have already compressed, plus a handful of spare bytes here or there or with a very simple substitution/transposition function applied. This is (extremely roughly speaking) what LZMA of 7-Zip fame does and what Microsoft tries to do differently in their dedup approach.
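
    The difference between fixed boundaries and content-dependent cut points can be sketched in Python. This is a toy under stated assumptions, not any product's algorithm: WINDOW, MASK and the function names are invented for the example, and it recomputes zlib.crc32 over each window instead of using a true rolling hash, which a real implementation would want for speed.

```python
import random
import zlib

WINDOW = 16  # bytes hashed to decide on a cut (assumed value)
MASK = 0xFF  # cut when the low 8 hash bits are zero, ~256-byte chunks


def fixed_chunks(data, size=256):
    """Cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data):
    """Cut after any position whose trailing WINDOW bytes hash to zero bits."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"X" + original  # same data, pushed one byte to the right

fixed_shared = set(fixed_chunks(original)) & set(fixed_chunks(shifted))
content_shared = set(content_chunks(original)) & set(content_chunks(shifted))
```

    Because each cut depends only on the bytes just before it, the content-defined chunks line up again after the first boundary, so nearly all of them are shared between the two inputs. The fixed-size chunks of the shifted copy match essentially none of the original's.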

    Of course, going that way is even MORE complex and time consuming, but may be well worth it, if space-saving is what you're interested in. After all, there is no such thing as a free lunch -- you either pay with time or with space (or with general applicability in some corner cases).

    So, all in all, the approach itself is not new -- neither yours nor Microsoft's -- but the magic lies in actually creating a working product out of the theoretical approach outlined above.

  22. Re:deduplication is just compression by afidel · · Score: 2

    Since most file servers have about 95% unused processor cycles and a limited amount of disk I/O, both compression and dedupe can be significant wins provided they don't create an I/O profile that is a smaller percentage more random than their effective compression (i.e. if they add 10% randomness to the I/O profile but provide 30% compression then it's probably a net win). The fact that they potentially increase cache effectiveness is just gravy since cache is a few orders of magnitude faster than spinning disk and at least an order of magnitude faster than even SSDs.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  23. Re:Roll your own? by nullchar · · Score: 2

    Hard links are awesome, but they're limited to a per-file basis. SDFS and other block-level de-dupers will only store unique blocks. E.g. storing multiple virtual machine images -- as each image is one huge file, hard links do nothing.

  24. LVM magic by Urban+Garlic · · Score: 2

    There's a form of deduplication supported by the Linux kernel, if you use the logical volume manager. If you create a base LVM device, and then create a snapshot of that device, the snapshot only requires sufficient real estate on the host physical volume to store the diffs between the snapshot and the base. You can use this for "freezing" a file system to do back-ups, or for incremental back-ups, or whatever.

    My rather limited experience with this is that, if you have more than a few snapshots on a base device, your write performance degrades very rapidly. There's also a hard limit of 255 snapshots per device.

    You can also do file-based deduplication with the "rsnapshot" tool, which has been available for many years.

    Also also, I haven't kept up, but I seem to recall that ZFS for linux was promising this as a major selling point.

    --
    2*3*3*3*3*11*251
  25. OpenBSD's Epitome by funkboy · · Score: 2

    OpenBSD has had the Epitome deduplication framework for some time. I believe version 2 is considered production-ready.

  26. FreeBSD + ZFS by smash · · Score: 2

    If you don't think OpenSolaris has a future, fair enough. FreeBSD does. FreeBSD currently supports ZFS v28, which has dedup. Be aware you need plenty of RAM.

    --
    I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.