Slashdot Mirror


Data Deduplication Comparative Review

snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."

37 of 195 comments

  1. Second post by Anonymous Coward · · Score: 2, Funny

    Same as the first.

  2. Wrong layer by Hatta · · Score: 4, Insightful

    Filesystems should be doing this.

    --
    Give me Classic Slashdot or give me death!
    1. Re:Wrong layer by bersl2 · · Score: 2, Interesting

      No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

    2. Re:Wrong layer by PCM2 · · Score: 2, Interesting

      The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

      But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion.

      --
      Breakfast served all day!
    3. Re:Wrong layer by KiloByte · · Score: 2, Informative

      It's not fully automatic, I assume? Since that would cause a major slowdown.

      For manual dedupes, btrfs can do that as well, and a part of vserver patchset (not related to the main functionality) includes a hack that works for most Unix filesystems.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    4. Re:Wrong layer by dougmc · · Score: 2, Interesting

      But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion

      This isn't a new thing -- it's a tried and true backup strategy, and it's quite effective at making your backup tapes go further. It increases the complexity of the backup setup, but it's mostly transparent to the user beyond the saved space.

      As for doing it at the file level rather than the block level, yes, that makes sense, but the block level does too. Think of a massive database file where only a few rows in a table changed, or a massive log file that only had some stuff appended to the end.
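
      The appended-log case can be sketched in a few lines of Python. This is only an illustration under assumptions: the 4096-byte block size, the helper names, and the synthetic log contents are made up here and do not describe any particular backup product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return a SHA-256 digest per block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(old: bytes, new: bytes) -> int:
    """Count blocks in `new` whose content is not already present in `old`."""
    seen = set(block_hashes(old))
    return sum(1 for h in block_hashes(new) if h not in seen)


# Hypothetical log: 100 distinct full blocks, then one more block appended later.
monday = b"".join(f"{i:07d}\n".encode() * 512 for i in range(100))
tuesday = monday + b"x" * BLOCK_SIZE

# Only the appended block is new; a whole-file comparison would see a changed file.
print(new_blocks(monday, tuesday), "new block(s) out of", len(block_hashes(tuesday)))
```

      A file-level scheme would treat tuesday's log as an entirely new file, while the block-level comparison only has to store the tail.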

    5. Re:Wrong layer by phantomcircuit · · Score: 4, Informative

      It is fully automatic and it's not that much of a slowdown. The reduced IO might actually provide a performance boost.

    6. Re:Wrong layer by suutar · · Score: 5, Informative

      Actually, it is automatic. ZFS already assumes you have a multithreaded OS running on more cpu than you probably need (e.g. Solaris), so it's already doing checksums (up to and including SHA256) for each data block in the filesystem. Comparing checksums (and optionally entire datablocks) to determine what blocks are duplicates isn't that much extra work at that point, although for deduplication you probably want to use a beefier checksum than you might choose otherwise, so there is some increase in work. http://blogs.sun.com/bonwick/entry/zfs_dedup has some more information on it. Getting it onto my linux box, now.. there's the rub. userspace ZFS exists, but I've only seen one pointer to a patch for it that includes dedup, and I haven't heard any stability reports on it yet.
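
      As a rough sketch of the idea (not ZFS's actual implementation), a checksum-indexed block store might look like the following. The class name, the in-memory dictionaries, and the `verify` flag are assumptions for illustration; `verify` stands in for the optional full-block comparison mentioned above.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that keeps one copy of each distinct block."""

    def __init__(self, verify: bool = True):
        self.blocks: dict[bytes, bytes] = {}   # digest -> block contents
        self.refcount: dict[bytes, int] = {}   # digest -> number of references
        self.verify = verify

    def write(self, block: bytes) -> bytes:
        """Store a block (or reference an existing copy) and return its digest."""
        digest = hashlib.sha256(block).digest()
        existing = self.blocks.get(digest)
        if existing is None:
            self.blocks[digest] = block
            self.refcount[digest] = 1
        else:
            # Optional byte-for-byte check guards against checksum collisions.
            if self.verify and existing != block:
                raise RuntimeError("checksum collision: blocks differ")
            self.refcount[digest] += 1
        return digest

    def read(self, digest: bytes) -> bytes:
        return self.blocks[digest]


store = DedupStore()
refs = [store.write(b) for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096)]
print(len(store.blocks), "blocks stored for", len(refs), "writes")
```

      Since the checksum is already computed on every write, the extra cost is mostly the index lookup, which is why the choice of checksum strength matters more than the comparison itself.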

    7. Re:Wrong layer by hoggoth · · Score: 2, Interesting

      > Getting it onto my linux box, now.. there's the rub

      So don't put it on Linux. Set up a Solaris or Nexenta box. I just did it. I installed a Nexenta server with 1TB of mirrored, checksummed storage in 15 minutes. I wrote it up here http://petertheobald.blogspot.com/ - it was extremely easy. Now all of my computers back up to the Nexenta server. All of my media is on it. I have daily snapshots of everything at almost no cost in disk storage.

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    8. Re:Wrong layer by dgatwood · · Score: 3, Interesting

      I think it depends on which scheme you're talking about.

      Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).

      De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.
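
      One common way to de-duplicate stream fragments is content-defined chunking, where cut points depend on the bytes themselves rather than on fixed offsets. The sketch below is a simplified illustration, not any vendor's algorithm: the window size, the mask, and the use of `zlib.crc32` as the window hash are all assumptions, and a real implementation would use a rolling hash instead of rehashing the window at every position.

```python
import random
import zlib

WINDOW = 16   # bytes of context that decide a cut point (assumed value)
MASK = 0xFF   # cut when the low 8 hash bits are zero, ~256-byte average chunks


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data where a hash of the trailing WINDOW bytes hits a fixed pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 256) -> list[bytes]:
    """Split data into fixed-size blocks, as a block-level scheme would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Number of distinct chunks that appear in both lists."""
    return len(set(a) & set(b))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(65536))
shifted = b"X" + original   # same content with one byte inserted at the front

print("fixed blocks shared:", shared(fixed_chunks(original), fixed_chunks(shifted)))
print("content-defined chunks shared:", shared(cdc_chunks(original), cdc_chunks(shifted)))
```

      Inserting a single byte shifts every fixed block, so none of them match, while the content-defined cut points resynchronize after the first chunk.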

      What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.

      Does that answer the question?

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    9. Re:Wrong layer by hoytak · · Score: 2, Informative

      The latest stable version of zfs-fuse, 0.6.9, includes pool version 23 which has dedup support. Haven't tried it out yet, though.

      http://zfs-fuse.net/releases/0.6.9

      --
      Does having a witty signature really indicate normality?
    10. Re:Wrong layer by h4rr4r · · Score: 2, Insightful

      OpenSolaris is dead, and there are kernel bugs in the latest version, so good luck with that. I looked at doing it at one time and due to fears about OpenSolaris I stayed away. I consider myself lucky.

    11. Re:Wrong layer by drsmithy · · Score: 2, Insightful

      Sweet, thanks for the pointer. I was also concerned about the death of OpenSolaris but it sounds like Nexenta may be just what I want.

      Nexenta is built off OpenSolaris and is, therefore, also dead - though it may take longer for the thrashing to stop.

    12. Re:Wrong layer by drsmithy · · Score: 3, Insightful

      Filesystems should be doing this.

      No, block devices should be doing this. Then you get the benefits regardless of which filesystem you want to layer on top.

    13. Re:Wrong layer by drsmithy · · Score: 2, Interesting

      Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere. They eliminate duplicate blocks and empty space. You don't just get compression from nowhere.

      No, because compression is limited to a single dataset. Deduplication can act across multiple datasets (assuming they're all on the same underlying block device).

      Consider an example with 4 identical files of 10MB in 4 different locations on a drive, that can be compressed at 50%.

      "Logical" space used is 40MB.
      Using compression, they will fit into 20MB.
      Using dedupe, they will fit somewhere in between 5MB and 10MB.
      Using dedupe and compression, they will fit into ~5MB (probably a bit less).
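
      The arithmetic behind those figures, as a small sketch. The variable names are made up here, and the dedupe-only figure is treated as an upper bound (one full copy kept); it can be lower if blocks also repeat inside the file.

```python
FILE_MB = 10
COPIES = 4
COMPRESSION_RATIO = 0.5  # compressed size / original size, from the example above

logical = FILE_MB * COPIES                          # 4 copies of 10MB
compressed_only = logical * COMPRESSION_RATIO       # every copy compressed separately
dedupe_only = FILE_MB                               # upper bound: one copy kept
dedupe_and_compress = FILE_MB * COMPRESSION_RATIO   # one copy kept, then compressed

print(logical, compressed_only, dedupe_only, dedupe_and_compress)
```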

      It doesn't really negate the need for good housekeeping routines, nor good programming. Do you really want 100 copies of record X, or would one suffice?

      Far better to let the computer do the heavy lifting, than trying to impose partial order on an inherently chaotic situation.

      Not to mention that the three textbook scenarios where dedupe really shines are backups, email and virtual machines, none of which can really be helped by "better housekeeping".

    14. Re:Wrong layer by TheRaven64 · · Score: 2, Informative

      Nexenta is developed by the people behind the Illumos Foundation, who have created a 'spork' of OpenSolaris, which will continue to import code from each of the source dumps that Oracle has said they will do after each Solaris release, will fix bugs, and will replace the binary-only components of OpenSolaris with open ones.

      --
      I am TheRaven on Soylent News
    15. Re:Wrong layer by StikyPad · · Score: 2, Funny

      Sounds like what we need is a giant table of all possible byte values up to 2^n length, then we can just provide the index to this master table instead of the data itself. I call this proposal the storage-storage tradeoff where, in exchange for requiring large amounts of storage, we require even more storage. I'll even throw in the extra time requirements for free.

  3. Don't forget to weigh in the cost by leathered · · Score: 2, Informative

    The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe, however we could have had 3 times conventional storage for the same cost.

    NetApp is neat and all but horribly overpriced.

    --
    For all intensive porpoises your a bunch of rediculous loosers
    1. Re:Don't forget to weigh in the cost by hardburn · · Score: 2, Informative

      Was it near the end of the fiscal year? Good department managers know that if they use up their full budget, then it's harder to argue for a budget cut next year. Managers will sometimes blow any excess funds at the end of the year on things like this for that very reason.

      --
      Not a typewriter
    2. Re:Don't forget to weigh in the cost by h4rr4r · · Score: 3, Insightful

      More disk is still so much cheaper it really cannot be justified on that front. More disks also mean more IOPS, so reducing sinning platters can be a bad thing.

      There are some reasons to go for it, but even with thousands of clients it may or may not be suitable for what you are doing.

    3. Re:Don't forget to weigh in the cost by zooblethorpe · · Score: 2, Funny

      ...so reducing sinning platters can be a bad thing.

      Satan, is that you?

      Cheers,

      --
      "What in the name of Fats Waller is that?"
      "A four-foot prune."
    4. Re:Don't forget to weigh in the cost by Krahar · · Score: 3, Insightful

      Sinning platters cause original spin.

    5. Re:Don't forget to weigh in the cost by TheRaven64 · · Score: 2, Insightful

      No, good department managers don't know that. Department managers in companies with bad senior management know that. Companies with competent senior management are willing to increase the budgets for departments that have shown that they are fiscally responsible, and cut the budgets or fire the department heads of others.

      --
      I am TheRaven on Soylent News
  4. Not enough products by ischorr · · Score: 2, Interesting

    Odd that if they reviewed this class of products they didn't review the most common deduping NAS/SAN appliance - the EMC NS-series (particularly NS20).

  5. Which filesystem should be doing this??? by DanDD · · Score: 2, Insightful

    Filesystems should be doing this.

    The one on your desktop machine, or the primary NAS storage that you access shared data from, or the backup server that ends up getting it all anyway? You see, this is a shared database problem. If your local filesystem does this, then it has to 'share' knowledge of all the unique blocklets with every other server/filesystem that wishes to share in this compressed file space. De-duplication is a means of compression that works across many filesystems - or at least it can be, if it is properly implemented.

    --
    "Every time I see an adult on a bicycle, I no longer despair for the future of the human race." - H. G. Wells
  6. Use ZFS. It offers dedupe, compression, etc. by jgreco · · Score: 3, Informative

    ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.

  7. This is new? by Angst+Badger · · Score: 2, Interesting

    Didn't Plan 9's filesystem combine journaling and block-level de-duplication years ago?

    --
    Proud member of the Weirdo-American community.
  8. Re:Um.. by cetialphav · · Score: 2, Informative

    AFAIK this is pretty much how every compression algorithm works. No need to give it a fancy name.

    The reason it has a different name is to distinguish this from a compressed file system. The blocks of data are not compressed in these systems. Imagine that you have a file system that stores lots of vmware images. In this system, there are lots of files that store the same information because the underlying data is OS system files and applications. Even if you compress each image, you will still have lots of blocks that have duplicate values.

    Deduplication says that the file system recognizes and eliminates duplicate blocks across the entire file system. If a given block has redundant data within it, that redundancy is not removed because the blocks themselves are not actually compressed. This is the difference between a compressed file system and a deduplicated file system. In fact, there is no reason that you could not combine both of these methods into a single system.
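
    A small sketch of that difference, under assumptions: the "VM images" here are lists of pseudo-random 4096-byte blocks standing in for real data, and `zlib` stands in for whatever compressor a compressed file system would use.

```python
import random
import zlib

BLOCK = 4096
rng = random.Random(1)


def random_blocks(n: int) -> list[bytes]:
    """n blocks of incompressible pseudo-random bytes (stand-in for real data)."""
    return [bytes(rng.getrandbits(8) for _ in range(BLOCK)) for _ in range(n)]


os_blocks = random_blocks(32)             # shared "OS files" present in both images
image_a = os_blocks + random_blocks(4)    # plus data unique to VM A
image_b = os_blocks + random_blocks(4)    # plus data unique to VM B

# Per-image compression cannot see that image_b repeats image_a's OS blocks.
compressed_total = sum(len(zlib.compress(b"".join(img))) for img in (image_a, image_b))

# Block-level dedup across both images stores each distinct block once.
deduped_total = len(set(image_a + image_b)) * BLOCK

print("compressed separately:", compressed_total, "bytes")
print("deduplicated:", deduped_total, "bytes")
```

    With real OS images the blocks would also compress, which is where combining both methods pays off.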

  9. Re:Use ZFS. It offers dedupe, compression, etc. by lisany · · Score: 2, Informative

    Except NexentaStor (3.0.3) has an OpenSolaris upstream (which has gone away, by the way) kernel bug that hung our Nexenta test box. Not a real good first impression.

  10. Ya it is by Sycraft-fu · · Score: 3, Insightful

    Something you start to appreciate when you are called on to do a really high availability, high reliability system is to have features like this. For one thing it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high end support companies, their shipping time is fairly late so even late in the day it is next day service. What arrives is the drive you need, in its caddy, ready to go.

    Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.

    You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya a disk died and before we got a new one in another died so sorry, stuff is gone."

    Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.

    Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.

    1. Re:Ya it is by h4rr4r · · Score: 2, Insightful

      I mean have the nagios server order the drive without any human intervention.

      Also if it was really critical you would keep several disks ready to go on site. You know for when you can't wait for next day. Also like netapp you too can have many hot spares in the volume.

      If you have problems with people not noticing or reacting to alerts you need to fire them.

    2. Re:Ya it is by Anonymous Coward · · Score: 2, Insightful

      I'll butt in and say that firing people is a piss poor way to fix problems unless you've made very sure that the person in question needs to go. What you do is find out what happened if an alert goes unnoticed and make a change that removes the root cause of that failure. That may be that you have to let go of the guy doing drugs in the corner, but it may also be that your hardware issues alerts in a way that it is easy to miss. You may also realize that perhaps an alert happens only once a year, and in that case you may need to issue spurious alerts to make sure that people know what to do and remain vigilant. The root cause may even be that your staff is completely overworked, and just think where firing someone is going to put you then. Or maybe what you need is to put a siren on the damn thing that will make it impossible to miss even at 3 in the night when the guy at watch falls asleep because he's been pulling all-nighters to keep your company in business. Firing someone just because a fuck-up happened is sometimes a very bad response.

    3. Re:Ya it is by totally+bogus+dude · · Score: 2, Insightful

      Developing a monitoring system for a complicated piece of storage that reacts properly to every possible failure mode is a massive undertaking. It will take a lot of time just to figure out everything that you need to monitor, and the possible values for them during normal operation; let alone actually test that your system correctly detects and responds to every possibility.

      If your business is providing SAN management/support services, then I can see this as being worthwhile. It's a massive investment in technology and skills amongst your staff, but if that's what you make your money doing, it may well give you a competitive edge.

      But if your business is anything else, why are you going to invest so much into something that's really just a background piece of infrastructure? What's your plan for retaining the staff that know how the monitoring system works, and know your storage system in sufficient detail to be able to understand all the things it's checking, etc?

      If you really have the expertise on-hand to implement such a thing in a way that you're comfortable relying on, why on earth wouldn't you use them for something more productive that will actually make your business money? Again, if your business is monitoring storage infrastructure, it makes sense. If your business is anything else, why are you spending the time of highly skilled people to implement something you can easily buy off-the-shelf (i.e. a standard support contract)?

  11. Re:De-Dupe on Linux? by suutar · · Score: 2, Informative

    There's a few. I've read there's a patchset for ZFS on FUSE that can do deduplication; there's also opendedup and lessfs. The problem is that none of these has been around long enough to be considered bulletproof yet, and for a filesystem whose job is to play fast and loose with file contents in the name of space savings, that's kinda worrisome.

  12. Re:Um.. by igny · · Score: 2, Funny

    Yeah! To fight dupes I compute CRC checksum for each file and store it (and only it) on my back up drive. That method removes dupes almost automatically and there is a side effect of a huge compression ratio too. I have been downloading the high def videos from Internet for quite a while now and with my compression method I have used less than 10 percent of 1GB flash drive! I strongly recommend this method to everyone!

    --
    In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
  13. Re:Nor do they give proper mention to Quantum DXi by immortalpob · · Score: 2, Interesting

    You are missing his point. On a non-deduplicated system if one block goes bad you lose one file, on a deduplicated system you can lose any number of files due to one bad block. It gets worse when you consider the panacea of non-backup deduplication, consider all of your servers are VMs and reside on the same deduplicated storage, one bad block can take them ALL DOWN. Now admittedly any dedupe solution will sit on some type of raid, however there is still the possibility of something terrible, and this is made worse by the likelihood of a URE during a raid-5 rebuild.
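
    The blast radius can be pictured with a reverse index from blocks to the files that reference them. The file names and block IDs below are hypothetical, chosen only to illustrate the point.

```python
from collections import defaultdict


def files_by_block(layout: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build a reverse index: block ID -> set of files that reference it."""
    index = defaultdict(set)
    for name, block_ids in layout.items():
        for block_id in block_ids:
            index[block_id].add(name)
    return dict(index)


# Hypothetical layout: three VM images share two deduplicated OS blocks.
layout = {
    "vm1.img": ["os-1", "os-2", "vm1-data"],
    "vm2.img": ["os-1", "os-2", "vm2-data"],
    "vm3.img": ["os-1", "os-2", "vm3-data"],
}
index = files_by_block(layout)

print("bad os-1 affects:", sorted(index["os-1"]))
print("bad vm2-data affects:", sorted(index["vm2-data"]))
```

    A bad shared block touches every image that references it, while a bad unique block touches only one.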

  14. I already do this by MyLongNickName · · Score: 3, Funny

    After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.

    I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.

    I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.

    --
    See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year