Data Deduplication Comparative Review

Posted by samzenpus on Wednesday September 15, 2010 @11:10AM from the a-little-order-please dept.

snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."

Wrong layer by Hatta · 2010-09-15 11:15 · Score: 4, Insightful

Filesystems should be doing this.

--
Give me Classic Slashdot or give me death!
1. Re:Wrong layer by phantomcircuit · 2010-09-15 11:51 · Score: 4, Informative
  
  It is fully automatic and it's not that much of a slowdown. The reduced IO might actually provide a performance boost.
2. Re:Wrong layer by suutar · 2010-09-15 11:52 · Score: 5, Informative
  
  Actually, it is automatic. ZFS already assumes you have a multithreaded OS running on more CPU than you probably need (e.g. Solaris), so it's already doing checksums (up to and including SHA256) for each data block in the filesystem. Comparing checksums (and optionally entire data blocks) to determine which blocks are duplicates isn't that much extra work at that point, although for deduplication you probably want to use a beefier checksum than you might choose otherwise, so there is some increase in work. http://blogs.sun.com/bonwick/entry/zfs_dedup has some more information on it. Getting it onto my Linux box, now... there's the rub. Userspace ZFS exists, but I've only seen one pointer to a patch for it that includes dedup, and I haven't heard any stability reports on it yet.
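  To make the checksum-comparison idea concrete, here is a minimal Python sketch of block-level dedup keyed on SHA-256, with an optional byte-for-byte verify. It is not ZFS's implementation; the class name `DedupStore`, the 4096-byte block size, and the in-memory dictionary are all assumptions made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


class DedupStore:
    """Toy in-memory block store that keeps one copy per unique block."""

    def __init__(self, block_size=BLOCK_SIZE, verify=True):
        self.block_size = block_size
        self.verify = verify   # compare full blocks when checksums match
        self.blocks = {}       # SHA-256 digest -> block bytes
        self.refcounts = {}    # SHA-256 digest -> number of references

    def write(self, data):
        """Store data and return the list of digests needed to rebuild it."""
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).digest()
            existing = self.blocks.get(digest)
            if existing is None:
                self.blocks[digest] = block  # first time this block is seen
            elif self.verify and existing != block:
                raise RuntimeError("checksum collision detected")
            self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from a list of digests."""
        return b"".join(self.blocks[d] for d in recipe)

    def stored_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(b) for b in self.blocks.values())
```

  Writing `b"AAAABBBBAAAA"` with a 4-byte block size stores two unique blocks (8 bytes) for 12 bytes of input; the third block is just another reference to the first.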
3. Re:Wrong layer by dgatwood · 2010-09-15 13:02 · Score: 3, Interesting
  
  I think it depends on which scheme you're talking about.
  Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).
  De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.
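  One common way to de-duplicate stream fragments is content-defined chunking: cut the stream wherever a rolling hash of the last few bytes hits a chosen value, so cut points follow the content rather than fixed offsets. The sketch below is a simplified illustration, not any vendor's algorithm; `chunk_stream`, the 16-byte window, the polynomial hash constants, and the 64-byte average chunk size are all assumptions.

```python
WINDOW = 16          # bytes covered by the rolling hash (assumed value)
BASE = 257           # polynomial hash base (assumed value)
MOD = 1_000_000_007  # polynomial hash modulus (assumed value)
AVG_SIZE = 64        # rough average chunk size in bytes (assumed value)


def chunk_stream(data, window=WINDOW, avg_size=AVG_SIZE):
    """Split bytes into chunks whose boundaries depend only on nearby content."""
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Cut after position i when the window hash hits the target value.
        if i >= window - 1 and h % avg_size == avg_size - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks
```

  Because each cut depends only on the preceding `window` bytes, inserting a byte at the front of a file changes at most the first chunk; every later chunk comes out identical and can be matched by checksum. Fixed-size blocks would all shift by one byte and match nothing. Real schemes also enforce minimum and maximum chunk sizes, which this sketch leaves out.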
  What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.
  Does that answer the question?
  
  --
  Check out my sci-fi/humor trilogy at PatriotsBooks.
4. Re:Wrong layer by drsmithy · 2010-09-15 15:41 · Score: 3, Insightful
  
  "Filesystems should be doing this."
  No, block devices should be doing this. Then you get the benefits regardless of which filesystem you want to layer on top.
Use ZFS. It offers dedupe, compression, etc. by jgreco · 2010-09-15 11:48 · Score: 3, Informative

ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.
Re:Don't forget to weigh in the cost by h4rr4r · 2010-09-15 13:08 · Score: 3, Insightful

More disk is still so much cheaper it really cannot be justified on that front. More disks also mean more IOPS, so reducing sinning platters can be a bad thing.
There are some reasons to go for it, but even with thousands of clients it may or may not be suitable for what you are doing.
Ya it is by Sycraft-fu · 2010-09-15 13:20 · Score: 3, Insightful

Something you start to appreciate when you are called on to do a really high availability, high reliability system is to have features like this. For one thing it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high end support companies, their shipping time is fairly late so even late in the day it is next day service. What arrives is the drive you need, in its caddy, ready to go.
Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.
You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya, a disk died, and before we got a new one in another died, so sorry, stuff is gone."
Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.
Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.
I already do this by MyLongNickName · 2010-09-15 13:57 · Score: 3, Funny

After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.
I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.
I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.

--
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
Re:Don't forget to weigh in the cost by Krahar · 2010-09-15 15:54 · Score: 3, Insightful

Sinning platters cause original spin.