Data Deduplication Comparative Review
snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."
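For anyone who wants to see the idea from the summary in concrete terms, here is a minimal sketch of fixed-size block deduplication. It is an illustration only, not how any of the reviewed appliances work: the 4096-byte block size, the use of SHA-256 as the block fingerprint, and the function names `dedupe_blocks`, `restore`, and `savings` are all assumptions made for this example.

```python
import hashlib


def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns (store, refs): store maps a SHA-256 digest to the block bytes,
    and refs is the ordered list of digests that stands in for the data.
    Block size and hash choice are assumptions for this sketch.
    """
    store = {}
    refs = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        # Only the first copy of a block is stored; repeats become references.
        store.setdefault(digest, block)
        refs.append(digest)
    return store, refs


def restore(store, refs) -> bytes:
    """Rebuild the original data by following the references in order."""
    return b"".join(store[digest] for digest in refs)


def savings(data: bytes, store) -> float:
    """Fraction of block payload saved, ignoring the cost of the references."""
    if not data:
        return 0.0
    stored = sum(len(block) for block in store.values())
    return 1 - stored / len(data)


if __name__ == "__main__":
    # Three identical blocks followed by one different block.
    sample = b"A" * 4096 * 3 + b"B" * 4096
    store, refs = dedupe_blocks(sample)
    assert restore(store, refs) == sample
    print(len(store), "unique blocks for", len(refs), "references")
    print("payload saved:", savings(sample, store))
```

Note that the references themselves take space (32 bytes per block here), so real savings are lower than the payload figure, and data with few repeated blocks gains nothing.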
Filesystems should be doing this.
ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.
More disk is still so much cheaper it really cannot be justified on that front. More disks also mean more IOPS, so reducing sinning platters can be a bad thing.
There are some reasons to go for it, but even with thousands of clients it may or may not be suitable for what you are doing.
Something you start to appreciate when you are called on to run a really high-availability, high-reliability system is having features like this. For one thing, it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high-end support companies, their shipping cutoff is fairly late, so even late in the day it is next-day service. What arrives is the drive you need, in its caddy, ready to go.
Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.
You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya a disk died and before we got a new one in another died so sorry, stuff is gone."
Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.
Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.
After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.
I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.
I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.
Sinning platters cause original spin.