Data Deduplication Comparative Review
snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."
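The process Schultz describes can be sketched in a few lines. This is a minimal illustration only, assuming fixed-size 4 KB blocks and SHA-256 digests as the "placeholders"; the names `BLOCK_SIZE`, `dedupe`, and `restore` are made up for this sketch and do not describe how any of the reviewed appliances work.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch; real products differ


def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep each unique block once in
    `store`, and return the list of digests (the "placeholders")."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # stored only the first time it is seen
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original bytes by following the placeholders."""
    return b"".join(store[digest] for digest in recipe)


store = {}
# Made-up input: eight identical "A" blocks followed by two identical "B" blocks
data = (b"A" * BLOCK_SIZE) * 8 + (b"B" * BLOCK_SIZE) * 2
recipe = dedupe(data, store)
stored_bytes = sum(len(block) for block in store.values())
print(len(data), "bytes in,", stored_bytes, "bytes of unique blocks kept")
assert restore(recipe, store) == data
```

Ten blocks go in and two unique blocks are kept; the recipe of digests is what lets the original be rebuilt.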
Same as the first.
Filesystems should be doing this.
Give me Classic Slashdot or give me death!
AFAIK this is pretty much how every compression algorithm works. No need to give it a fancy name.
The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe, however we could have had 3 times conventional storage for the same cost.
NetApp is neat and all but horribly overpriced.
For all intensive porpoises your a bunch of rediculous loosers
I only see three. Was data deduplication at work in the article?
Odd that if they reviewed this class of products they didn't review the most common deduping NAS/SAN appliance - the EMC NS-series (particularly NS20).
I can't wait until the Dilbert strip hits where the PHB does this across all their backups and deduplicates them all away, thinking he's just saved a ton of money on backup media.
Redundancy can be a very good thing!
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
Filesystems should be doing this.
The one on your desktop machine, or the primary NAS storage that you access shared data from, or the backup server that ends up getting it all anyway? You see, this is a shared database problem. If your local filesystem does this, then it has to 'share' knowledge of all the unique blocklets with every other server/filesystem that wishes to share in this compressed file space. De-duplication is a means of compression that works across many filesystems - or at least it can be, if it is properly implemented.
"Every time I see an adult on a bicycle, I no longer despair for the future of the human race." - H. G. Wells
Are there any open-source filesystems that offer deduplication?
It seems that the FS du jour changes faster than any of the promised 'optional' features ever materialize.
Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.
Maybe I'm just jaded, but I've been a Linux user for 12 years now. Sometimes it feels like the names of the technologies are changing, but nothing ever gets 'finished'. Maybe the NTFS/BSD model (good core design, long intervals with only minor changes) would be wise in Linux filesystem development.
"Sometimes, I think Trent just needs a cup of hot chocolate and a blankie." -Tori Amos on Nine Inch Nails
However, they have a ton of features, including extremely high performance and reliability. For example, they monitor your unit and if a drive fails, they'll send you one next-day air. Sometimes the first you know of the failure is when a disk shows up at your office.
Don't get me wrong, they aren't the only way to go, we have a much cheaper storage solution for less critical data, but the people who think dropping a bunch of disks in a Linux server gives you the same thing as a NetApp for less cost are fooling themselves.
It is exceedingly high end stuff, which is why it costs so much.
ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.
Didn't Plan 9's filesystem combine journaling and block-level de-duplication years ago?
Proud member of the Weirdo-American community.
This is what I need. I can't swing a dead cat around my head without hitting a bunch of USB drives with fuck knows what's on them. But I can't bring myself to toss them out, or, even less likely, go through them to see if there is anything on them worth saving. Where is all that AI technology that everyone promised me in the 80's? I need an intelligent agent that tells me:
"Listen, dude, this data has gone way off, and has to go. Just look at the expiration date. Chuck this drive tomorrow!"
However, the article seems centered on primary storage, and not the marriage of backup/replication/physical tape, which is Quantum's focus.
Personally, I'd be _terrified_ of using dedup for primary storage. What this does is exactly the opposite of RAID - it squeezes every last bit of redundancy out of your data, and makes everything dependent upon the integrity of your blockpool database. Lose a single blocklet and you stand to lose _all_ of your data.
Compressing common data across many filesystems for things like backups makes a lot more sense, and seems more cost effective.
"Every time I see an adult on a bicycle, I no longer despair for the future of the human race." - H. G. Wells
http://www.opendedup.org/
Aside from the stated goal of reducing duplicate data to increase available storage space, are there any other benefits to de-duplicating your storage? As I understand THAT point: instead of having 20 or 50 copies of the same email that has been forwarded to everybody in the organization 2 or 3 times, you KEEP only 1 copy on the storage space, remove the duplicates, and put "placeholders" in their place which link back to that 1 copy, thereby reducing needed storage space and increasing available space. HOWEVER, if the sig on my email is different than the sig on anyone else's, how are those forwarded emails "duplicates"? And so then what good is it anyway? The forwards usually contain the quotations from each quote, or whatever they call the >>>>>>> marks, so again, how are those duplicates? And what about the "near duplicates"? They just don't get considered because they aren't exact, right? WHAT IS THE POINT?!? Note: I started reading the 8 pages of the linked story from the OP listing, but. . . .
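One way to look at the "near duplicate" question is at the block level rather than the whole-message level. The sketch below is an illustration under loud assumptions: fixed 512-byte blocks, a made-up message body, and made-up signatures appended at the end of each copy. The helper `unique_blocks` is hypothetical, not taken from any product.

```python
import hashlib

BLOCK = 512  # assumed block size for this sketch


def unique_blocks(messages):
    """Return (total_blocks, unique_blocks) over fixed-size blocks of all messages."""
    seen = set()
    total = 0
    for msg in messages:
        for i in range(0, len(msg), BLOCK):
            seen.add(hashlib.sha256(msg[i:i + BLOCK]).digest())
            total += 1
    return total, len(seen)


# Made-up body: 20 distinct 512-byte blocks, so nothing dedupes within one message
body = b"".join(bytes([65 + n]) * BLOCK for n in range(20))
# 50 forwarded copies, each ending in a different (made-up) signature
copies = [body + f"-- \nSent by user {n}\n".encode() for n in range(50)]

total, unique = unique_blocks(copies)
print(total, "blocks written,", unique, "unique blocks stored")
```

With these inputs, 1050 blocks are written but only 70 are unique: the 20 body blocks are stored once and each differing signature costs one extra block. The caveat is that this only works here because the differing text comes last. If the differing text came first, every fixed-size block boundary would shift, which is the problem variable-size (content-defined) chunking is meant to address.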
Data Domain pretty much started this market and is still the best product / market leader. Oddly, they did not review them here, probably because these boxes just don't stack up. All of those reviewed are post-processing dedup appliances, which in my mind suck.
I administer a DataDomain DD660. It's amazing. I have 140TB of backups sitting on 10TB of space. Why not include the market leader in this review? Too expensive?
Except NexentaStor (3.0.3) has a kernel bug inherited from its OpenSolaris upstream (which has gone away, by the way) that hung our Nexenta test box. Not a real good first impression.
Something you start to appreciate when you are called on to do a really high availability, high reliability system is to have features like this. For one thing it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high end support companies, their shipping time is fairly late so even late in the day it is next day service. What arrives is the drive you need, in its caddy, ready to go.
Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.
You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya a disk died and before we got a new one in another died so sorry, stuff is gone."
Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.
Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.
After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.
I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.
I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
I found a ton of stuff I didn't really care for with Nexenta. They've put some good effort into it, and it'd be a fine way to go if you wanted commercial support, but overall it doesn't really seem to fit our needs here. ZFS itself is a resource pig, but on the other hand, resources have become relatively cheap. It's not unthinkable to jam gigs of RAM in a storage server ... today. Five years ago, though, that would have been much more likely to be a deal-breaker.
If the market leader isn't included in the review, I am wondering how worthy this report is.
Looks like Huffman coding to me. Simply make a giant tarball and zip it. Same idea, no licenses due.
The concept of "deduplicating" is nothing new - it's one of the base concepts that data compression is built upon. But not only do we have people touting this "new development" but there are even questions as to its compatibility with compression. Sheesh!
One of the primary compression methods in Zip compression (deflate) uses a 32K buffer and replaces any duplicated data with a pointer/length pair pointing to an earlier occurrence of the data. This operation is computationally lightweight and can be done on the fly by any modern computer. There are open source Zip libraries / code available on the net. Building a cheap Linux box running deflate from Zip in realtime isn't an invention, it's a cheap trick that's being touted as the solution to storage problems.
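The 32K buffer mentioned above is easy to observe with Python's standard `zlib` module, which implements deflate. This is a small experiment rather than a benchmark: the helper `noise` and the sizes are choices made for this sketch. It compresses pseudo-random data that repeats once, first with the repeat 16K back (inside the window) and then 64K back (outside it).

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable


def noise(n):
    """n bytes of pseudo-random, effectively incompressible data."""
    return bytes(rng.getrandbits(8) for _ in range(n))


near = noise(16 * 1024) * 2  # second copy starts 16K after the first: inside the window
far = noise(64 * 1024) * 2   # second copy starts 64K after the first: outside the window

near_ratio = len(zlib.compress(near)) / len(near)
far_ratio = len(zlib.compress(far)) / len(far)
print(f"duplicate 16K apart: {near_ratio:.2f} of original size")
print(f"duplicate 64K apart: {far_ratio:.2f} of original size")
```

The nearby duplicate should shrink to roughly half, while the distant one should not shrink at all, because deflate can no longer see the first copy. That limited lookback appears to be one practical difference from a dedup store that indexes blocks across the whole volume.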
EMC's software is the most buggy, unintuitive, poorly documented and abysmally supported piece of shit I've ever had the displeasure to use. It is simply revolting. Considering how much it costs, it's mind boggling.
(Well, what could I expect from an appliance that runs on Windows 95?)
I have to use that shit and it's obviously designed by complete morons. Seriously, I have to work with a spanking new SAN worth hundreds of thousands and it's so full of bugs I can't imagine why people buy that crap. Oh wait I do know but I can't tell.
Their software's so buggy and poorly designed, it's gotta be written by complete retards.
There is good reason not to review EMC -- so that people stop buying those pieces of shit and sysadmins like myself don't have to endure the misery of having to make those monstrosities work.
Navisphere, I curse you.
DoubleSpace still lives.
Do you actually use the word "ya" in conversation or is it just for posting online?
This is to show you why you saw/heard what you saw/heard is all, from a mgt. perspective this time (rather than yours which seems to be one more of a subordinate and more of one possibly in an engineering title or in such a department):
"The shiny new NetApp appliance that my PHB decided to blow the last of our budget on" - by leathered (780018) on Wednesday September 15, @07:19PM (#33594114)
Mgt. will often blow the last of its budgets on software or hardware of great expense at year end. Why? This is generally done so that budget DOES get "burned" by year end, so you can ask for more next year. Once more - TRY to think of this from a "mgt. perspective" (POV = point of view, per my subject line above), because mo' money = mo' power/control, etc./et al!
----
"saves around 30% by using de-dupe, " - by leathered (780018) on Wednesday September 15, @07:19PM (#33594114)
A savings, better over the LONG haul though, but still a 'savings' is notable, even by you, said mgt.'s opponent... in addition to that, once more? See above again: "rinse, lather, & repeat"...
----
"however we could have had 3 times conventional storage for the same cost." - by leathered (780018) on Wednesday September 15, @07:19PM (#33594114)
True, but disks break down faster than software typically... case in point: Software often runs processor architecture-to-processor-architecture and does so just as cost effectively (where disks do not, e.g. WD disks for desktops 15 yrs. ago, 240mb++ sized (or less even) disks, vs. those today? No comparison in size to use ratio compared to today's disks. However, software algorithms never completely go out of usage, and neither does the wares that use them!)
Especially in the MOST used hardware platform for computing there is in x86 & its derivatives + ancestors for backward compatibility purposes. I'd take that trade-off personally, & based on what is saved in dedupe work? Absolutely, because you're getting back 30% or more (I have seen more in file dedup work result many times, compression being a single example, & other code I have written for file manipulations internals does so for space and cpu usage too as yet another) on the disks you already have, meaning more time and use of them, and on future ones you buy, too!
(Which are only getting bigger & faster, 300gb + 150gb SATA II 10,000rpm 16mb buffered PRT utilizing WD "Velociraptor" drives utilizing user here, in addition to a Promise SuperTrak Ex8350 128mb Caching PCI-E RAID 6 SATA II controller here on them both, and a Gigabyte IRAM 4gb DDR-2 RAM True Non-Flash based SSD doing my %temp% ops, logging, webpage caching, pagefile.sys placement & more on it to off load my HDD's, fast as they are (serious into disk performance here is why. Speed up the slowest part of anything? You gain a 1,000 fold in my estimation personally...)).
----
"NetApp is neat and all but horribly overpriced." - by leathered (780018) on Wednesday September 15, @07:19PM (#33594114)
For the short term vs. disks? Maybe. I am swayed by that view you hold myself also, because it has merit from an engineering or user's POV (POV = point of view). However/Again, from a mgt. POV (POV = point of view)?
Once more - NetApp is that "budget blower" that works well by giving you 30++% or more life out of what you have now AND IN THE FUTURE in say, disks as you noted yourself, AND it's again also great for mgt. - in order to "burn budget" so they can ask for more next year for say, lol, those disks you mention (SATA III by the time next year end happens), for example, and NetApp DOES have a demonstrable "ROI" which you yourself even conceded... though I have seen deduplication work for gains much larger than 30% as you noted, and so have you, or anyone here reading I wager... compression, anyone? Compression ALONE deals heavily in dedup, and it's commonplace in nearly all we do online (e.g. -> in file transfers, data transfers, you name it!).
APK
In the realm of big IT, where I have about 13PB of backup data on deduplicated disk, we didn't even look at these products. Data Domain's DD690 and 880 are overall excellent but can't compress Oracle data if their lives depended on it. At least not the ways that the DBAs like to back up Oracle. IBM's Diligent product is a fantastic piece of technology for both Open and Mainframe systems, but is VTL based and does not come cheap.
Optimized replication between sites is one of the best parts of dedupe, even over storage. If something can actually get 10x compression then that 1GbE link I have between locations functionally acts more like a 10GbE for no more cost. A huge boost on the WAN for DR.
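The arithmetic behind that 1GbE-acts-like-10GbE remark can be written out. This is a back-of-the-envelope sketch only: it assumes that only the reduced data crosses the link and ignores protocol overhead and the cost of computing the reduction. The function names and the 10 TB figure are hypothetical.

```python
def effective_gbps(link_gbps: float, reduction_ratio: float) -> float:
    """Logical data rate if only 1/reduction_ratio of the bytes cross the wire."""
    return link_gbps * reduction_ratio


def transfer_hours(logical_tb: float, link_gbps: float, reduction_ratio: float) -> float:
    """Hours to replicate logical_tb terabytes (decimal) over the link."""
    bits_on_wire = logical_tb * 1e12 * 8 / reduction_ratio
    return bits_on_wire / (link_gbps * 1e9) / 3600


print(effective_gbps(1, 10), "Gb/s effective on a 1GbE link at 10x")
print(round(transfer_hours(10, 1, 1), 1), "hours for 10 TB without reduction")
print(round(transfer_hours(10, 1, 10), 1), "hours for 10 TB at 10x")
```

Under these assumptions a hypothetical 10 TB replication drops from about 22.2 hours to about 2.2 hours on the same link.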
I know this is functionally different, but I have a Windows Home Server backing up my PCs at home. WHS apparently uses a "Single Instance Store" model to store backup sets. If the same file is detected on multiple computers, it is stored only once saving storage space. I'm backing up the C: drive of three Windows 7 Home Premium PCs to the WHS. Each PC uses between 40 and 60GB of space, yet the backup sets on the WHS total only 80GB. I'm sure that could partly be attributed to compression, but still, this seems to be pretty cool.
My mom always said, "Jim, you're 1 in a million." Given the current population, there are 7000 of me. God help us all!
I wrote a simple Python program that does file based disk de-duplication. It will work anywhere Python runs as long as the filesystem supports hard links.
It is available under GPL at: http://jdeifik.com/
With the hardware solutions in the article, prices start at $25k or so, and you are beholden to the hardware vendor. Doing it in software will work with any reasonable OS, any reasonable filesystem, and any hardware. Sure, block based de-duplication is more efficient, but it is filesystem specific, and right now ZFS is the only somewhat reasonable filesystem that supports it.
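For readers wondering what file-based de-duplication with hard links looks like in practice, here is a minimal sketch. It is not the program linked above, whose internals are not shown here. The names `file_digest` and `dedupe_tree` are made up, and the sketch assumes everything under the root lives on one filesystem.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_tree(root):
    """Replace byte-identical regular files under root with hard links.
    Returns the number of files that were replaced."""
    first_seen = {}  # digest -> path of the copy we keep
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a regular file
            digest = file_digest(path)
            keeper = first_seen.setdefault(digest, path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already linked to it
            tmp = path + ".dedupe-tmp"  # hypothetical temporary name
            os.link(keeper, tmp)        # create the new link first...
            os.replace(tmp, path)       # ...then swap it over the duplicate
            replaced += 1
    return replaced
```

Calling `dedupe_tree` on a directory returns how many duplicates were replaced. Two caveats apply: hard-linked files share one set of permissions and timestamps, and a write through one name changes the content seen through every other name, so this suits read-mostly data such as backups better than live working files.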
How could they perform this comparison and leave out DataDomain and Exagrid? Just about a worthless review.
I've used the ZFS dedup and I've used the others. The ZFS dedup still sucks and doesn't do 1/10 of the reduction that these others already do. I'm a Solaris and ZFS fan myself, but there's a reason these other alternatives cost as much as they do -- because they work and they work WELL. I've done tests with identical data on ZFS and the other high dollar storage appliances, and ZFS can't break a dedup ratio of 2x. The others were going past 20x with the same data. ZFS dedup is still too young to compete with these big players.
The ZFS-FUSE setup is fantastic. For most things you are very much limited by platter speed; I've found the performance to be quite acceptable.
As far as stability goes, the 0.6.9 release, which has been out for around 3 or 4 months, has been exceptional. I did extensive stress testing of it over the last 9 months or so, and all the issues I found were resolved (quickly) by the ZFS-FUSE folks.
I currently have a 16TB backup system running with something around 2,000 snapshots and 80% space used, and it works just great. I also have a personal storage system that I have been running ZFS-FUSE on for around 3 years now, and it also has been great. I was originally running 0.5.0 on it but upgraded to the 0.6.9 after my above stress testing. It also has been great.
I used Nexenta in the past, it was ok but I think there were definite ZFS issues in it at that time, maybe 3 years ago. The systems I had would reboot every 30 to 60 days. Then I upgraded it to the latest Nexenta a year or 18 months later and had all sorts of data loss issues while trying to do the zfs send/recv from the old systems. An annoyance with Nexenta was the hardware support; coming from Linux which supports just about anything to having to dig around to find compatible storage controllers... I recently (maybe 10 months ago now, before I started really hard down the ZFS-FUSE testing path) tried OpenSolaris and ran into some weirdness where I did the install, then did an update and it spent an hour downloading updates, then bombed out. So I started it again and it spent an hour downloading updates and bombed out.
There are two reasons I'm using ZFS-FUSE so heavily: 3 years ago there was no option for encrypting my ZFS storage system in Solaris, and I just am so much more comfortable with Linux than Solaris. My home storage system stores a lot of private data, that I want to have close at hand at home, but if someone steals it I don't want to worry about the scanned check images and bills we have saved there, etc... So crypto was a huge deal for me in that server.
Sean