Open Source Deduplication For Linux With Opendedup

Posted by timothy on Saturday March 27, 2010 @03:31PM from the its-missing-apostrophes dept.

tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."

In case you don't know much about it by stoolpigeon · 2010-03-27 15:32 · Score: 5, Informative

Data deduplication
( I don't )

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
1. Re:In case you don't know much about it by MyLongNickName · 2010-03-27 15:52 · Score: 4, Informative
  
  Data deduplication is huge in virtualized environments. Four virtual servers with identical OS's running on one host server? Deduplicate the data and save a lot of space.
  This is even bigger in the virtualized desktop environment where you could literally have hundreds of PCs virtualized on the same physical box.
  
  --
  See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
2. Re:In case you don't know much about it by rubycodez · 2010-03-27 16:33 · Score: 2, Informative
  
  hundreds of virtualized desktops per physical server does happen, my employer sells such solutions from several vendors.
3. Re:In case you don't know much about it by MyLongNickName · 2010-03-27 16:51 · Score: 3, Informative
  
  If you have a couple hundred people running business apps, it ain't all that difficult. Generally you will get spikes of CPU utilization that last a few seconds mashed between many minutes, or even hours of very low CPU utilization. A powerful server can handle dozens or even hundreds of virtual desktops in this type of environment.
  
  --
  See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
4. Re:In case you don't know much about it by zappepcs · 2010-03-27 17:12 · Score: 4, Informative
  
  In a word, No. There are many types of 'virtualization' and more than one approach to de-duplication. In a system as engineered as one with de-duplication, you should have replication as part of the data integrity processes. If the file is corrupted in all the main copies (everywhere it exists, including backups) then the scenario you describe would be correct. This is true for any individual file that exists on computer systems today. De-duplication strives to reduce the number of copies needed across some defined data 'space' whether that is user space, or server space, or storage space etc.
  This is a problem in many aspects of computing. Imagine you have a business with 50 users. Each must use a web application which has many graphics. The browser cache of each user has copies of each of those graphics images. When the caches are backed up, the backup is much larger than it needs to be. You can do several things to reduce backup times and storage space and to improve user quality of service:
  1 - disable caching for that site in the browser and cache them on a single server locally located
  2 - disable backing up the browser caches, or back up only one
  3 - enable deduplication in the backup and storage processes
  4 - implement all or several of the above
  The problems are not single ended and the answers or solutions will also not be single ended or faceted. That is, no one solution is the answer to all possible problems. This one has some aspects to it that are appealing to certain groups of people. Your average home user might not be able to take advantage of this yet. Small businesses, though, might need to start looking at this type of solution. Think how many people got the same group email message with a 12MB attachment. How many times do all those copies get archived? In just that example you see the waste that duplicated data represents. Solutions such as this offer an affordable way to positively affect bottom lines in fighting those types of problems.
  
  --
  Support NYCountryLawyer RIAA vs People
5. Re:In case you don't know much about it by fatp · 2010-03-27 17:54 · Score: 2, Funny
  
  Oh in fact it requires jdk 7...
6. Re:In case you don't know much about it by GNUALMAFUERTE · 2010-03-27 18:48 · Score: 2, Funny
  
  Hey, slow down cowboy. Explain that concept to me again. I don't know if it's applicable here, but if we find a way to implement it, it might just prove revolutionary.
  I work in the quality assurance department of Geeknet Inc, Slashdot's parent company. We are constantly looking for ways to improve all the sites on our network.
  I don't know if this method you propose, that, if I understand correctly, would involve parsing the content of the html document linked, and having an editor analyze the output of such html document after being rendered (let's call it, reading the story), is at all possible. But if we implement it the right way, it might prove useful.
  We'll get our research team to work over this reading-the-story concept. It's something absolutely novel to us, so it might take a while. We'll let you know when we reach a conclusion, so that we might license this reading-the-story technology from you.
  Kind Regards,
  Lazy Rodriguez
  GeekNet INC.
  
  --
  WTF am I doing replying to an AC at 5 A.M on a Friday night?
7. Re:In case you don't know much about it by DarkOx · 2010-03-27 23:10 · Score: 2, Informative
  
  It really is hundreds, on a modern Nehalem core system with 64 gigs of memory or so. We used to do dozens on each node in a Citrix farm back in the PIII days.
  
  --
  Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
8. Re:In case you don't know much about it by Degrees · 2010-03-28 04:27 · Score: 2, Interesting
  
  It is one of those things that once you start using it, the benefits become apparent.
  Here are some:
  1) One application on one machine. No more wondering if application X has somehow messed up application Y. The writers of the software probably developed the application in a clean environment, and this lets you run it in a clean environment. Gets rid of vendor finger-pointing, too.
  2) One application on one machine. If application X fouls the nest, you can reboot it and know that you are not also terminating applications Y, Z, A, and B.
  3) Machine portability. The drivers in a VM guest are generic -and- uniform. Nothing inside the (guest) machine changes if you move the machine from a host with an Intel NIC to a host with a Broadcom NIC. The benefit here is that when hardware fails (and it will), it is pretty quick and easy to assign the boot disk to a different host, and boot the machine up. Think 10 - 30 minutes (per machine) to recover from a burned up power supply*.
  4) Machine portability. There are some solutions that let you auto-fail-over to a new host when the guest stops responding. That burned up power supply could now be a two minute outage and NO emergency notification call.
  5) Machine portability. Platespin lets you auto-migrate machines on a schedule to a few blades at night, power down those blades for power savings, and then power them up a little before business hours and migrate back. In a large data center, the electricity savings is enough to make it worth it.
  6) Machine flexibility. Does application X not need much in the way of processing power? With the VM manager software, assign it one CPU and 256 MB RAM. Later find out that wasn't enough? Up the specs and reboot.
  7) Reboot speed. In paravirtualized environments, the OS is already loaded in the host VM, so the guest VM just links and loads. I've seen entire machine reboots that take 16 seconds.
  Along these lines, an anecdote from my life: How to add RAM to a server so nobody notices: virtualize
  Hope this helps explain why some people are such a fan of virtualization.
  *This is really a benefit that comes from disconnecting the machine from its disks, but VM and SAN go exceptionally well together.
  
  --
  "The most sensible request of government we make is not, "Do something!" But "Quit it!"
9. Re:In case you don't know much about it by Eil · 2010-03-28 04:40 · Score: 2, Interesting
  
  Almost every mission critical system these days is running in either a clustered or virtualized environment. I work in the financial services industry and there are many reasons we virtualize pretty much everything these days. These, however, are probably the biggies:
  - Redundancy: If a physical machine dies, its virtual machines can be moved over to a spare, often with no interruption in service.
  - Isolation: Just because you can run multiple services on a box doesn't mean you should. It poses potential security problems (one compromised app can open the door to compromise another), makes managing users and resources more difficult, and the applications can interact or conflict in unexpected ways. Many vendors demand that their application be the only one running on a machine or they won't support it.
  - Portability: An OS configured for use on a virtual machine can be run on any platform which runs the virtual machine without modification.
This is for hard disks by ZERO1ZERO · 2010-03-27 15:36 · Score: 2, Interesting

Does software like ESX and others (Xen etc) perform this in memory already for running VMs? I.e. if you have 2 Windows VMs, will it only store one copy of the libs etc in the host's memory?
Also, is there an easy way to get multiple machines running 'as one' to pool resources for running a VM setup? Does openMosix do that?
1. Re:This is for hard disks by fatp · 2010-03-27 17:49 · Score: 2, Funny
  
  I really googled "memory deduplication $VM_TECH"... It returned this post as the only result
  
  what an idiot I am T.T
2. Re:This is for hard disks by Island+Admin · 2010-03-27 22:14 · Score: 2, Funny
  
  Go to your browser preferences - uncheck "enable Great Firewall of China". ;)
Hasn't this been posted before? by Required+Snark · 2010-03-27 15:40 · Score: 5, Funny

Just wondering...

--
Why is Snark Required?
Yea, I RTFA, but... by mrsteveman1 · 2010-03-27 16:10 · Score: 2, Interesting

......from what i can tell, this is NOT a way to deduplicate existing filesystems or even layer it on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.
So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.
1. Re:Yea, I RTFA, but... by dlgeek · 2010-03-27 16:23 · Score: 3, Informative
  
  You could easily write a script to do that using find, sha1sum or md5sum, sort and link. It would probably only take about 5-10 minutes to write but you most likely don't want to do that. When you modify one item in a hard linked pair, the other one is edited as well, whereas a copy doesn't do this. Unless you are sure your data is immutable, this will lead to problems down the road.
  
  Deduplication systems pay attention to this and maintain independent indexes to do copy-on-write and the like to preserve the independence of each reference.
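
The hard link caveat above can be shown with a minimal sketch, assuming Python 3 and two identical files on the same filesystem. The function names and the `.dedup-tmp` suffix are hypothetical and only for illustration; this is not how fdupes or any other tool mentioned in the thread does it.

```python
import os
import tempfile


def replace_with_hardlink(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` (same filesystem only)."""
    tmp = duplicate + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, tmp)              # create the new link first
    os.replace(tmp, duplicate)      # then atomically swap it over the duplicate


def demo():
    """Show that an in-place edit through one name is visible through the other."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        for path in (a, b):
            with open(path, "w") as f:
                f.write("same contents\n")

        replace_with_hardlink(a, b)
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino

        # Append through the name "b" ...
        with open(b, "a") as f:
            f.write("edited via b\n")
        # ... and read back through the name "a".
        with open(a) as f:
            return same_inode, f.read()
```

After `demo()` runs, both names point at one inode, so the text appended through `b.txt` also shows up in `a.txt`. That is exactly the loss of independence described above, and it is why this approach is only reasonable for data that is never edited in place.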
2. Re:Yea, I RTFA, but... by Lorens · 2010-03-27 16:28 · Score: 2, Interesting
  
  I wrote fileuniq (http://sourceforge.net/projects/fileuniq/) exactly for this reason. You can symlink or hardlink, decide how identical a file must be (timestamp, uid...), or delete.
  It's far from optimized, but I accept patches :-)
This just gave me a good idea! by thePowerOfGrayskull · 2010-03-27 16:17 · Score: 3, Interesting

Actually, just the title did it. I've historically had a bad habit of backing things up by taking tar/gzs of directory structures, giving them an obscure name, and putting them onto network storage. Or sometimes just copying directory structures without zipping first. Needless to say, this makes for a huge mess.
Just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree; run sha1sum on all files; and identify duplicate files automatically; probably in just one or two lines.
So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...
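
A rough sketch of the script described above, assuming Python 3 rather than a shell one-liner. The function names are made up for illustration. It groups files by size first so that only plausible duplicates get hashed, then reports groups that share a SHA-1 digest.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, bufsize=1 << 20):
    """Return the SHA-1 hex digest of a file, read in 1 MiB pieces."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under `root` whose contents hash identically."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_hash[sha1_of(path)].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates("/path/to/extracted/backups")` (a placeholder path) returns one list per set of identical files. Deciding what to do with each set is left out on purpose.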
1. Re:This just gave me a good idea! by Hooya · 2010-03-27 17:06 · Score: 3, Informative
  
  try this:
  mv backup.0 backup.1
  rsync -a --delete --link-dest=../backup.1 source_directory/ backup.0/
  see this
2. Re:This just gave me a good idea! by Z8 · 2010-03-28 04:25 · Score: 2, Informative
  Yep, and then you don't have to worry about:
  - Changes in permissions/mtimes/atimes corrupting all your old backups because all of them are hard linked, or alternatively
  - Changes in permissions/mtimes/atimes causing an entire file to get copied
  There are also other things to worry about. To be fair, the guy who invented --link-dest wrote a backup program called Dirvish so that is a better comparison to rdiff-backup.
Offtopic? by SanityInAnarchy · 2010-03-27 16:18 · Score: 3, Informative

If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.
And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.

--
Don't thank God, thank a doctor!
Re:A hypothetical question. by tlhIngan · 2010-03-27 16:25 · Score: 2, Interesting

I appreciate any deduplication solution for linux for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?
Most likely in the implementation itself, not the de-duplication process.
Let's say user A and B have some file in common. Without de-duplication, the file exists on both home directories. With de-duplication, one copy of the file exists for both users. Now, if there is an exploit such that you could find out if this has happened, then user A or B will know that the other has a copy of the same file. That knowledge could be useful.
Ditto on critical system files - if you could generate a file and have it match a protected system file, this might be useful to exploit the system. E.g., /etc/shadow (which isn't normally world-readable). If you can find a way to tell the deduplication happens, you can get access to these critical files for other purposes.
Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge. But the de-duplication mechanism must inadvertently reveal when this happens.
Re:Let's get down to brass tacks. by Hooya · 2010-03-27 17:01 · Score: 3, Funny

very repetitive. back and forth. back and forth. oh wait... that's not what you meant. never mind.
Re:How useful is this in realistic scenarios? by QuantumRiff · 2010-03-27 17:19 · Score: 3, Informative

Say you cut up a large file into lots of chunks of whatever size, let's say 64KB each. Then you look at the chunks. If you have two chunks that are the same, you remove the second one and just place a pointer to the first one. Data deduplication is much more complicated than that in real life, but basically, the more data you have, or the smaller the chunks you look at, the more likely you are to have duplication, or collisions. (How many Word documents have the same few words in a row? Remove every repeat of the phrase "and then the" and replace it with a pointer, if you will.)
This is also similar to WAN acceleration, which at a high enough level, is just deduplicating traffic that the network would have to transmit.
It is amazing how much space you can free up when you're not just looking at the file level. This has become very big in recent years, because storage has exploded and processors are finally fast enough to do this in real time.
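
The chunk-and-pointer idea above can be sketched in a few lines of Python. This is a toy under stated assumptions: fixed 64KB chunks, SHA-256 digests standing in for the "pointers", and everything held in memory. The `ChunkStore` class is hypothetical and is not how SDFS or any real product is implemented.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB, the chunk size used in the example above


class ChunkStore:
    """Toy fixed-size chunk store: each distinct chunk is kept only once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def put(self, data):
        """Store `data`; return its recipe, a list of digests in order."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            # A repeated chunk is not stored again; the digest points at it.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(c) for c in self.chunks.values())
```

Storing three all-zero chunks followed by one chunk of other data yields a four-entry recipe but only two stored chunks, and `get` still returns the original bytes.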

--

What are we going to do tonight Brain?
Re:Patent 5,813,008 by pem · 2010-03-27 17:36 · Score: 2, Interesting

A good lawyer could probably argue that this doesn't apply.
Claim 1(a) requires "dividing an information item into a common portion and a unique portion".
It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.
(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)
Re:How useful is this in realistic scenarios? by mysidia · 2010-03-27 18:03 · Score: 4, Informative

First of all.... one of the most commonly duplicated blocks is the NUL block, that is a block of data where all bits are 0, corresponding with unused space, or space that was used and then zero'd.
If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have about 20GB that could be freed up by dedup.
Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.
Even on a single system, many system binaries and libraries, will contain duplicate blocks.
Of course multiple binaries statically linked against the same libraries will have dups.
But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.
Then if the system actually contains user data, there is probably duplication within the data.
For example, mail stores... will commonly have many duplicates.
One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.
If users store files on the system, they will commonly make multiple copies of their own files..
Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc
Can MS Word files be large enough to matter? Yes.. if you get enough of them.
Besides, they have common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality a small amount of waste multiplied by thousands of files adds up.
Just because data seems to be all different doesn't mean dedup won't help with storage usage.
Re:deduplication by GNUALMAFUERTE · 2010-03-27 19:11 · Score: 2, Funny

So, Blade Runner was about de-duplication?

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-27 19:13 · Score: 3, Informative

I wonder how much this approach really buys you in "normal" scenarios especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference but can someone enlighten me how this is useful on say a physical system with 10 Centos VMs running different apps or similar apps with different data? You might save a few blocks because of the shared OS files but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.
Assume 200 VMs at, say, 2GB per OS install. Allowing for some uniqueness, you'll probably end up using something in the ballpark of 20-30GB of "real" space to store 400GB of "virtual" data. That's a *massive* saving, not only in disk space, but also in IOPS, since any well-engineered system will carry that deduplication through to the cache layer as well.
Deduplication is *huge* in virtual environments. The other big place it provides benefits, of course, is D2D backups.
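
To put numbers on the estimate above: storing 400GB of logical data in 20 to 30GB works out to a dedup ratio of roughly 13:1 to 20:1, or 92.5% to 95% of the space saved. A small Python sketch of that arithmetic, using only the figures from the comment (the helper name is made up):

```python
def dedup_savings(logical_gb, physical_gb):
    """Return (dedup ratio, fraction of space saved)."""
    ratio = logical_gb / physical_gb
    saved_fraction = 1 - physical_gb / logical_gb
    return ratio, saved_fraction


logical = 200 * 2  # 200 VMs at 2GB each = 400GB of "virtual" data
for physical in (20, 30):
    ratio, saved = dedup_savings(logical, physical)
    print(f"{physical}GB stored: {ratio:.1f}:1 ratio, {saved:.1%} saved")
```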
Re:deduplication by nacturation · 2010-03-27 22:10 · Score: 4, Funny

What kind of lame recursive acronym is "deduplication"? I'm flummoxed in any attempt to decipher it.
Deduplication Eases Disk Utilization Purposefully Linking Information Common Among Trusted Independent Operating Nodes

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
See Also LESSFS by sharper56 · 2010-03-27 22:37 · Score: 4, Interesting

Another nice OpenSource FS De-Dup project to look into is LESSFS.
Block-level de-dup and good speed. Also offers per block encryption and compression.
I'm using it to back up VMs. 2TB of raw VMs plus 60 days of changes store down to 300GB. Write to de-dup FS is > 50MB/s.
1. Re:See Also LESSFS by phoenix_rizzen · 2010-03-28 14:18 · Score: 2, Interesting
  
  ZFS also offers block-level dedupe support since ZFSv21. You can run it via FUSE on Linux, or natively on OpenSolaris. Hopefully, it'll also be available in FreeBSD 9.0 if not sooner (FreeBSD 7.3/8.0 have ZFSv14).
  Since ZFS already checksums every block that hits the disk, dedupe is almost free, as those checksums are re-used for finding/tracking duplicate blocks.
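
The point about reusing checksums can be illustrated with a toy table keyed by block checksum, with a reference count per block. This is only a sketch of the general idea, assuming SHA-256 and in-memory dictionaries. The `DedupTable` class is hypothetical and does not reflect ZFS's actual on-disk structures.

```python
import hashlib


class DedupTable:
    """Toy dedup table: blocks are indexed by checksum and reference counted."""

    def __init__(self):
        self.blocks = {}    # checksum -> block data
        self.refcount = {}  # checksum -> number of references

    def write(self, block):
        """Store a block, or just add a reference if it is already present."""
        key = hashlib.sha256(block).hexdigest()
        if key in self.refcount:
            self.refcount[key] += 1  # duplicate: no new data is written
        else:
            self.blocks[key] = block
            self.refcount[key] = 1
        return key

    def free(self, key):
        """Drop one reference; release the block when none remain."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.blocks[key]
```

Because the checksum is computed on every write anyway, the only extra work for dedup in this model is the table lookup and the counter update.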