Open Source Deduplication For Linux With Opendedup
tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."
Data deduplication
( I don't )
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Also, is there easy way to get multiple machines running 'as one' to pool resources for running a vm setup? Does openmosix do that?
Just wondering...
Why is Snark Required?
......from what i can tell, this is NOT a way to deduplicate existing filesystems or even layer it on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.
So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.
Just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree; run sha1sum on all files; and identify duplicate files automatically; probably in just one or two lines.
So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...
If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.
And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.
Don't thank God, thank a doctor!
Most likely in the implementation itself, not the de-duplication process.
Let's say user A and B have some file in common. Without de-duplication, the file exists on both home directories. With de-duplication, one copy of the file exists for both users. Now, if there is an exploit such that you could find out if this has happened, then user A or B will know that the other has a copy of the same file. That knowledge could be useful.
Ditto on critical system files - if you could generate a file and have it match a protected system file, this might be useful to exploit the system. E.g., /etc/shadow (which isn't normally world-readable). If you can find a way to tell the deduplication happens, you can get access to these critical files for other purposes.
Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge. But the de-duplication mechanism must inadvertently reveal when this happens.
very repetitive. back and fourth. back and fourth. oh wait... that's not what you meant. never mind.
If you cut up a large file into lots of chunks of whatever size, lets say 64KB each. Then, you look at the chunks. If you have two chunks that are the same, you remove the second one, and just place a pointer to the first one. Data Deduplication is much more complicated than that in real life, but basically, the more data you have, or the smaller the chunks you look at, the more likely you are to have duplication, or collisions. (how many word documents have a few words in a row? remove every repeat of the phrase "and then the" and replace it with a pointer, if you will).
This is also similar to WAN acceleration, which at a high enough level, is just deduplicating traffic that the network would have to transmit.
It is amazing how much space you can free up, when your not just looking at the file level. This has become very big in recent years, cause storage has exploded, and processors are finally fast enough to do this in real-time.
What are we going to do tonight Brain?
Claim 1(a) requires "dividing an information item into a common portion and a unique portion".
It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.
(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)
First of all.... one of the most commonly duplicated blocks is the NUL block, that is a block of data where all bits are 0, corresponding with unused space, or space that was used and then zero'd.
If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have at least 25GB that could be freed up by dedup.
Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.
Even on a single system, many system binaries and libraries, will contain duplicate blocks.
Of course multiple binaries statically linked against the same libraries will have dups.
But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.
Then if the system actually contains user data, there is probably duplication within the data.
For example, mail stores... will commonly have many duplicates.
One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.
If users store files on the system, they will commonly make multiple copies of their own files..
Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc
Can MS Word files be large enough to matter? Yes.. if you get enough of them.
Besides they have common structure that is the same for almost all MS Word files. Even documents' whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality: a small amount of waste multiplied by thousands of files, adds up.
Just because data seems to be all different doesn't mean dedup won't help with storage usage.
So, Blade Runner was about de-duplication?
WTF am I doing replying to an AC at 5 A.M on a Friday night?
I wonder how much this approach really buys you in "normal" scenarios especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference but can someone enlighten me how this is useful on say a physical system with 10 Centos VMs running different apps or similar apps with different data? You might save a few blocks because of the shared OS files but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.
Assume 200 VMs at, say, 2GB per OS install. Allowing for some uniqueness, you'll probably end up using something in the ballpark of 20-30GB of "real" space to store 400GB of "virtual" data. That's a *massive* saving, not only disk space, but also in IOPS, since any well-engineered system will carry that deduplication through to the cache layer as well.
Deduplication is *huge* in virtual environments. The other big place it provides benefits, of course, is D2D backups.
What kind of lame recursive acronym is "deduplication"? I'm flummoxed in any attempt to decipher it.
Deduplication Eases Disk Utilization Purposefully Linking Information Common Among Trusted Independent Operating Nodes
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Another nice OpenSource FS De-Dup project to look into is LESSFS.
Block-level de-dup and good speed. Also offers per block encryption and compression.
I'm using it backup VMs. 2TB of raw VMs plus 60 days of changes store down to 300GB. Write to de-dup FS is > 50MB/s.