Open Source Deduplication For Linux With Opendedup
tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."
Data deduplication
( I don't )
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Just wondering...
Why is Snark Required?
First of all.... one of the most commonly duplicated blocks is the NUL block, that is a block of data where all bits are 0, corresponding with unused space, or space that was used and then zero'd.
If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have at least 25GB that could be freed up by dedup.
Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.
Even on a single system, many system binaries and libraries, will contain duplicate blocks.
Of course multiple binaries statically linked against the same libraries will have dups.
But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.
Then if the system actually contains user data, there is probably duplication within the data.
For example, mail stores... will commonly have many duplicates.
One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.
If users store files on the system, they will commonly make multiple copies of their own files..
Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc
Can MS Word files be large enough to matter? Yes.. if you get enough of them.
Besides they have common structure that is the same for almost all MS Word files. Even documents' whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality: a small amount of waste multiplied by thousands of files, adds up.
Just because data seems to be all different doesn't mean dedup won't help with storage usage.
What kind of lame recursive acronym is "deduplication"? I'm flummoxed in any attempt to decipher it.
Deduplication Eases Disk Utilization Purposefully Linking Information Common Among Trusted Independent Operating Nodes
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Another nice OpenSource FS De-Dup project to look into is LESSFS.
Block-level de-dup and good speed. Also offers per block encryption and compression.
I'm using it backup VMs. 2TB of raw VMs plus 60 days of changes store down to 300GB. Write to de-dup FS is > 50MB/s.