Open Source Deduplication For Linux With Opendedup
tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."
Data deduplication
( I don't )
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Also, is there an easy way to get multiple machines running 'as one' to pool resources for running a VM setup? Does openMosix do that?
I appreciate any deduplication solution for Linux for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?
Just wondering...
Why is Snark Required?
What kind of lame recursive acronym is "deduplication"?
I'm flummoxed in any attempt to decipher it.
"I believe in Karma. That means I can do bad things to people all day long and I assume they deserve it." : Dogbert
Does this mean I'll finally be able to store my entire porn collection on a single volume?
Yeah, I gave up on bitching about code inefficiency back in the early 90s. Do they even teach assembly any more?
If you are storing that amount of data wouldn't you use a SAN and don't most already have data de-duplication technology? I suppose this project will be pillaged by all of the backup appliance MFG's and those who build consumer grade NAS devices
...From what I can tell, this is NOT a way to deduplicate existing filesystems or even layer it on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.
So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.
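For what it's worth, the a+b case described above can be sketched in a few lines of Python. This is only a rough sketch, not an existing tool: `replace_with_hardlink` and the `.hardlink-tmp` suffix are made-up names, it assumes both files live on the same filesystem, and it throws away b's own owner, mode and timestamps, since a hard link shares the inode of a.

```python
import filecmp
import os


def replace_with_hardlink(keep: str, dup: str) -> bool:
    """Replace `dup` with a hard link to `keep` if their contents are identical.

    Returns True if a link was made, False if the pair was skipped.
    """
    keep_stat, dup_stat = os.stat(keep), os.stat(dup)
    # Hard links cannot cross filesystems.
    if keep_stat.st_dev != dup_stat.st_dev:
        return False
    # Already the same inode: nothing to recover.
    if keep_stat.st_ino == dup_stat.st_ino:
        return False
    # Compare byte for byte rather than trusting size or mtime.
    if not filecmp.cmp(keep, dup, shallow=False):
        return False
    # Link under a temporary name (hypothetical suffix), then rename over
    # the duplicate so there is never a moment when `dup` does not exist.
    tmp = dup + ".hardlink-tmp"
    os.link(keep, tmp)
    os.replace(tmp, dup)
    return True
```

Keep in mind that once the two names share an inode, editing one in place changes the other, so this only makes sense for data you treat as read-only.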
with NexentaStor CE, which is based on OpenSolaris b134. It's free.. and has an excellent Storage WebUI. /plug
For a detailed explanation of OpenSolaris dedup see this blog entry.
~Anil
http://dilemma.gulecha.org - My philosophical short film.
I wonder how well it performs, or if this is just functionality for demonstration purposes?
Given that most of the disk space is usually swallowed by the data of an application, and that data is rarely identical to the data on another system (why would you have two systems then?), I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.
Just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree; run sha1sum on all files; and identify duplicate files automatically; probably in just one or two lines.
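Not quite one or two lines, but close. Here is a rough Python equivalent of that sha1sum script, as a sketch only: `sha1_of_file` and `find_duplicates` are names made up for this example, and it treats a matching SHA-1 as "identical" without comparing the bytes afterwards.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, read in chunks to bound memory use."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Walk `root` and return lists of paths whose contents hash the same."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[sha1_of_file(path)].append(path)
    # Only digests seen more than once are duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates` on the extracted tree gives one list per group of identical files; what to do with each group (delete, link, report) is left to the caller.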
So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...
If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.
And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.
Don't thank God, thank a doctor!
Both VMware and KVM can do this. Not sure about Xen. Google "memory deduplication $VM_TECH"
I usedthichnlgywrp!
[...] Opendedup runs in user space, making it platform independent, easier to scale and cluster, [...]
... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Consider that things may be spread over more than one SAN or that it is a situation where an old style file server makes better sense anyway.
Can anyone offer wisdom on what the volume size is supposed to signify, being different from the maximum size that SDFS is scalable to?
Why would anyone keep their blocks so cold?
September 22, 1998.
Single instance storage of information
They are very poor programmers. Almost nothing works in retarded clam shell (rcsh).
I stopped being able to read English. WTF does any of that mean? Is it written in moonspeak?
One of the biggest targets for data de-duplication is efficient off-site replication, which you see in the EMC Avamar product line. This is advantageous when your WAN links aren't fast enough for synchronous replication and a scheduled asynchronous replication would take too long. I'd like to see the SDFS storage engine be intelligent enough to snapshot the data. Then, when the next "backup/replication" occurs, it would gather up the hashes of all the blocks that have changed since the snapshot was created, communicate those hashes to the off-site system, and transfer just the blocks that don't yet have a matching hash on the target system. The target system would receive a complete hash table update of the snapshot block difference from the source; both systems would then merge their snapshots and take a new snapshot to get ready for the next replication cycle.
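To make the wish list above concrete, here is a toy Python sketch of the hash exchange. It is not how SDFS or Avamar actually work: `Target`, `replicate` and `block_hash` are invented for illustration, the "WAN" is just a method call, SHA-256 is an arbitrary choice of hash, and snapshots are left out entirely.

```python
import hashlib


def block_hash(block: bytes) -> str:
    """Hex SHA-256 of a block (the hash choice is an assumption)."""
    return hashlib.sha256(block).hexdigest()


class Target:
    """Stand-in for the off-site system: blocks keyed by their hash."""

    def __init__(self):
        self.blocks = {}

    def missing(self, hashes):
        """Return the subset of `hashes` the target has no block for."""
        return {h for h in hashes if h not in self.blocks}

    def receive(self, blocks):
        """Store the transferred blocks under their hashes."""
        for block in blocks:
            self.blocks[block_hash(block)] = block


def replicate(changed_blocks, target):
    """Send only the changed blocks the target lacks.

    Returns the number of blocks actually transferred.
    """
    # Step 1: hash every block changed since the last snapshot.
    by_hash = {block_hash(b): b for b in changed_blocks}
    # Step 2: send the hashes only; the target answers with what it needs.
    needed = target.missing(by_hash.keys())
    # Step 3: transfer just those blocks.
    target.receive(by_hash[h] for h in needed)
    return len(needed)
```

The saving comes from step 2: hashes are tiny compared with blocks, so anything the target already holds never crosses the slow link.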
that's one instance where the troll's term "open sores" could be used for humorous effect, instead of just grating people...
The LessFS project also deserves mention: http://www.lessfs.com/ . Just think of the effect of combining a deduplication system with an iSCSI shared virtual tape library like http://sites.google.com/site/linuxvtl2/
stoolpigeon asks: Are you taking my deduplication investigation seriously or are you disrespecting my deduplication investigation?
-- minor misquote of LL Cool J speaking to Robin Wright in the movie Toys
Isn't this just an application of 'tokenizing' as it is used in compression of data streams? Build an index of unique (read: non-repetitive) data segments and store the (smaller) index and resulting data?
This has been around for some time....hard to believe that this use has just come to light.
Not every SAN has dedupe; for instance, my HP EVA doesn't. Also, many of the low-end NetApp boxes have processors too anemic to do dedupe. Most of the low-end iSCSI boxes also lack dedupe.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
We're a bit off topic here, seeing as this has nothing to do with file systems, but being off-topic is on-topic for /.
Anyhow: StoreBackup is a great backup system that automatically detects duplicates.
Enjoy life! This is not a dress rehearsal.
just make sure it checks for collisions
http://en.wikipedia.org/wiki/Collision_(computer_science)
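A sketch of what "checking for collisions" could look like, in Python. Nothing here is taken from Opendedup: `VerifyingStore` is a made-up class, and the pluggable `hash_func` exists only so a deliberately weak hash can be swapped in to force a collision. The idea is that a matching hash is treated as a hint, and the bytes are compared before two blocks are declared the same.

```python
import hashlib


class VerifyingStore:
    """Block store that compares bytes on a hash match before deduplicating."""

    def __init__(self, hash_func=None):
        # Default to SHA-256; callers may pass a weaker hash for testing.
        self._hash = hash_func or (lambda b: hashlib.sha256(b).digest())
        # digest -> list of distinct blocks that share that digest
        self._buckets = {}

    def put(self, block: bytes):
        """Store `block` and return a (digest, index) key for reading it back."""
        digest = self._hash(block)
        bucket = self._buckets.setdefault(digest, [])
        for i, existing in enumerate(bucket):
            if existing == block:
                # Same hash and same bytes: a true duplicate, store nothing.
                return digest, i
        # New digest, or same digest with different bytes (a collision):
        # keep the block alongside the others instead of dropping it.
        bucket.append(block)
        return digest, len(bucket) - 1

    def get(self, key):
        """Return the block stored under a key produced by put()."""
        digest, i = key
        return self._buckets[digest][i]
```

The price is an extra read and compare on every hash hit, which is presumably why some systems make this verification optional.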
Too bad it's just another new filesystem. I would have preferred integration into (some future version of) EXTn or BTRFS.
Not only would that mean it becomes more widely available, it also means you don't have to miss all the nice functions of these filesystems. You may even be able to use it out of the box.
.sig: No such file or directory
Another nice OpenSource FS De-Dup project to look into is LESSFS.
Block-level de-dup and good speed. Also offers per block encryption and compression.
I'm using it to back up VMs. 2TB of raw VMs plus 60 days of changes store down to 300GB. Write speed to the de-dup FS is > 50MB/s.
but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2)
Firstly, VMware's hypervisor isn't based on Linux. It is a proprietary kernel. The Service Console in VMware is a custom Linux based on Red Hat, but the hypervisor itself is not Linux.
Secondly, VMware uses its own proprietary VMFS filesystem, which allows multiple physical servers access to the same SAN-attached LUN. It can also use NFS for VM storage. It does NOT support the use of SDFS.
This is why some vendors protect some duplicated VM data ( like the OS ).
And sure, stock DDup is not the be-all and end-all, but it goes a long way toward that goal, and the gains are more than worth the risks.
---- Booth was a patriot ----
I've never used opendedup but I have been using lessfs http://www.lessfs.com/wordpress/ to store backups of virtual servers.
So now we have two choices for open source de-duplication!
'nuff said.
I had 3 backups of home data of about 300 GB each.
Each one was almost but not quite the same due to some rather poor backup policies on my part.
I was able to dedup per backup to get them small enough to combine and dedup the combo.
Left with one pure 150 GB combo. Rsync is amazing.
... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?
Doesn't Lustre use a ZFS-over-FUSE implementation on Linux nodes?
Anyhow, there are decent alternatives for what ZFS provides (nowhere near as comprehensive as ZFS, but workable at least). AFAIK, there is nothing that provides a deduplicated FS, and if this is able to get 150MB/s then that's a good start.
The way de-duplication works is the system maintains a hash table for the file system (usually block level). When it detects that two files have a block in common, it sets a flag that says "this block is common to both of these files".
The entry is essentially an inode entry (linked list) and a reference count.
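A minimal in-memory sketch of that hash table plus reference count, in Python. This is an illustration only, not the SDFS design: `DedupStore`, the fixed 4096-byte block size and the use of SHA-256 are all assumptions, and a matching hash is trusted without comparing the bytes.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store with per-block reference counts."""

    BLOCK_SIZE = 4096  # assumed fixed block size

    def __init__(self):
        self.blocks = {}    # digest -> block bytes, stored once
        self.refcount = {}  # digest -> number of file references
        self.files = {}     # file name -> ordered list of digests

    def write(self, name, data):
        """Store `data` under `name`, reusing blocks that already exist."""
        if name in self.files:
            self.delete(name)  # overwrite: release the old references first
        digests = []
        for off in range(0, len(data), self.BLOCK_SIZE):
            block = data[off:off + self.BLOCK_SIZE]
            d = hashlib.sha256(block).digest()
            if d not in self.blocks:
                # First time this content is seen: store the block itself.
                self.blocks[d] = block
                self.refcount[d] = 0
            # Known or new, the file now holds one more reference to it.
            self.refcount[d] += 1
            digests.append(d)
        self.files[name] = digests

    def read(self, name):
        """Reassemble a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def delete(self, name):
        """Drop a file and free any block that nothing references anymore."""
        for d in self.files.pop(name):
            self.refcount[d] -= 1
            if self.refcount[d] == 0:
                del self.refcount[d]
                del self.blocks[d]
```

Two files that share a leading 4096-byte block end up with three stored blocks instead of four, and deleting one file frees only the block the other does not reference.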
The effort is more commonly used in virtual tape systems, because you will normally have multiple generations of the same tape file. It is also the way that zones (under Solaris) and virtual systems (under AIX) work, since there is generally a certain amount of static data shared between zones.
It does however have implications for common data between web server instances and/or web+(s)ftp instances. If you should need to restore data to a web server instance where dedup is active, the restore is much faster when you only have to actually write a subset of the data back.
It would be well worth it (if you should have a test system) to experiment with the tech. After all, the product is free.
And ye shall know the truth, and the truth shall make you free.
John 8:32(King James Version)
ROTFL! Filesystem in userland with Java. Buhhaha! Only valuable as a PoC. Trading some MB of storage for a huge performance impact doesn't sound like a good trade-off :)
There's a program that automates what you describe, and it's called "RSnapshot":
http://rsnapshot.org/
If you have a system that isn't always up you want something like this to launch it:
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=27;filename=run-rsnapshot;att=1;bug=523923
Installed the Bubblemon yet?
VMware does share memory pages. KSM appears to have that now too, though I haven't read much about it. Unix/Linux uses this very well in multiuser setups, especially in LTSP, where users running the same program share the memory. I don't know if Windows Terminal Server does it nowadays; it didn't when I used it, several versions ago.
Build your own energy sources from scratch. http://otherpower.com/