Open Source Deduplication For Linux With Opendedup
tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."
Data deduplication
( I don't )
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Also, is there an easy way to get multiple machines running 'as one' to pool resources for running a VM setup? Does openMosix do that?
I appreciate any deduplication solution for Linux for sure, but doesn't any deduplication create a lot of shared resources that could possibly be exploited for attacks (e.g. on the privacy of other users)?
Just wondering...
Why is Snark Required?
Yeah, I gave up on bitching about code inefficiency back in the early 90s. Do they even teach assembly any more?
AND it will make sure that all those 60,000 duplicate files no longer take up most of your hard drive space!
From what I can tell, this is NOT a way to deduplicate existing filesystems or even to layer deduplication on top of existing data. It is a new filesystem, operating perhaps like eCryptfs, that stores its backend data on an existing filesystem in some FS-specific format.
So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.
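For what it's worth, the a+b replacement described above can be sketched in a few lines of Python. This is only a minimal sketch, not an existing tool: the function name `hardlink_duplicates` and the `.dedup-tmp` suffix are made up for illustration. It assumes everything under the root lives on one filesystem (hard links cannot cross filesystems) and that you accept the linked files sharing ownership, permissions and timestamps afterwards.

```python
import filecmp
import os
from collections import defaultdict


def hardlink_duplicates(root):
    """Replace byte-identical regular files under root with hard links.

    Returns the number of files that were relinked.
    """
    # Files can only be identical if they have the same size, so group by size first.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    relinked = 0
    for paths in by_size.values():
        keepers = []  # one representative path per distinct content of this size
        for path in sorted(paths):
            for keeper in keepers:
                if os.path.samefile(keeper, path):
                    break  # already the same inode, nothing to do
                if filecmp.cmp(keeper, path, shallow=False):  # full byte comparison
                    tmp = path + ".dedup-tmp"  # hypothetical temporary name
                    os.link(keeper, tmp)       # create the new link first
                    os.replace(tmp, path)      # then atomically swap it in for b
                    relinked += 1
                    break
            else:
                keepers.append(path)
    return relinked
```

Calling `hardlink_duplicates("/some/tree")` relinks in place, so try it on a copy first. Keep in mind that once b is a hard link to a, writing to one changes the other.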
with NexentaStor CE, which is based on OpenSolaris b134. It's free.. and has an excellent Storage WebUI. /plug
For a detailed explanation of OpenSolaris dedup see this blog entry.
~Anil
http://dilemma.gulecha.org - My philosophical short film.
It's neither an acronym nor an abbreviation. Duplication is making copies. De-duplication is getting rid of the copies.
Well, just how repetitive is your porn collection?
Don't thank God, thank a doctor!
Given that most of the disk space is usually swallowed by an application's data, and that data is rarely identical to the data on another system (why would you have two systems then?), I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps, or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.
Just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree; run sha1sum on all files; and identify duplicate files automatically; probably in just one or two lines.
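Not quite one or two lines, but the idea above is short in Python too. A rough sketch, assuming the files have already been extracted into one tree; `find_duplicates` is a made-up name, and SHA-1 is used only because the comment mentions sha1sum.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Group regular files under root by SHA-1 digest; return groups of 2+ paths."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            digest = hashlib.sha1()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
    # Keep only digests that were seen for more than one file.
    return {h: sorted(p) for h, p in groups.items() if len(p) > 1}
```

The result maps each digest to the list of paths sharing it, which is enough to see how much of the tree is duplicated before deciding what to do about it.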
So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...
If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.
And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.
[...] Opendedup runs in user space, making it platform independent, easier to scale and cluster, [...]
... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Consider that things may be spread over more than one SAN or that it is a situation where an old style file server makes better sense anyway.
Can anyone offer wisdom on what the volume size is supposed to signify, being different from the maximum size that SDFS is scalable to?
Good try, but after skimming it, does not seem to apply. Seems to be for deduplicating e-mail attachments.
Very repetitive. Back and forth. Back and forth. Oh wait... that's not what you meant. Never mind.
SIS is frequently implemented in file systems, e-mail server software, data backup and other storage-related solutions.
I stopped being able to read English. WTF does any of that mean? Is it written in moonspeak?
One of the biggest targets for data de-duplication is efficient off-site replication, which you see in the EMC Avamar product line. This is advantageous when your WAN links aren't fast enough for synchronous replication and a scheduled asynchronous replication would take too long. I'd like to see the SDFS storage engine be intelligent enough to snapshot the data and then, when the next "backup/replication" occurs, gather up the hashes of all blocks that have changed since the snapshot was created, communicate those hashes to the off-site system, and transfer just the blocks that don't already have a matching hash on the target. The target system receives a complete hash table update of the snapshot block difference from the source; both systems then merge their snapshots and take a new snapshot to get ready for the next replication cycle.
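To make the wish above concrete, here is a toy in-memory sketch of that replication cycle. It is not how SDFS or Avamar work internally; the `Target` class, `replicate` function and the choice of SHA-256 are all assumptions for illustration, and it ignores shrinking volumes and network transport entirely.

```python
import hashlib


def block_hash(block):
    """Content hash used to identify a block (SHA-256 chosen arbitrarily)."""
    return hashlib.sha256(block).hexdigest()


class Target:
    """Off-site store: blocks keyed by hash, plus the block map of the last snapshot."""

    def __init__(self):
        self.blocks = {}    # hash -> block bytes
        self.snapshot = {}  # block index -> hash

    def missing(self, hashes):
        """Return the subset of hashes the target has no block for."""
        return {h for h in hashes if h not in self.blocks}

    def apply(self, changed_map, new_blocks):
        """Store the transferred blocks and merge the changed block map."""
        self.blocks.update(new_blocks)
        self.snapshot.update(changed_map)


def replicate(source_blocks, last_snapshot, target):
    """Send only blocks the target lacks; return (new snapshot, payload bytes sent)."""
    current = {i: block_hash(b) for i, b in enumerate(source_blocks)}
    # Blocks whose hash differs from the previous snapshot.
    changed = {i: h for i, h in current.items() if last_snapshot.get(i) != h}
    # Ask the target which of those hashes it does not already hold.
    need = target.missing(set(changed.values()))
    payload = {h: source_blocks[i] for i, h in changed.items() if h in need}
    target.apply(changed, payload)
    return current, sum(len(b) for b in payload.values())
```

The point of the hash exchange is that a changed block whose content already exists on the target, even at a different position, costs only a hash on the wire rather than the block itself.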
Which claims apply? I can see no claim that does not reference "information items [...] transferred between a plurality of servers connected on a distributed network". In fact, e-mail attachment dedup is seen as prior art (Background, fourth paragraph). File dedup is simpler than that.
Claim 1(a) requires "dividing an information item into a common portion and a unique portion".
It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.
(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)
The LessFS project also deserves mention: http://www.lessfs.com/ . Just think of the effect of combining a deduplication system with an iSCSI shared virtual tape library like http://sites.google.com/site/linuxvtl2/
Not every SAN has dedupe; for instance, my HP EVA doesn't. Also, many of the low-end NetApp boxes have processors too anemic to do dedupe. Most of the low-end iSCSI boxes also lack dedupe.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
We're a bit off topic here, seeing as this has nothing to do with file systems, but being off-topic is on-topic for /.
Anyhow: StoreBackup is a great backup system that automatically detects duplicates.
Enjoy life! This is not a dress rehearsal.
So, Blade Runner was about de-duplication?
WTF am I doing replying to an AC at 5 A.M on a Friday night?
It seems so, but the ordering was always: physical, partition, filesystem, compression (sometimes with the filesystem integrating compression), and compression was applied to relatively small chunks (blocks).
Now you have the compression layer above the partition layer, which means two identical files on two different partitions will physically occupy the space of one.
So, say, your LAMP server takes up 4GB of generic system plus 1GB of custom data. 1TB of storage could fit 200 partition-files of such a server. Now you'll fit about 995 of them, and it will work faster because the commonly used parts of the FS will be read and buffered once for all instances.
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
Too bad it's just another new filesystem. I would have preferred integration into (some future version of) EXTn or BTRFS.
Not only would that mean it becomes more widely available, it also means you don't have to miss out on all the nice features of these filesystems. You may even be able to use it out of the box.
.sig: No such file or directory
What kind of lame recursive acronym is "deduplication"? I'm flummoxed in any attempt to decipher it.
Deduplication Eases Disk Utilization Purposefully Linking Information Common Among Trusted Independent Operating Nodes
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Another nice OpenSource FS De-Dup project to look into is LESSFS.
Block-level de-dup and good speed. Also offers per block encryption and compression.
I'm using it to back up VMs. 2TB of raw VMs plus 60 days of changes stores down to 300GB. Write speed to the de-dup FS is > 50MB/s.
This is why some vendors protect some duplicated VM data (like the OS).
And sure, stock dedup is not the be-all and end-all, but it goes a long way toward that goal, and the gains are more than worth the risks.
---- Booth was a patriot ----
I prefer SIS (single instance storage) or ASIS (Advanced SIS)
I had 3 backups of home data, about 300 GB each.
Each one was almost but not quite the same, due to some rather poor backup policies on my part.
I was able to dedup each backup to get them small enough to combine, then dedup the combination.
I was left with one clean 150 GB combined copy. Rsync is amazing.
... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?
Doesn't Lustre use a ZFS-over-FUSE implementation on Linux nodes?
Anyhow, there are decent alternatives for what ZFS provides (nowhere near as comprehensive as ZFS, but workable at least). AFAIK, there is nothing that provides a deduplicated FS, and if this is able to get 150MB/s then that's a good start.
The way de-duplication works is the system maintains a hash table for the file system (usually block level). When it detects that two files have a block in common, it sets a flag that says "this block is common to both of these files".
The entry is essentially an inode entry (linked list) and a reference count.
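A toy version of that hash table with reference counts might look like the following. It is a sketch of the general technique only, not SDFS's actual on-disk layout; the `DedupStore` class, its 4096-byte default block size and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


class DedupStore:
    """In-memory block store: each distinct block is kept once, with a reference count."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # hash -> [block bytes, reference count]
        self.files = {}   # file name -> ordered list of block hashes

    def write(self, name, data):
        if name in self.files:
            self.delete(name)  # overwriting releases the old block references
        hashes = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            h = hashlib.sha256(block).hexdigest()
            entry = self.blocks.setdefault(h, [block, 0])
            entry[1] += 1      # one more file position refers to this block
            hashes.append(h)
        self.files[name] = hashes

    def read(self, name):
        return b"".join(self.blocks[h][0] for h in self.files[name])

    def delete(self, name):
        for h in self.files.pop(name):
            entry = self.blocks[h]
            entry[1] -= 1
            if entry[1] == 0:  # no file refers to the block any more
                del self.blocks[h]

    def stored_bytes(self):
        return sum(len(data) for data, _count in self.blocks.values())
```

Two files that share a block cost that block only once in `stored_bytes()`, and the block is freed only when the last file referring to it is deleted, which is what the reference count is for.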
The effort is more commonly used in virtual tape systems, because you will normally have multiple generations of the same tape file. It is also the way that zones (under Solaris) and virtual systems (under AIX) work, since there is generally a certain amount of static data shared between zones.
It does however have implications for common data between web server instances and/or web+(s)ftp instances. If you should need to restore data to a web server instance where dedup is active, the restore is much faster when you only have to actually write a subset of the data back.
It would be well worth it (if you should have a test system) to experiment with the tech. After all, the product is free.
And ye shall know the truth, and the truth shall make you free.
John 8:32(King James Version)
There's a program that automates what you describe, and it's called "rsnapshot":
http://rsnapshot.org/
If you have a system that isn't always up you want something like this to launch it:
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=27;filename=run-rsnapshot;att=1;bug=523923
Installed the Bubblemon yet?
de-duplication
So, Blade Runner was about de-duplication?
It was an early form of it. At the time there was no distinction between 'clones' and 'duplicates'. They mistakenly eliminated clones, while the duplicates escaped because they couldn't be told apart from originals. These days clones have their roles and rights better defined, and can usually survive if they commit no illegal operations in the known social memory. Duplicates, however, are usually found by the Trusted Computing(c) DRM de-duplication techniques. They are rumored to sometimes destroy the originals mistakenly, along with their hosts, though that's safely prevented and handled by the Public Relations (c) BRNWSH technologies. So we have never heard of any such cases.
Build your own energy sources from scratch. http://otherpower.com/
VMware does share memory pages. KSM appears to have that now too, though I haven't read much about it. Unix/Linux uses this very well in multiuser setups, especially in LTSP, where users running the same program share the memory. I don't know if Windows Terminal Server does it nowadays; it didn't when I used it, several versions ago.