Slashdot Mirror


Open Source Deduplication For Linux With Opendedup

tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."

7 of 186 comments

  1. In case you don't know much about it by stoolpigeon · · Score: 5, Informative

    Data deduplication
    ( I don't )

    --
    It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
    1. Re:In case you don't know much about it by MyLongNickName · · Score: 4, Informative

      Data deduplication is huge in virtualized environments. Four virtual servers with identical OSes running on one host server? Deduplicate the data and save a lot of space.

      This is even bigger in the virtualized desktop environment, where you could literally have hundreds of PCs virtualized on the same physical box.

      --
      See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
    2. Re:In case you don't know much about it by zappepcs · · Score: 4, Informative

      In a word, No. There are many types of 'virtualization' and more than one approach to de-duplication. In a system as engineered as one with de-duplication, you should have replication as part of the data integrity processes. If the file is corrupted in all the main copies (everywhere it exists, including backups) then the scenario you describe would be correct. This is true for any individual file that exists on computer systems today. De-duplication strives to reduce the number of copies needed across some defined data 'space' whether that is user space, or server space, or storage space etc.

      This is a problem in many aspects of computing. Imagine you have a business with 50 users. Each must use a web application which has many graphics. Each user's browser cache holds copies of each of those graphics images. When the caches are backed up, the backup is much larger than it needs to be. You can do several things to cut backup times and storage space and to improve quality of service for users:

      1 - disable caching for that site in the browser and cache them on a single server locally located
      2 - disable backing up the browser caches, or back up only one
      3 - enable deduplication in the backup and storage processes
      4 - implement all or several of the above

      The problems are not single-ended, and the solutions will not be single-ended or single-faceted either. That is, no one solution is the answer to all possible problems. This one has some aspects that appeal to certain groups of people. Your average home user might not be able to take advantage of this yet. Small businesses, though, might need to start looking at this type of solution. Think how many people got the same group email message with a 12MB attachment. How many times do all those copies get archived? In just that example you see the waste that duplicated data represents. Solutions such as this offer an affordable way to improve the bottom line by fighting those types of problems.
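
      The attachment example can be sketched as a content-addressed store: each distinct payload is kept once, keyed by its hash, and every mailbox holds only a reference. This is a minimal illustration, not how Opendedup or any particular mail server works. `SingleInstanceStore` and its methods are hypothetical names, and the payload is scaled down from 12MB.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical content-addressed store: one copy per distinct payload."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> payload bytes
        self.refs = {}   # owner name -> list of digests

    def put(self, owner, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Store the bytes only the first time this content is seen.
        self.blobs.setdefault(digest, payload)
        self.refs.setdefault(owner, []).append(digest)
        return digest

    def get(self, digest):
        return self.blobs[digest]

    def logical_bytes(self):
        # Size as the users see it: every reference counts in full.
        return sum(len(self.blobs[d]) for ds in self.refs.values() for d in ds)

    def physical_bytes(self):
        # Size actually held: each distinct payload counts once.
        return sum(len(b) for b in self.blobs.values())


attachment = b"quarterly-report " * 1024  # stand-in for the 12MB attachment
store = SingleInstanceStore()
for user in range(50):
    store.put(f"user{user}", attachment)

print(store.logical_bytes(), store.physical_bytes())
```

      With 50 recipients the logical size is 50 times the physical size, because the payload itself is held only once.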

  2. Hasn't this been posted before? by Required+Snark · · Score: 5, Funny

    Just wondering...

    --
    Why is Snark Required?
  3. Re:How useful is this in realistic scenarios? by mysidia · · Score: 4, Informative

    First of all.... one of the most commonly duplicated blocks is the NUL block, that is a block of data where all bits are 0, corresponding with unused space, or space that was used and then zero'd.

    If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have about 20GB that could be freed up by dedup.
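
    A rough way to see how much of an image is zero blocks is to walk it block by block and compare each block against a run of zero bytes. The sketch below assumes a 4096-byte block size and works on an in-memory toy image; `count_zero_blocks` is a hypothetical helper, not part of any dedup tool.

```python
BLOCK_SIZE = 4096  # assumed block size; real systems vary


def count_zero_blocks(image, block_size=BLOCK_SIZE):
    """Return (zero_blocks, total_blocks) for a bytes-like disk image."""
    zero = bytes(block_size)
    zero_blocks = 0
    total_blocks = 0
    for offset in range(0, len(image), block_size):
        block = image[offset:offset + block_size]
        total_blocks += 1
        # A short final block is compared against a zero run of its own length.
        if block == zero[:len(block)]:
            zero_blocks += 1
    return zero_blocks, total_blocks


# Toy "disk": 3 blocks of data followed by 5 untouched (all-zero) blocks.
image = b"\x01" * (3 * BLOCK_SIZE) + bytes(5 * BLOCK_SIZE)
zero_blocks, total_blocks = count_zero_blocks(image)
print(zero_blocks, total_blocks)
```

    On the toy image this reports 5 zero blocks out of 8. A dedup store would keep one zero block and point the other four at it.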

    Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.

    Even on a single system, many system binaries and libraries will contain duplicate blocks.

    Of course multiple binaries statically linked against the same libraries will have dups.

    But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.

    Then if the system actually contains user data, there is probably duplication within the data.

    For example, mail stores... will commonly have many duplicates.

    One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.

    If users store files on the system, they will commonly make multiple copies of their own files.

    Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc

    Can MS Word files be large enough to matter? Yes.. if you get enough of them.

    Besides, they have common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past. It's supposed to be a very small amount of space per file, but in reality a small amount of waste multiplied by thousands of files adds up.

    Just because data seems to be all different doesn't mean dedup won't help with storage usage.
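
    The draft-files case can be sketched with fixed-size block hashing: split every file into blocks, hash each block, and count how many distinct hashes turn up. This assumes 4096-byte blocks and SHA-256 and says nothing about how SDFS chunks or hashes data; `dedup_stats` and the toy drafts are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def dedup_stats(files, block_size=BLOCK_SIZE):
    """Return (total_blocks, unique_blocks) across an iterable of bytes objects."""
    seen = set()
    total = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            seen.add(hashlib.sha256(block).digest())
            total += 1
    return total, len(seen)


# Three hypothetical drafts: same header and body blocks, one differing block each.
header = b"H" * BLOCK_SIZE
body = b"B" * BLOCK_SIZE
drafts = [
    header + body + b"1" * BLOCK_SIZE,
    header + body + b"2" * BLOCK_SIZE,
    header + body + b"3" * BLOCK_SIZE,
]
total, unique = dedup_stats(drafts)
print(total, unique)
```

    Here 9 blocks collapse to 5 unique ones. Fixed-size blocks only catch duplicates that line up on block boundaries, so an insertion near the start of a file can hide the overlap with its earlier draft.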

  4. Re:deduplication by nacturation · · Score: 4, Funny

    What kind of lame recursive acronym is "deduplication"? I'm flummoxed in any attempt to decipher it.

    Deduplication Eases Disk Utilization Purposefully Linking Information Common Among Trusted Independent Operating Nodes

    --
    Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  5. See Also LESSFS by sharper56 · · Score: 4, Interesting

    Another nice OpenSource FS De-Dup project to look into is LESSFS.

    Block-level de-dup and good speed. Also offers per block encryption and compression.

    I'm using it to back up VMs. 2TB of raw VMs plus 60 days of changes store down to 300GB. Write speed to the de-dup FS is > 50MB/s.
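
    As a back-of-the-envelope check on those figures, the reduction ratio works out as below. It treats 2TB as 2000GB and ignores the 60 days of changes, so it is a lower bound, and since compression is also in play not all of the saving is necessarily from de-dup.

```python
raw_gb = 2 * 1000  # "2TB of raw VMs", taken as decimal terabytes (an assumption)
stored_gb = 300    # reported on-disk size

ratio = raw_gb / stored_gb
savings = 1 - stored_gb / raw_gb
print(f"{ratio:.1f}:1, {savings:.0%} saved")
```

    That is roughly 6.7:1, or about 85% less space than the raw images.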