Open Source Deduplication For Linux With Opendedup

In case you don't know much about it by stoolpigeon · 2010-03-27 15:32 · Score: 5, Informative

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?

Re:In case you don't know much about it by MyLongNickName · 2010-03-27 15:52 · Score: 4, Informative

Data deduplication is huge in virtualized environments. Four virtual servers with identical OS's running on one host server? Deduplicate the data and save a lot of space.
This is even bigger in the virutulized desktop envirornment where you could literally have hundreds of PCs virtualized on the same physical box.

--
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
Re:In case you don't know much about it by Hurricane78 · 2010-03-27 16:22 · Score: 1

Unless you “deduplicate” the CPU work, that’s not going to happen. ^^

--
Any sufficiently advanced intelligence is indistinguishable from stupidity.
Re:In case you don't know much about it by rubycodez · 2010-03-27 16:33 · Score: 2, Informative

hundreds of virtualized desktops per physical server does happen, my employer sells such solutions from several vendors.
Re:In case you don't know much about it by fyoder · 2010-03-27 16:41 · Score: 1

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

--
Loose lips lose spit.
Re:In case you don't know much about it by MyLongNickName · 2010-03-27 16:51 · Score: 3, Informative

If you have a couple hundred people running business apps, it ain't all that difficult. Generally you will get spikes of CPU utilization that last a few seconds mashed between many minutes, or even hours of very low CPU utilization. A powerful server can handle dozens or even hundreds of virtual desktops in this type of environment.

--
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
Re:In case you don't know much about it by zappepcs · 2010-03-27 17:12 · Score: 4, Informative

In a word, No. There are many types of 'virtualization' and more than one approach to de-duplication. In a system as engineered as one with de-duplication, you should have replication as part of the data integrity processes. If the file is corrupted in all the main copies (everywhere it exists, including backups) then the scenario you describe would be correct. This is true for any individual file that exists on computer systems today. De-duplication strives to reduce the number of copies needed across some defined data 'space' whether that is user space, or server space, or storage space etc.
This is a problem in many aspects of computing. Imagine you have a business with 50 users. Each must use a web application which has many graphics. Each user's browser cache has copies of each of those graphics images. When the caches are backed up, the backup is much larger than it needs to be. You can do several things to reduce backup times and storage space and improve user quality of service:
1 - disable caching for that site in the browser and cache them on a single server locally located
2 - disable backing up the browser caches, or back up only one
3 - enable deduplication in the backup and storage processes
4 - implement all or several of the above
The problems are not single ended and the answers or solutions will also not be single ended or faceted. That is, no one solution is the answer to all possible problems. This one has some aspects to it that are appealing to certain groups of people. Your average home user might not be able to take advantage of this yet. Small businesses though might need to start looking at this type of solution. Think how many people got the same group email message with a 12MB attachment. How many times do all those copies get archived? In just that example you see the waste that duplicated data represents. Solutions such as this offer an affordable way to positively affect bottom lines in fighting those types of problems.

--
Support NYCountryLawyer RIAA vs People
Re:In case you don't know much about it by fatp · 2010-03-27 17:52 · Score: 1

It is also huge for Java developers, as every piece of Java software normally installs at least one JDK and JRE
Re:In case you don't know much about it by fatp · 2010-03-27 17:54 · Score: 2, Funny

Oh in fact it requires jdk 7...
Re:In case you don't know much about it by jamesh · 2010-03-27 18:09 · Score: 1

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?
Yes, but not because of deduplication. If you had one sector go bad then yes, you could affect many more VMs if you were using data deduplication than if you weren't, but in my experience, data corruption is seldom just a '1 sector' thing, and once you detect it you should restore anything that uses that disk from a backup that probably was taken before the corruption started (which is tricky... how do you know when that was?)
Bitrot is one of the nastiest failure modes around.
Re:In case you don't know much about it by GNUALMAFUERTE · 2010-03-27 18:48 · Score: 2, Funny

Hey, slow down cowboy. Explain that concept to me again. I don't know if it's applicable here, but if we find a way to implement it, it might just prove revolutionary.
I work in the quality assurance department of Geeknet Inc, Slashdot's parent company. We are constantly looking for ways to improve all the sites on our network.
I don't know if this method you propose, that, if I understand correctly, would involve parsing the content of the html document linked, and having an editor analyze the output of such html document after being rendered (let's call it, reading the story), is at all possible. But if we implement it the right way, it might prove useful.
We'll get our research team to work over this reading-the-story concept. It's something absolutely novel to us, so it might take a while. We'll let you know when we reach a conclusion, so that we might license this reading-the-story technology from you.
Kind Regards,
Lazy Rodriguez
GeekNet INC.

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Re:In case you don't know much about it by drsmithy · 2010-03-27 18:57 · Score: 1

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?
Yes, but a) this is something inherent to anything using shared resources, and b) there's not a lot of scope for such corruption to happen in a decent system (RAID, block-level checksums, etc).
Re:In case you don't know much about it by drsmithy · 2010-03-27 19:01 · Score: 1

Unless you "deduplicate" the CPU work, that's not going to happen. ^^
Sure it does. CPU power is generally the _last_ thing you run out of in virtualised environments, and that's been true for years.
On a modern, Core i7-based server, you should be able to get 10+ "virtual desktops" per core on average, without too much trouble. IOPS and RAM are typically your two biggest limitations.
Re:In case you don't know much about it by ObsessiveMathsFreak · 2010-03-27 20:46 · Score: 1

Deduplicate the data and save a lot of space.
Or just use chroot or something. I don't know.

--
May the Maths Be with you!
Re:In case you don't know much about it by kylegordon · 2010-03-27 22:55 · Score: 1

Put all client side caches and temp directories on a RAM disk. Save backup space and time, reduce your IOPS, and decrease client latency.
Re:In case you don't know much about it by DarkOx · 2010-03-27 23:10 · Score: 2, Informative

It really is hundreds, on a modern nehalem core system with 64 gigs of memory or so. We used to do dozens on each node in a citrix farm back in the PIII days.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:In case you don't know much about it by RAMMS+EIN · 2010-03-27 23:44 · Score: 1

Right.
It's one of the things I never really managed to wrap my head around: why would you want to install many instances of the same OS on the same machine to begin with? Besides using lots of disk space, each instance will also use up memory and redundantly use up resources for updates, background tasks that each OS is running, and basically everything else they have in common.
Sure, you can add on a lot of clever tricks to deduplicate resource usage, but why introduce the duplication in the first place?

--
Please correct me if I got my facts wrong.
Re:In case you don't know much about it by vrmlguy · 2010-03-27 23:53 · Score: 1

Here's another explanation: http://storagezilla.typepad.com/storagezilla/2009/02/unified-storage-file-system-deduplication.html
There's a table about half-way down showing the differences between file-level dedup (elimination of duplicate files), fixed block dedup (elimination of duplicate blocks as stored on the disk, which is what Opendedup is doing), and variable block dedup (which handles non-block aligned data, such as when you insert or delete something at the start of a large file). File level dedup is (almost) drop dead easy, you just take a checksum of every file and link those that match to a single copy. (Handling file updates can be problematic, though. You want your deduped files to be read-only.) Fixed block is almost as easy, since a file is just a list of blocks. You use FUSE to turn those blocks into fixed length files, which are then themselves deduped. This fixes the file-update problem, since each update creates a new block.
Variable block dedup looks for special groups of bytes to divide a file into chunks (like using newlines to divide a text file into lines). These chunks are then deduped as above. If you aren't careful, you can waste space (since the chunks aren't exact multiples of the disk's block size). Random seeks can be harder, since you can't multiply the block number by the block size to find a location.
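
To make the fixed block case concrete, here is a rough Python sketch. It is not how Opendedup is implemented; the BlockStore class, the 4096-byte block size and the in-memory dictionaries are all assumptions made up for illustration. Each file is kept as a list of block digests and each distinct block is stored once.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


class BlockStore:
    """Toy in-memory fixed-block dedup store (hypothetical, not Opendedup's design)."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes, each distinct block kept once
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            # Only store the block if this digest has not been seen before.
            # A real system may also compare bytes to rule out collisions.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # A file is just its list of blocks, concatenated in order.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())
```

With this sketch, writing the same 8 KiB image under two names stores its two blocks once. Prepending a single byte to the data shifts every block boundary, so none of the old blocks match any more; that is the case variable block dedup is meant to handle.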

--
Nothing for 6-digit uids?
Re:In case you don't know much about it by Degrees · 2010-03-28 04:27 · Score: 2, Interesting

It is one of those things that once you start using it, the benefits become apparent.
Here are some:
1) One application on one machine. No more wondering if application X has somehow messed up application Y. The writers of the software probably developed the application in a clean environment, and this lets you run it in a clean environment. Gets rid of vendor finger-pointing, too.
2) One application on one machine. If application X fouls the nest, you can reboot it and know that you are not also terminating applications Y, Z, A, and B.
3) Machine portability. The drivers in a VM guest are generic -and- uniform. Nothing inside the (guest) machine changes if you move the machine from a host with an Intel NIC to a host with a Broadcom NIC. The benefit here is that when hardware fails (and it will), it is pretty quick and easy to assign the boot disk to a different host, and boot the machine up. Think 10 - 30 minutes (per machine) to recover from a burned up power supply*.
4) Machine portability. There are some solutions that let you auto-fail-over to a new host when the guest stops responding. That burned up power supply could now be a two minute outage and NO emergency notification call.
5) Machine portability. Platespin lets you auto-migrate machines on a schedule to a few blades at night, power down those blades for power savings, and then power them up a little before business hours and migrate back. In a large data center, the electricity savings is enough to make it worth it.
6) Machine flexibility. Does application X not need much in the way of processing power? With the VM manager software, assign it one CPU and 256 MB RAM. Later find out that wasn't enough? Up the specs and reboot.
7) Reboot speed. In paravirtualized environments, the OS is already loaded in the host VM, so the guest VM just links and loads. I've seen entire machine reboots that take 16 seconds.
Along these lines, an anecdote from my life: How to add RAM to a server so nobody notices: virtualize
Hope this helps explain why some people are such a fan of virtualization.
*This is really a benefit that comes from disconnecting the machine from its disks, but VM and SAN go exceptionally well together.

--
"The most sensible request of government we make is not, "Do something!" But "Quit it!"
Re:In case you don't know much about it by Eil · 2010-03-28 04:40 · Score: 2, Interesting

Almost every mission critical system these days is running in either a clustered or virtualized environment. I work in the financial services industry and there are many reasons we virtualize pretty much everything these days. These, however, are probably the biggies:
- Redundancy: If a physical machine dies, its virtual machines can be moved over to a spare, often with no interruption in service.
- Isolation: Just because you can run multiple services on a box doesn't mean you should. It poses potential security problems (one compromised app can open the door to compromise another), makes managing users and resources more difficult, and the applications can interact or conflict in unexpected ways. Many vendors demand that their application be the only one running on a machine or they won't support it.
- Portability: An OS configured for use on a virtual machine can be run on any platform which runs the virtual machine without modification.
Re:In case you don't know much about it by b4dc0d3r · 2010-03-28 04:41 · Score: 1

Before anyone gets all crazy here, be warned. You don't just checksum every file - you might use those checksums to find collisions, and then compare bytes to ensure the files are actually the same. There are already MD5 collision creation methods, many years old, so you should assume any checksum or hash can be manipulated and check the bytes before removing copies.
Then you don't just delete files and make links to one file. You have to let the filesystem present this single file to the operating system as multiple files.
The filesystem handles updates, so it can decide whether to unlink and effectively branch an updated file in that location. Read-only is a kludge for inadequate condition handling. You could hopefully mark files as "unbranchable" so that updates happen once and propagate everywhere, for easier patching.
And finally, keep in mind simple ideas like when you right-click in Windows Explorer, choose New, and then one of the document templates. It copies the default template from your user (or "all users") template location, creates a duplicate copy in the location you clicked, enters a new row in the filesystem named for example "New Microsoft Office Word Document.docx" and highlights the name so you can change it.
In this scenario, the de-duplication would delay making a copy until the file is opened for writing/updating/appending, and then actually gets a write/append/update. At the same time, the filesystem has the potential for a filename duplication and may temporarily re-use parts of a directory entry until the new name is chosen.
This simple process is potentially a CPU and disk IO intensive chore, as a template copy gets hashed and compared to all files on the disk just to find that it's an exact copy from its source file. The GUI knew that much, so the filesystem should be able to skip that. Then the file is intended to be updated, so the entire process was unnecessary, unless the user gets distracted and leaves the new file there, so you want it de-duped.
For performance reasons this has to be aware of the intent of the files, or else it does a whole lot of nothing.
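
A minimal sketch of the "hash to find candidates, then compare bytes" step described at the top of this comment, in Python. The function names are made up for the example, SHA-256 is just one possible hash, and it only reports duplicate groups; it does not delete or link anything.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are not loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-for-byte identical."""
    by_key = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_key[(os.path.getsize(path), file_digest(path))].append(path)

    groups = []
    for candidates in by_key.values():
        # A matching size and hash is only a hint; confirm by comparing bytes.
        remaining = sorted(candidates)
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

This hashes every file up front, which is exactly the CPU and disk cost complained about above; a smarter version would hash only files whose sizes collide.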
Re:In case you don't know much about it by Degrees · 2010-03-28 05:11 · Score: 1

dang it, have an error: "power down those blades for power savings" should be "power down the empty blades for power savings". Where's the 'edit my post button'?

--
"The most sensible request of government we make is not, "Do something!" But "Quit it!"
Re:In case you don't know much about it by martin-boundary · 2010-03-28 10:10 · Score: 1

This is even bigger in the virutulized desktop envirornment where you could literally have hundreds of PCs virtualized on the same physical box.

Vurutuluzud vowols ara alsa big in slashdat commonts to redece the lod on the server.

This is for hard disks by ZERO1ZERO · 2010-03-27 15:36 · Score: 2, Interesting

Does software like ESX and others (Xen etc) perform this in memory already for running VMs? I.e. if you have 2 Windows VMs it will only store one copy of the libs etc in the hosts memory ?

Also, is there easy way to get multiple machines running 'as one' to pool resources for running a vm setup? Does openmosix do that?

Re:This is for hard disks by TooMuchToDo · 2010-03-27 15:55 · Score: 1

Both VMware and KVM can do this. Not sure about Xen. Google "memory deduplication $VM_TECH"
Re:This is for hard disks by fatp · 2010-03-27 17:49 · Score: 2, Funny

I really googled "memory deduplication $VM_TECH"... It returned this post as the only result

what an idiot I am T.T
Re:This is for hard disks by Island+Admin · 2010-03-27 22:14 · Score: 2, Funny

Go to your browser preferences - uncheck "enable Great Firewall of China". ;)
Re:This is for hard disks by DarkOx · 2010-03-27 23:28 · Score: 1

Does software like ESX and others (Xen etc) perform this in memory already for running VMs? I.e. if you have 2 Windows VMs it will only store one copy of the libs etc in the hosts memory ?

I don't know about Xen but VMWare will do that.

is there easy way to get multiple machines running 'as one' to pool resources for running a vm setup? Does openmosix do that?
I am not entirely certain what you mean by 'as one' to pool resources. Openmosix more or less is a load distributor that dispatches jobs across hosts. I am not sure what advantage you would gain by virtualizing the hosts other than granularity.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:This is for hard disks by dchaffey · 2010-03-28 21:30 · Score: 1

VMware's technology you are referring to is called Transparent Page Sharing, if you want to look it up.
To my knowledge they're the only major Hypervisor to have this for Windows VMs, and it is a huge contributor to their VM density leadership; I'm not sure if other Linux-based Hypervisors implement something for Linux VMs

A hypothetical question. by drolli · 2010-03-27 15:37 · Score: 1

I appreciate any deduplication solution for Linux for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?

Re:A hypothetical question. by symbolset · 2010-03-27 16:09 · Score: 1

Opendedup is file-based deduplication, much like Microsoft's Single Instance Storage. If I recall correctly there was a security problem with that some time ago, but I don't know if it was fixed.

--
Help stamp out iliturcy.
Re:A hypothetical question. by tlhIngan · 2010-03-27 16:25 · Score: 2, Interesting

I appreciate any deduplication solution for Linux for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?
Most likely in the implementation itself, not the de-duplication process.
Let's say user A and B have some file in common. Without de-duplication, the file exists on both home directories. With de-duplication, one copy of the file exists for both users. Now, if there is an exploit such that you could find out if this has happened, then user A or B will know that the other has a copy of the same file. That knowledge could be useful.
Ditto on critical system files - if you could generate a file and have it match a protected system file, this might be useful to exploit the system. E.g., /etc/shadow (which isn't normally world-readable). If you can find a way to tell the deduplication happens, you can get access to these critical files for other purposes.
Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge. But the de-duplication mechanism must inadvertently reveal when this happens.
Re:A hypothetical question. by drolli · 2010-03-27 17:23 · Score: 1

Yes, that was the thing I had in mind. I imagined that you can make timing measurements. So for example two isolated VMs running on the same physical dedup fs can exchange information (unless the underlying OS intentionally delays the return from the call). I actually think you can run programs creating a lot of specially crafted file contents in two VMs, thus circumventing networking restrictions.
Re:A hypothetical question. by drsmithy · 2010-03-27 19:03 · Score: 1

Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge.
If you can "generate a file" that can be deduplicated, then by definition you already know about the data in that file.
Re:A hypothetical question. by GNUALMAFUERTE · 2010-03-27 19:03 · Score: 1

Leaving aside vulnerabilities on any particular implementation, the only possible attack vector I see would be a bruteforce approach. Basically, a user in one VM creates n-byte files covering every possible combination of bytes of that size (of course, this would only be feasible for very small files, but /etc/shadow is usually small enough, and so is everything on $HOME/.ssh/). Eventually, the user would create a file that would match a copy on another VM. Of course, this would be useless without a way to check if another file was matched and deduplication took place. If the deduplication solution has any virtual guest software (like vmware tools), and that tool shares this kind of information with other systems, it might be possible, but that's a big might.
Any reasonably implemented deduplication solution should be 100% transparent to the guest, and very secure.
And, to all the people talking about "shared resources", deduplication doesn't create "shared resources". Deduplication is not similar to symbolic links (ln -s). If you want to compare it to links, you have to compare it to hard links, and that would be hard links that automatically dereferenced and created a new copy of the file with all the blocks as soon as the user wanted to write to that file. Remember, as soon as the file changes on any given guest, the information is not the same anymore, and so that file is not de-duplicated anymore. A user can change his copy of the file, not other people's files.

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Re:A hypothetical question. by GNUALMAFUERTE · 2010-03-27 19:05 · Score: 1

It's had a vulnerability because microsoft made it. Vulnerabilities are their signature.
And, as I explained before, it was a microsoft product (which means it wasn't fixed).

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Re:A hypothetical question. by amorsen · 2010-03-27 20:52 · Score: 1

Covert channels are fairly easy to achieve in a virtualized setup, particularly if you oversubscribe -- and if you don't oversubscribe you generally gain nothing from virtualization. Allocating physical CPU's, memory, network interfaces, and disks for each virtual server is impractical. Therefore I don't think the covert channel attack is much of a threat.
Detecting whether a particular file exists on other machines is interesting though. You can do that with Arkeia (deduplicating backup) I believe, by creating a particular file and checking how much data is actually sent across the network for that backup. If it's less than the compressed size of the file, then someone else on the same backup server has the same file...

--
Finally! A year of moderation! Ready for 2019?
Re:A hypothetical question. by amorsen · 2010-03-27 20:59 · Score: 1

of course, this would only be feasible for very small files, but /etc/shadow is usually small enough,
/etc/shadow is typically >1kB, which is 2^(1000*8) possibilities. A stupid brute force approach isn't going to work. If you can be sure which users exist in the file in which order, and root is the only one with a password, then maybe, but I doubt you could get it fast enough even in that case. If it turns out to be a threat we just need to increase the salt size.
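
For a sense of scale, a quick Python check of that 2^(1000*8) figure. The helper name is made up for this example; it only counts how many decimal digits the number of possible files has.

```python
import math


def brute_force_digits(num_bytes):
    """Decimal digits in the number of distinct files of num_bytes bytes (2**(8*num_bytes))."""
    return math.floor(num_bytes * 8 * math.log10(2)) + 1


# A 1000-byte file has 2**8000 possible contents, a number 2409 digits long.
digits_for_1kb = brute_force_digits(1000)
```

Even at a wildly optimistic trillion guesses per second, that only takes about 20 digits off the exponent over a year.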

--
Finally! A year of moderation! Ready for 2019?
Re:A hypothetical question. by kitgerrits · 2010-03-27 22:43 · Score: 1

Deduplication often relies on copy-on-write to maintain separate versions after deduplication.
Once a block is deduplicated between users A, B and C into block Z and user B changes his file, the filesystem will record the change and point user B to a new block instead of Z.
Other security issues (permissions) should be handled by the filesystem table, not the physical file.
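
A toy Python sketch of that copy-on-write behaviour, for anyone who wants to see it spelled out. The CowStore class and its fields are hypothetical and exist only for this example; a real filesystem tracks this per block on disk, not per owner in a dictionary.

```python
class CowStore:
    """Toy copy-on-write store; names and layout are illustrative only."""

    def __init__(self):
        self.blocks = {}    # block id -> bytes
        self.refcount = {}  # block id -> number of owners pointing at it
        self.owner = {}     # owner name -> block id
        self._next_id = 0

    def _new_block(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.refcount[block_id] = 0
        return block_id

    def share(self, owners, data):
        """Store data once and point every owner at the same block."""
        block_id = self._new_block(data)
        for name in owners:
            self.owner[name] = block_id
            self.refcount[block_id] += 1
        return block_id

    def read(self, name):
        return self.blocks[self.owner[name]]

    def write(self, name, data):
        old = self.owner[name]
        if self.refcount[old] == 1:
            self.blocks[old] = data  # sole owner: safe to update in place
            return
        # Shared block: leave it untouched and give the writer a private copy.
        self.refcount[old] -= 1
        new = self._new_block(data)
        self.owner[name] = new
        self.refcount[new] = 1
```

After share(["A", "B", "C"], ...) and a write by B, users A and C still read the original block while B reads its own. The sketch does not try to re-deduplicate B's new block against existing ones.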

--
"I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
Re:A hypothetical question. by DarkOx · 2010-03-27 23:36 · Score: 1

Maybe not, you might be able to fool the dedupe engine with a hash collision, and get it to turn your file full of gobbledygook into the actual file contents. I agree though you would need to know an awful lot about the file to pull that off: size, hash of whatever type the dedupe uses, time stamps.
Some of that you might be able to control yourself, like atime, through other access, but I don't know how you'd get the rest (thinking about the GP's example of /etc/shadow).

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:A hypothetical question. by GNUALMAFUERTE · 2010-03-28 07:25 · Score: 1

I agree. My point was precisely showing the other people saying that the system could be exploited how hard and improbable an exploit like that would be.

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Re:A hypothetical question. by tlhIngan · 2010-03-29 03:19 · Score: 1

of course, this would only be feasible for very small files, but /etc/shadow is usually small enough,
/etc/shadow is typically >1kB, which is 2^(1000*8) possibilities. A stupid brute force approach isn't going to work. If you can be sure which users exist in the file in which order, and root is the only one with a password, then maybe, but I doubt you could get it fast enough even in that case. If it turns out to be a threat we just need to increase the salt size.
But you can make simplifying assumptions. Firstly, /etc/shadow has a fixed structure so it can be parsed. Secondly, you know the usernames on that list (hey look, /etc/passwd *is* world-readable!). All you have left is to guess the hashes. On a many user system, that's hard, but on a single user system it's a lot easier - maybe a root hash and a user hash. Those other daemon users typically are non-logon and have * for a password. And unsurprisingly, the order in /etc/shadow tends to mirror that in /etc/passwd - new users added to the end of the file. Old users are deleted, which shifts the line up one.
Dumb brute force may not work too well, but a smarter one might. And a timing attack that measures this might be able to detect it. To counteract it, an easy solution is to randomize user order (hard), or simply add a random amount of whitespace to the end of each line, or comments, or other such things that are ignored by the tools.
Re:A hypothetical question. by amorsen · 2010-03-29 05:12 · Score: 1

You're forgetting about the password salt. If this attack becomes practical, all you need to do is to use a longer salt. You get exponential attack complexity for a linear amount of space.
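
A back-of-the-envelope version in Python, assuming the attacker has to reproduce the exact shadow line, salt included, and that salt characters come from the usual 64-character crypt(3) alphabet. The function name and the candidate counts are made up for the example.

```python
SALT_ALPHABET_SIZE = 64  # crypt(3) salts draw from [a-zA-Z0-9./]


def guesses_needed(password_candidates, salt_chars):
    """Worst-case guesses when every candidate password must be tried with every salt."""
    return password_candidates * SALT_ALPHABET_SIZE ** salt_chars


# Each extra salt character costs one byte of storage
# and multiplies the attacker's work by 64.
short_salt = guesses_needed(10**6, 2)
long_salt = guesses_needed(10**6, 16)
```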

--
Finally! A year of moderation! Ready for 2019?

Hasn't this been posted before? by Required+Snark · 2010-03-27 15:40 · Score: 5, Funny

Just wondering...

--
Why is Snark Required?

Re:Hasn't this been posted before? by Anonymous Coward · 2010-03-27 15:54 · Score: 1, Funny

If so, how about we just reference that post?
Re:Hasn't this been posted before? by Hurricane78 · 2010-03-27 16:25 · Score: 1

Well, at least this comment has been posted before.
Dude, you’re only piling it up. Like with trolling: If you react to it, you only make it worse.
And because I’m not better, I’m now gonna end it, by stating that: yes, yes, I’m also not making it better. ^^
Oh wait... now I am! :)

--
Any sufficiently advanced intelligence is indistinguishable from stupidity.

Re:Excellent! by jtownatpunk.net · 2010-03-27 16:02 · Score: 1, Redundant

Yeah, I gave up on bitching about code inefficiency back in the early 90s. Do they even teach assembly any more?

Re:Let's get down to brass tacks. by nystire · 2010-03-27 16:02 · Score: 1

AND it will make sure that all those 60,000 duplicate files no longer take up most of your hard drive space!

Yea, I RTFA, but... by mrsteveman1 · 2010-03-27 16:10 · Score: 2, Interesting

......from what i can tell, this is NOT a way to deduplicate existing filesystems or even layer it on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.

So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.

Re:Yea, I RTFA, but... by Aluvus · 2010-03-27 16:23 · Score: 1

FSlint's "merge" option will do what you want.

--
Never mistake "can" for "should".
Re:Yea, I RTFA, but... by dlgeek · 2010-03-27 16:23 · Score: 3, Informative

You could easily write a script to do that using find, sha1sum or md5sum, sort and link. It would probably only take about 5-10 minutes to write but you most likely don't want to do that. When you modify one item in a hard linked pair, the other one is edited as well, whereas a copy doesn't do this. Unless you are sure your data is immutable, this will lead to problems down the road.

Deduplication systems pay attention to this and maintain independent indexes to do copy-on-write and the like to preserve the independence of each reference.
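
Here is a rough Python sketch of the hard-link step and of the problem with it. The function names and the ".dedup-tmp" suffix are invented for this example, and it assumes the caller has already verified that the two files are byte-identical and on the same filesystem.

```python
import os


def replace_with_hardlink(keep, duplicate):
    """Make `duplicate` a hard link to `keep`.

    Assumes both files are byte-identical and on the same filesystem.
    """
    tmp = duplicate + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, tmp)              # create a second name for keep's inode
    os.replace(tmp, duplicate)      # atomically swap it over the duplicate


def shares_inode(path_a, path_b):
    """True if both paths are names for the same underlying file."""
    sa, sb = os.stat(path_a), os.stat(path_b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```

After the swap, anything that rewrites one name in place changes what you read through the other name too, since both point at the same inode. Note also that the surviving file keeps the owner and permissions of `keep`, so metadata that differed between the two copies is lost.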
Re:Yea, I RTFA, but... by Lorens · 2010-03-27 16:28 · Score: 2, Interesting

I wrote fileuniq (http://sourceforge.net/projects/fileuniq/) exactly for this reason. You can symlink or hardlink, decide how identical a file must be (timestamp, uid...), or delete.
It's far from optimized, but I accept patches :-)
Re:Yea, I RTFA, but... by symbolset · 2010-03-27 16:31 · Score: 1

There are security problems with this. The duplicate files might have different metadata - for example, access privileges.
For real (block level) deduplication, try lessfs or zfs.

--
Help stamp out iliturcy.
Re:Yea, I RTFA, but... by mrsteveman1 · 2010-03-27 17:00 · Score: 1

Sweet! Thanks a lot :)
Re:Yea, I RTFA, but... by mrsteveman1 · 2010-03-27 17:01 · Score: 1

That can be managed for simple use cases, but yea i see your point.
Re:Yea, I RTFA, but... by mrsteveman1 · 2010-03-27 17:02 · Score: 1

Hmm, yea i've used FSLint but I didn't pay close enough attention to the options it seems :)
Thanks
Re:Yea, I RTFA, but... by mrsteveman1 · 2010-03-27 17:04 · Score: 1

If I couldn't find a good tool from responses here I would have written one for sure.
Re:Yea, I RTFA, but... by TarpaKungs · 2010-03-27 21:15 · Score: 1

FSLint is very good. http://www.pixelbeat.org/fslint/

--
Why can't women be like Hedy Lamarr - beautiful, talented and inventors of frequency-hopping spread-spectrum techn
Re:Yea, I RTFA, but... by Ant+P. · 2010-03-28 02:18 · Score: 1

http://code.google.com/p/hardlinkpy/
Re:Yea, I RTFA, but... by jc42 · 2010-03-29 14:59 · Score: 1

So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links?
Yeah; I was a bit disappointed to find that the "dedupe" software talked about here doesn't seem to do that. The intent here seems to be to handle editing one of the "dupes" by splitting it apart into a new file, so that the others don't change. This is pretty much the opposite of what I find that I usually want.
Actually, I've written a couple of programs (in different languages) to do linking of identical files for some time. One is about 25 years old, and arose in a project where we were having a lot of problems with software that "broke" hard links when changes were made to a file. This was shooting down our use of multiply-linked files to classify files in multiple ways by linking them into several appropriate directories. So we worked on software to hunt down the problems and fix them. Since we conceptualized the problem as "broken links", we called our operation "relinking". We coded up several algorithms to do the job, and pitted them against each other. We were a bit bemused to find that there wasn't really that much difference between them. ;-) I kept a couple of them.
Anyway, I see that a few others have written similar tools. The problem finding them seems to be the different terminology that different developers have used. The "merge" term makes sense if you think about some other reasons you might want to do it.
Does anyone have any knowledge of other terms that might be used to google for such software? It might be interesting to find out how many times people have reinvented this particular wheel under different names.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Re:Yea, I RTFA, but... by rawler · 2010-03-30 08:32 · Score: 1

There is one danger with hardlinks that should not be forgotten. Hardlinks are not copy-on-write (and AFAIK, can't be made COW?), which means that if files get linked in the de-duplication-process, updates to either file will contaminate the other.
A practical example where this WOULD be a definite problem could be a double-buffered application that, for consistency, always keeps a "backup" of its config. During idle, this file could be identical to the "live" file, and hard-linking them could completely destroy the consistency feature of the app.
Another scenario would be having a file on your desktop of some family photo you want to mess around with, also in archive. Hardlink them, and editing the one on the desktop will overwrite the one in the archive. (Under some conditions, i.e. when the editing app writes in place rather than saving to a new file and moving it over the old one.)
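The two cases can be reproduced in a few lines of Python. This is only an illustration with made-up file names: an in-place write goes through the shared inode and changes both names, while writing a new file and renaming it over the old one gives that name a fresh inode and leaves the other link alone.

```python
import os
import tempfile


def demo_hardlink_write_through():
    """Show how edits affect the second name of a hard-linked file.

    Returns the backup's contents after an in-place edit of the live
    file, and again after a write-to-temp-then-rename edit.
    """
    with tempfile.TemporaryDirectory() as d:
        live = os.path.join(d, "config")        # hypothetical names
        backup = os.path.join(d, "config.bak")
        with open(live, "w") as f:
            f.write("v1")
        os.link(live, backup)  # what a hard-linking "dedup" would do

        # In-place edit: truncates and rewrites the shared inode.
        with open(live, "w") as f:
            f.write("v2")
        with open(backup) as f:
            after_in_place = f.read()

        # Save-and-rename edit: 'live' now points at a new inode.
        tmp = live + ".tmp"
        with open(tmp, "w") as f:
            f.write("v3")
        os.replace(tmp, live)
        with open(backup) as f:
            after_rename = f.read()

        return after_in_place, after_rename
```

Calling demo_hardlink_write_through() returns ("v2", "v2"): the in-place edit leaked into the backup, and the rename-style edit did not.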

Or get inline deduplication by anilg · 2010-03-27 16:11 · Score: 1

with NexentaStor CE, which is based on OpenSolaris b134. It's free.. and has an excellent Storage WebUI. /plug

For a detailed explanation of OpenSolaris dedup see this blog entry.

~Anil

--
http://dilemma.gulecha.org - My philospohical short film.

Re:Or get inline deduplication by anilg · 2010-03-27 16:16 · Score: 1

Grr.. meant inline/kernel dedup.

--
http://dilemma.gulecha.org - My philospohical short film.
Re:Or get inline deduplication by mrsteveman1 · 2010-03-27 16:16 · Score: 1

Plus you get the "real" ZFS, zones, and tightly integrated, bootable system rollbacks using zfs clones :)
Re:Or get inline deduplication by itsme1234 · 2010-03-27 20:05 · Score: 1

Plus you get the "real" ZFS, zones, and tightly integrated, bootable system rollbacks using zfs clones :)
Plus you get the "real" opensolaris experience:
- poor (like really really poor) hardware compatibility. Starting with basic stuff, many on-board Ethernet controllers with flaky or no support, very hard to choose a motherboard that's available and without too many compromises and fully supported. A guy asked if Android pairing is available (to use phone as modem for OpenSolaris), made me spill my coffee...
- doubtful future
- no security patches (yes, you read that right)
- major features like ZFS encryption slipping for years (in the works since 2008; the last promise was the 2010.2 release, which itself slipped to 2010.3, and that one seems to be delayed as well, since it was supposed to be released on the 26th; in any case it's fairly certain that encryption won't make it anyway)
Thanks, but no thanks.
Re:Or get inline deduplication by anilg · 2010-03-27 20:52 · Score: 1

Hardware compatibility is pretty good. Really. All decent brands (storage controller/NICs) support opensolaris. Doubtful future part is FUD. Oracle made it clear OpenSolaris development, community functions will continue as is. The security patches costing $$ is not for opensolaris, but enterprise Solaris. Encryption is late.. big deal.. some things are set to low priority over others. Dedup is present, and works very well.
If it's a storage box you're looking at.. what's really important? An in-kernel, established, and widely-deployed filesystem like ZFS (without support for android phones), or a new, user-space dedup filesystem, nascent and not in production (but it can pair with your android phone!).
~Anil

--
http://dilemma.gulecha.org - My philospohical short film.
Re:Or get inline deduplication by tbuskey · 2010-03-28 10:46 · Score: 1

ZFS dedupe in OpenSolaris is also Open Source.
I've gotten 11% dedup savings on 1.04 TB of a 1.82 TB volume.
Add compression savings and ECC (so bad bits don't happen silently).
I'm hoping it will be in btrfs so Linux will have it.

Re:deduplication by deniable · 2010-03-27 16:14 · Score: 1

It's neither an acronym nor an abbreviation. Duplication is making copies. De-duplication is getting rid of the copies.

Re:Let's get down to brass tacks. by SanityInAnarchy · 2010-03-27 16:16 · Score: 1

Well, just how repetitive is your porn collection?

--
Don't thank God, thank a doctor!

How useful is this in realistic scenarios? by marvin2k · 2010-03-27 16:17 · Score: 1

Given that usually most of the disk space is swallowed by the data of an application and that data rarely is identical to the data on another system (why would you have two systems then?) I wonder how much this approach really buys you in "normal" scenarios especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference but can someone enlighten me how this is useful on say a physical system with 10 Centos VMs running different apps or similar apps with different data? You might save a few blocks because of the shared OS files but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.

Re:How useful is this in realistic scenarios? by dlgeek · 2010-03-27 16:31 · Score: 1
It sounds to me like you have a very narrow view of what constitutes "realistic scenarios".
- A high-availability mail system that has multiple servers handling client mail storage. VMs are used for rapid failover in the case of hardware failure. Sounds pretty realistic to me. Deduplication is extremely helpful when there are many copies of the same attachment as many users forward it around.
- A large set of VMs which are used for testing the software you develop with a variety of possible end-user configurations. Sounds pretty realistic to me. Deduplication is extremely helpful to save space storing the base OS libraries and such.
- You have a server (or set of servers) which is/are responsible for backing up a large number of other computers. Sounds pretty realistic to me. Deduplication is extremely helpful when these computers have files that are identical. (Hell, deduplication can make it much easier to do incremental backups of a single computer).
These all sound very realistic to me...
Re:How useful is this in realistic scenarios? by Lorens · 2010-03-27 16:38 · Score: 1

A major use case is NAS for users. Think of all those multi-megabyte files, stored individually by thousands of users.
However, normally deduplication is block level, under the filesystem, invisible to the user. This is implemented by NetApp SANs, for instance. After having RTFA, OpenDedup seems to be file-level, running between the user and an underlying file system. I'm not sure it's a good idea.
Re:How useful is this in realistic scenarios? by snikulin · 2010-03-27 16:45 · Score: 1

Well, a really good and useful "home" scenario is a system backup of multiple computers with the same OS.
OS itself plus common software takes at least 20-30 GB per installation these days.
My WHS (which does support de-dup in form of Single-instance storage) server keeps full backup (3-months worth) of my seven Windows home computers on about 60 GB.
Unfortunately SIS does not work for WHS shared folders, so my two Linux machines' (my version control & gallery servers) rsync backups over SMB are not de-duplicated by WHS.
I could probably save only /etc, /var and /srv of each server, but so far I backup everything.
Re:How useful is this in realistic scenarios? by QuantumRiff · 2010-03-27 17:19 · Score: 3, Informative

Say you cut up a large file into lots of chunks of whatever size, let's say 64KB each. Then you look at the chunks. If you have two chunks that are the same, you remove the second one and just place a pointer to the first one. Data deduplication is much more complicated than that in real life, but basically, the more data you have, or the smaller the chunks you look at, the more likely you are to find duplicate chunks. (How many Word documents have a few words in a row in common? Remove every repeat of the phrase "and then the" and replace it with a pointer, if you will.)
This is also similar to WAN acceleration, which at a high enough level is just deduplicating traffic that the network would have to transmit.
It is amazing how much space you can free up when you're not just looking at the file level. This has become very big in recent years, because storage has exploded and processors are finally fast enough to do this in real time.
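A toy Python version of the chunk-and-pointer idea, assuming fixed 64KB chunks and using a SHA-256 digest as the "pointer". The names are invented for illustration, and real systems add a lot on top of this (persistent indexes, reference counting, often variable-sized chunks).

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB, as in the example above


def dedup_store(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and keep each unique chunk once.

    Returns (store, recipe): store maps digest -> chunk bytes, and
    recipe is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)  # a repeated chunk is not stored again
        recipe.append(key)            # ...only a pointer to it is recorded
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes by following the pointers in order."""
    return b"".join(store[key] for key in recipe)
```

For example, three identical 64KB chunks followed by one different chunk give a recipe of four pointers but a store holding only two chunks.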

--

What are we going to do tonight Brain?
Re:How useful is this in realistic scenarios? by jdoverholt · 2010-03-27 17:31 · Score: 1

If you look at the sales materials from any of the big vendors (EMC, I'm looking at you), even a single system image shows reduction in size through block-level deduplication--even more through variable-sized blocks. I can't recall the exact numbers, I'm at the end of a terribly long week, but I think it was somewhere around 10-30% reduction in the day-0 backup size. Subsequent days typically see a >95% reduction.

All sales literature, mind you. My personal experience with it will begin in a few months, when we get our new Celerra installed :-)

P.S. Remember that a project such as this is good because it offers high-dollar features to low-dollar players who enjoy tinkering in their basements. Such was the goal of Linux in the first place. It's how, on a three-figure budget, a dedicated nerd can set up a several-terabyte file server with software RAID-6 protection and (soon) data deduplication--stuff you'd pay EMC 100-1000 times as much for.
Re:How useful is this in realistic scenarios? by mysidia · 2010-03-27 18:03 · Score: 4, Informative

First of all.... one of the most commonly duplicated blocks is the NUL block, that is a block of data where all bits are 0, corresponding with unused space, or space that was used and then zero'd.
If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have roughly 20GB that could be freed up by dedup.
Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.
Even on a single system, many system binaries and libraries, will contain duplicate blocks.
Of course multiple binaries statically linked against the same libraries will have dups.
But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.
Then if the system actually contains user data, there is probably duplication within the data.
For example, mail stores... will commonly have many duplicates.
One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.
If users store files on the system, they will commonly make multiple copies of their own files..
Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc
Can MS Word files be large enough to matter? Yes.. if you get enough of them.
Besides they have common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality: a small amount of waste multiplied by thousands of files, adds up.
Just because data seems to be all different doesn't mean dedup won't help with storage usage.
Re:How useful is this in realistic scenarios? by Spad · 2010-03-27 18:37 · Score: 1

All good dedupe systems are block-level, not file-level so you don't just save where whole files are identical but on *any* identical data that's on the disks.
If you're running VMs with the same OS you'll probably find that close to 70% of the data can be de-duplicated - and that's before you consider things like farms of clustered servers where you have literally identical config or fileservers with lots of idiots saving 40 "backup" copies of the same 2GB Access database just in case they need it.
Our deduped backup array is currently storing ~70TB of backups on 10TB of raw space and it's only about 40% full - to me, that's useful.
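Working those figures through, assuming "40% full" refers to the 10TB of raw space: roughly 4TB of disk is holding about 70TB of backups, or about 17.5:1. As a small Python calculation (the function name is made up for illustration):

```python
def dedup_ratio(logical_tb, raw_tb, fraction_full):
    """Return (ratio, physical_tb) for a deduplicated store.

    logical_tb    -- size of the data as the clients see it
    raw_tb        -- raw capacity of the array
    fraction_full -- how much of the raw capacity is in use (0..1)
    """
    physical_tb = raw_tb * fraction_full
    return logical_tb / physical_tb, physical_tb


ratio, physical = dedup_ratio(logical_tb=70, raw_tb=10, fraction_full=0.4)
print(f"{physical:.1f} TB on disk, about {ratio:.1f}:1")
```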
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-27 19:13 · Score: 3, Informative

I wonder how much this approach really buys you in "normal" scenarios especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference but can someone enlighten me how this is useful on say a physical system with 10 Centos VMs running different apps or similar apps with different data? You might save a few blocks because of the shared OS files but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.
Assume 200 VMs at, say, 2GB per OS install. Allowing for some uniqueness, you'll probably end up using something in the ballpark of 20-30GB of "real" space to store 400GB of "virtual" data. That's a *massive* saving, not only disk space, but also in IOPS, since any well-engineered system will carry that deduplication through to the cache layer as well.
Deduplication is *huge* in virtual environments. The other big place it provides benefits, of course, is D2D backups.
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-27 19:15 · Score: 1

All sales literature, mind you. My personal experience with it will begin in a few months, when we get our new Celerra installed :-)
As far as I know, Celerras only do file-level dedupe.
Re:How useful is this in realistic scenarios? by Anonymous Coward · 2010-03-27 22:38 · Score: 1, Insightful

Let's put it like this -
Imagine a CentOS box connected to a SAN - with the data mounts on NFS. Now imagine that all of those NFS mounts are deduped at the block level. How does a 3:1 saving sound?
So you can store 2.4TB on 800GB! Now imagine replicating that across a WAN circuit to another SAN for DR.
So not only does dedupe save you in storage costs, thin provisioning etc - it saves you in WAN costs as well. I'll gladly pay for a little more processor/memory up front in order to save those more expensive WAN/storage dollars.
Re:How useful is this in realistic scenarios? by jabuzz · 2010-03-27 23:00 · Score: 1

Yeah, but the problem there is the cost. We run on 17GB boot disks, so your 200 VMs would require under 4TB of disk to store. I am sorry but 4TB of storage is peanuts and I can do that easily with a low end DS3400.
Now the million dollar question to ask is how much does your dedupe solution cost? The reason being any dedupe that is supported against a virtualization solution we have looked at costs more than just buying the frigging disk. One then has to question the point of bothering with the extra layer of complexity.
The level of dedupe in bulk storage is likely to be low as well, besides which the cost of dedupe on a couple hundred TB of disks is ridiculous. Even for backup one has to wonder as well: tape is again really cheap, and dedupe for hundreds of TB is bloody expensive.
Re:How useful is this in realistic scenarios? by DarkOx · 2010-03-27 23:55 · Score: 1

Ok I'll bite.
It's a real rarity in any of the enterprise environments that I have ever seen for minimal OS installs to be the mode of operation on application servers (Unix and like), and I have never seen it on Windows based application servers. I am not even certain I agree that it's such a good idea. Sure, all the daemons not in use should not be started and ideally have had their execute bits turned off to avoid mistakes, but when things go wrong it's often helpful to have full platform availability.
So in lots of SAN based storage scenarios I suspect there is a great deal more than a few blocks to be saved on OS files alone.
Now for an application, think about your typical corporate mail server, where users usually send 100 people a copy of the same spreadsheet; times a thousand spreadsheets, times a few hundred users. Yeah, it would be nice if you could get them to use the collaboration application or at least the file server, but that will never happen. Exchange prior to 2k10 did that type of dedupe in the information store, but not any more. Let's assume you have 5 or 7TB of online mail storage. An often quoted figure is that 30% will be duplicate in the environment I described. SAN storage is still expensive enough that if you can cut that mail store down by a TB, that is a meaningful saving. If that is not reason enough for you, imagine you are doing some kind of SAN level replication to a hot site. The less data you need to move, the less connectivity you need to have to do it; at least in the States here DS3s are not inexpensive. Even if you are just scratching tape backups every night, cutting down the size of the snapshot in any way possible is a big win; anyone who has ever been stressed to figure out backup windows will tell you that.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:How useful is this in realistic scenarios? by RAMMS+EIN · 2010-03-27 23:56 · Score: 1

``You might save a few blocks because of the shared OS files but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.''
Right, but note the if. Most of the places where I've seen virtualization used have most of the VMs running instances of a proprietary operating system which shall remain unnamed. Together with other components that tend to be common, the amount of data that is common among instances can easily be over 10 GB per instance.
There is certainly a more efficient way to deal with > 10GB of common data per instance than storing the same data multiple times, and deduplication is one way to do things more efficiently.

--
Please correct me if I got my facts wrong.
Re:How useful is this in realistic scenarios? by MMC+Monster · 2010-03-28 00:29 · Score: 1

This doesn't even save a single hard drive at current storage densities. :-(

--
Help! I'm a slashdot refugee.
Re:How useful is this in realistic scenarios? by marvin2k · 2010-03-28 01:47 · Score: 1

Who does 2GB OS installs especially in a 200+ VM environment? That's insane. I agree that deduplication is a nice addition to the virtual tool-set but it only seems to really add a benefit to very specific environments. If I have optimized OS installs and the VMs run completely different data-sets from different organizations then the cost (both money and system resources) of deduplication seems to outweigh the benefit of saving a few G especially in a world where HDs come in 2TB sizes.
Re:How useful is this in realistic scenarios? by jabuzz · 2010-03-28 01:54 · Score: 1

The point is though even if you save 20GB per OS instance, that only comes to 2TB over 100 virtual machines. You are talking of saving four RAID1 450GB 15k rpm SAS/FC arrays or eight disks. It really just is not worth the additional complexity. At your 10GB per instance we are talking two arrays, or four disks, which is even less worth the additional complexity.
Then once you look at mature commercial implementations and you start paying by the TB deduped it becomes utterly pointless. For sure an open source implementation can change that, but not one implemented in Java for crying out loud.
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-28 02:25 · Score: 1

Now the million dollar question to ask is how much does your dedupe solution cost?
Nothing. Our NetApp has it by default (who charges extra for dedupe these days ?).
The reason being any dedupe that is supported against a virtualization solution we have looked at costs more than just buying the frigging disk.
Except it doesn't cost any more and it saves IOPS, meaning we need to buy less disk not only for space, but for performance as well.
The level of dedupe in bulk storage is likely to be low as well, besides which the cost of dedupe on a couple hundred TB of disks is ridiculous. Even for backup one has to wonder as well: tape is again really cheap, and dedupe for hundreds of TB is bloody expensive.
If your dedupe solution has differing costs depending on how much data you have, you've got the wrong solution.
Re:How useful is this in realistic scenarios? by marvin2k · 2010-03-28 02:32 · Score: 1

First of all.... one of the most commonly duplicated blocks is the NUL block, that is a block of data where all bits are 0, corresponding with unused space, or space that was used and then zero'd.
If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have roughly 20GB that could be freed up by dedup.
But that is a "bug" in the storage management of virtualization environments. If blocks are not used they should not be allocated. Allocating them and then saying "but we can reduce the space they use" sounds like a hack at best. It's more of a workaround rather than a solution.

Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.
Even on a single system, many system binaries and libraries, will contain duplicate blocks.
Of course multiple binaries statically linked against the same libraries will have dups.
But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.
If you optimize your OS install you will not gain much by deduplicating all those OS files. If your infrastructure is so small that you can afford to be wasteful to get away with doing default installs then why bother with the complexity and costs of deduplication at all?

Then if the system actually contains user data, there is probably duplication within the data.
For example, mail stores... will commonly have many duplicates.
But to how many people out there does this actually apply? There may be quite a few corporate environments where this is the case but many if not most people will probably have fairly unique data in their VMs. Deduplication doesn't help much in that case.

One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.
If users store files on the system, they will commonly make multiple copies of their own files..
Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc
Can MS Word files be large enough to matter? Yes.. if you get enough of them.
Besides they have common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality: a small amount of waste multiplied by thousands of files, adds up.
Again this only applies to very specific environments. The 40GB MySQL db of customer A and the 60GB db of customer B don't really share any data at all and the data-to-OS-files ratio is probably in the ballpark of 100:1 so I see very little gain.

Just because data seems to be all different doesn't mean dedup won't help with storage usage.
I don't doubt there are infrastructures that will benefit hugely from this but I think those infrastructures are the minority. I just see all this undifferentiated hype about how this will reduce people's storage troubles when it really only applies to a (relatively) small group of people out there. If you do the deduplication on a big chunk level you'll get little overhead but won't find many duplicates. If you do the deduplication more fine-grained then you'll find more duplicates but incur more overhead for the deduplication process.
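The chunk-size trade-off is easy to see on synthetic data. The Python sketch below is an illustration only: the sample is two made-up 64 KiB "files" that share just their first 4 KiB, and the names are hypothetical. The number of chunks stands in for index overhead (one entry per chunk), and the number of unique chunks is what actually has to be stored.

```python
import hashlib
import random


def chunk_stats(data, chunk_size):
    """Return (total_chunks, unique_chunks) for fixed-size chunking."""
    digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return len(digests), len(set(digests))


def sample_data(seed=0):
    """Two made-up 64 KiB 'files' that share only their first 4 KiB."""
    rng = random.Random(seed)
    shared = bytes(rng.getrandbits(8) for _ in range(4 * 1024))
    file_a = shared + bytes(rng.getrandbits(8) for _ in range(60 * 1024))
    file_b = shared + bytes(rng.getrandbits(8) for _ in range(60 * 1024))
    return file_a + file_b


if __name__ == "__main__":
    data = sample_data()
    for size in (64 * 1024, 4 * 1024):
        total, unique = chunk_stats(data, size)
        print(f"{size // 1024:>2} KiB chunks: {total} index entries, {unique} stored")
```

With 64 KiB chunks there are only two index entries and no duplicate is found; with 4 KiB chunks the shared block is found, but the index grows to 32 entries for the same 128 KiB of data.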
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-28 02:34 · Score: 1

Who does 2GB OS installs especially in a 200+ VM environment? That's insane.
We certainly do. Why wouldn't we ? Trying to shave a few 10s or hundreds of MB off installation sizes is wasted time when your storage system can deliver similar (and more benefits) without the (expensive) human overheads.
I agree that deduplication is a nice addition to the virtual tool-set but it only seems to really add a benefit to very specific environments.
"Very specific" ? You mean anyone doing non-trivial virtualisation ?
If I have optimized OS installs and the VMs run completely different data-sets from different organizations then the cost (both money and system resources) of deduplication seems to outweigh the benefit of saving a few G especially in a world where HDs come in 2TB sizes.
Firstly, to get sufficient IOPS, drive sizes are ~500GB, not 2TB (or even smaller for SSDs).
Secondly, the OS part of a large proportion of your VMs is always going to be identical, allowing for large savings.
Thirdly, savings aren't just in raw disk space. You also save IOPS (since the dedupe should be carried through to the cache layer) and bandwidth if you're replicating over a WAN.
Re:How useful is this in realistic scenarios? by mysidia · 2010-03-28 04:06 · Score: 1

If you have 30 full Redhat EL 5 installs in virtual machines on a host, each OS install will use about 10GB out of the box. That's approximately 10GB of duplicate data per VM, 10 x 30 = 300GB.
The same is true of Windows based OSes, for example a minimal install of Windows 2008 R2 or Windows 7 will use approximately 8GB after install is done.
Again this only applies to very specific environments. The 40GB MySQL db of customer A and the 60GB db of customer B don't really share any data at all and the data-to-OS-files ratio is probably in the ballpark of 100:1 so I see very little gain.
Actually... it applies to most environments. It only fails to be useful in very specific environments.
The MySQL database server or Oracle DB server with the 40GB database is ONE server in the enterprise; a vast majority of servers are not database servers. In fact, if you have 40GB customer databases on your DB server, that server is not (right now) a good candidate for compute or storage virtualization in the first place, anyways, due to the I/O penalty.
Virtualization software is currently not suitable for heavy I/O workloads, such as large DB servers performing thousands of transactions per second. Unless you have one of those fancy new ultra-expensive setups with Nehalem and I/O virtualization, such as Cisco UCS, databases with significant workload suffer tremendously under virtualization.
Plus, you wouldn't want to dedup the huge MySQL database servers using something like the dedup mentioned in the article, since there is probably already a performance bottleneck: with such a large database and multiple users, there are sure to be times when DB server performance limits the speed of their application.
Unless the MySQL DB is extremely full of large duplicate records, dedup won't help much for the database.
Re:How useful is this in realistic scenarios? by jabuzz · 2010-03-28 04:23 · Score: 1

Your NetApp!!! Then you are already paying through the nose for storage.
As for IOP's, *very* few machines ever get close to pushing the IOP's for one array, let alone the storage system.
NetApp is about the only storage solution that provides dedup in the box at no extra cost. However per GB of storage it is one of the most expensive solutions out there.
Re:How useful is this in realistic scenarios? by jabuzz · 2010-03-28 04:37 · Score: 1

Large savings! You are never going to get large savings on OS installs through dedup unless your idea of a large amount of storage is a handful of TB. Of the 300+TB of managed storage at work there is less than 6TB in our virtual infrastructure, and of that less than 500GB is OS installs.
Oh and buying that amount of storage in NetApps would be hugely prohibitive, and there is no point in saving IOPS because even our ancient DS4400 that is soon to be replaced gets nowhere near its IOPS limit.
Re:How useful is this in realistic scenarios? by amorsen · 2010-03-28 04:43 · Score: 1

The real power is when the whole OS knows the file is deduplicated. Virtuozzo supposedly provides this (I can't know for sure, I've had zero luck trying to buy anything from them). Then the OS can get away with only having each program in memory once, even if it's used in a hundred different VM's. That saves a lot of memory and if you're lucky it even gets you higher icache hit rates.

--
Finally! A year of moderation! Ready for 2019?
Re:How useful is this in realistic scenarios? by Eil · 2010-03-28 05:06 · Score: 1

My personal experience with it will begin in a few months, when we get our new Celerra installed :-)
My condolences to you, sir.
Re:How useful is this in realistic scenarios? by Eil · 2010-03-28 05:15 · Score: 1

I think I understand what deduplication is, but... isn't it just an on-the-fly block or file system version of file-based compression? Why don't we just call it "compression" instead of bandying about new terminology?
(Not trying to be critical, just playing devil's advocate. :)
Re:How useful is this in realistic scenarios? by s1acker · 2010-03-28 06:21 · Score: 1

OpenDedup is actually block level - it just uses a file based backing store to hold the chunks.
Re:How useful is this in realistic scenarios? by jabuzz · 2010-03-28 08:54 · Score: 1

A NetApp is not a SAN, it is a fancy dressed up NAS that you pay through the nose for.
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-28 16:44 · Score: 1

Your NetApp!!! Then you are already paying through the nose for storage.
*shrug* It costs the same as other storage systems in its class.
As for IOP's, *very* few machines ever get close to pushing the IOP's for one array, let alone the storage system.
The OS drive for a Linux VM "idles" at about 7-8 IOPS. 200 of them, then, basically uses up a shelf of 10k disks.
NetApp is about the only storage solution that provides dedup in the box at no extra cost. However per GB of storage it is one of the most expensive solutions out there.
They're a top-tier vendor to be sure, but no more expensive than alternatives like IBM or EMC.
Re:How useful is this in realistic scenarios? by drsmithy · 2010-03-28 16:47 · Score: 1

Large savings! You are never going to get large savings on OS installs through dedup unless your idea of a large amount of storage is a handful of TB.
It's not (directly) about TB, it's about spindle counts.
Oh and buying that amount of storage in NetApps would be hugely prohibitive, and there is no point in saving IOPS because even our ancient DS4400 that is soon to be replaced gets nowhere near its IOPS limit.
It's great you have enough spindles to not be IO-bound, but some of us aren't that lucky.
Re:How useful is this in realistic scenarios? by rtfa-troll · 2010-03-28 18:59 · Score: 1

But that is a "bug" in the storage management of virtualization environments. If blocks are not used they should not be allocated. Allocating them and then saying "but we can reduce the space they use" sounds like a hack at best. It's more of a workaround rather than a solution.
Well spotted; your comment finally made me think. The allocation of blocks for VMs is optional (VMs, under Linux can use sparse files with the zero blocks unallocated). You do it because you want the disk storage to be more contiguous and faster to allocate. De-duplication in this case is pretty much against the entire intent of actually allocating the blocks.

--
=~ s,(.*),<sarcasm>$1</sarcasm>,g if any_point_you_wish();

This just gave me a good idea! by thePowerOfGrayskull · 2010-03-27 16:17 · Score: 3, Interesting

Actually, just the title did it. I've historically had a bad habit of backing things up by taking tar/gzs of directory structures, giving them an obscure name, and putting them onto network storage. Or sometimes just copying directory structures without zipping first. Needless to say, this makes for a huge mess.

Just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree; run sha1sum on all files; and identify duplicate files automatically; probably in just one or two lines.
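For what it's worth, here is a minimal sketch of that idea in Python rather than a shell one-liner. The function names (sha1_of_file, find_duplicates) are made up for illustration, and it assumes the archives have already been extracted under a single root directory.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each digest to the sorted paths under root that share it.

    Only digests seen more than once are returned. Symlinks are skipped
    so the same file is not counted twice.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[sha1_of_file(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Calling find_duplicates on the extracted tree gives one entry per set of identical files; what to do with the extras (delete, hardlink, just report) is left open.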

So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...

Re:This just gave me a good idea! by Hooya · 2010-03-27 17:06 · Score: 3, Informative

try this:
mv backup.0 backup.1
rsync -a --delete --link-dest=../backup.1 source_directory/ backup.0/
see this
Re:This just gave me a good idea! by Anonymous Coward · 2010-03-27 17:11 · Score: 1, Informative

http://en.wikipedia.org/wiki/Venti
Re:This just gave me a good idea! by devent · 2010-03-27 17:36 · Score: 1

Why not just rdiff-backup? rdiff-backup.nongnu.org

--
http://www.mueller-public.de - My site http://www.anr-institute.com/ - Advanced Natural Research Institute
Re:This just gave me a good idea! by CAIMLAS · 2010-03-27 17:37 · Score: 1

Two things to look into:
rsync snapshots
rsnapshot, for a better rsync snapshot

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Re:This just gave me a good idea! by thePowerOfGrayskull · 2010-03-27 19:00 · Score: 1

Even better -- thanks!
Re:This just gave me a good idea! by thePowerOfGrayskull · 2010-03-27 19:03 · Score: 1

Most recently, I've been moving most of my documents and source-code level stuff to a LAN-based SVN repository; then periodically dumping that, encrypting, and tossing it onto dropbox. The versioning is good, but it's not so practical for downloaded files and various other content types.
I'll take a look at this - thanks for the post.
Re:This just gave me a good idea! by evilviper · 2010-03-27 20:41 · Score: 1

Why not just rdiff-backup?
Why not just rsync? ...
Now that that's out of the way...
For one thing, rdiff will give you a mess... rsync will give you multiple full filesystem trees...

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:This just gave me a good idea! by Fruit · 2010-03-27 21:31 · Score: 1

here you go :)
Re:This just gave me a good idea! by nacturation · 2010-03-27 22:20 · Score: 1

Deduplicated backups: http://backuppc.sourceforge.net/info.html

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Re:This just gave me a good idea! by bokmann · 2010-03-27 23:06 · Score: 1

For more good ideas like this, watch this screencast from pragmatic TV.
http://bit.ly/Pk3z3
Jim Weirich explains how git (the version control tool) works from the ground up, and in doing so, builds a hypothetical system that sounds like what you are trying to do.
Re:This just gave me a good idea! by slashflood · 2010-03-27 23:11 · Score: 1

One word: BackupPC.
Re:This just gave me a good idea! by david.given · 2010-03-27 23:46 · Score: 1

You may want to look at rsnapshot. It's a very small Perl script that pretty much duplicates the functionality of Apple's Time Machine. Each backup becomes a timestamped directory containing all the data in the backup; files that haven't changed from backup to backup are hardlinked together, so they only get stored once (per-file deduplication). This makes incremental backups very cheap, while also avoiding the need for specialised backup restoration software. It all works through the magic of rsync.
On my system, each incremental backup of a 24GB dataset occupies about 600MB (depending how many files have changed). And each incremental backup is a complete, uncompressed copy of the dataset, making extracting files trivial!
It'll also back up across the network with ssh, so you can back up remote servers; it'll even back up Windows machines. It does proper backup rotation (I store two weeks' worth of daily backups, then a couple of weekly backups, then monthly). It's totally awesome.
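A rough sketch of the hardlink trick, in Python. This is not how rsnapshot itself is implemented (it drives rsync); the snapshot and _unchanged helpers are hypothetical, and "unchanged" is assumed to mean same size and same whole-second mtime.

```python
import os
import shutil


def _unchanged(src, old):
    """Treat a file as unchanged if size and whole-second mtime match."""
    a, b = os.stat(src), os.stat(old)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def snapshot(source, previous, dest):
    """Build a full tree in dest, hard-linking files unchanged since previous.

    previous may be None for the very first backup.
    Returns (number of files linked, number of files copied).
    """
    linked = copied = 0
    for dirpath, _dirnames, filenames in os.walk(source):
        rel_dir = os.path.relpath(dirpath, source)
        target_dir = os.path.normpath(os.path.join(dest, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            new = os.path.join(target_dir, name)
            old = None
            if previous is not None:
                old = os.path.normpath(os.path.join(previous, rel_dir, name))
            if old and os.path.isfile(old) and _unchanged(src, old):
                os.link(old, new)        # same inode, no extra data stored
                linked += 1
            else:
                shutil.copy2(src, new)   # copy2 keeps the mtime for next time
                copied += 1
    return linked, copied
```

Every snapshot directory looks like a complete copy, but an unchanged file costs only a directory entry. The flip side is that all hardlinked copies share one inode, so permissions and timestamps are shared too.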
Re:This just gave me a good idea! by TheRaven64 · 2010-03-28 00:14 · Score: 1

You might like to take a look at Epitome, which supports CAS, DEDUP, SIS and remote backup.

--
I am TheRaven on Soylent News
Re:This just gave me a good idea! by Z8 · 2010-03-28 04:25 · Score: 2, Informative
Yep, and then you don't have to worry about
- Changes in permissions/mtimes/atimes corrupting all your old backups because all of them are hard linked, or alternatively
- Changes in permissions/mtimes/atimes causing an entire file to get copied
There are also other things to worry about. To be fair, the guy who invented --link-dest wrote a backup program called Dirvish so that is a better comparison to rdiff-backup.
Re:This just gave me a good idea! by Z8 · 2010-03-28 04:27 · Score: 1

If you even read the grandparent post you would have seen that the suggestion is to use rsync. And yes, you do get multiple full filesystem trees (but with hardlinking for deduplication).
Re:This just gave me a good idea! by evilviper · 2010-03-28 07:12 · Score: 1

Your comprehension needs work....

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Offtopic? by SanityInAnarchy · 2010-03-27 16:18 · Score: 3, Informative

If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.

And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.

--
Don't thank God, thank a doctor!

User Land? Come on! by Gazzonyx · 2010-03-27 16:30 · Score: 1

[...] Opendedup runs in user space, making it platform independent, easier to scale and cluster, [...]

... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?

--

If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

Re:redundant if saving large amounts of data to SA by dbIII · 2010-03-27 16:30 · Score: 1

Consider that things may be spread over more than one SAN or that it is a situation where an old style file server makes better sense anyway.

Confusing summary by Brian+Gordon · 2010-03-27 16:33 · Score: 1

The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes

Can anyone offer wisdom on what the volume size is supposed to signify, being different from the maximum size that SDFS is scalable to?

Re:Confusing summary by Spad · 2010-03-27 18:41 · Score: 1

It's the raw capacity of the filesystem compared to the maximum amount of deduplicated data it can handle. So you can have 8PB of raw disk space, on which you can store up to 8EB of deduplicated data (depending on the dedupe ratios you get - I think 1000x is a little optimistic, 30x-40x is more common).
Re:Confusing summary by s1acker · 2010-03-28 06:36 · Score: 1

when you create an sdfs volume, you specify an arbitrary volume size - 8EB must be the maximum size you can specify. I'm guessing the current implementation is only able to deal with 32TB worth of deduplicated chunks - so if you have 8EB of data which you can get a 250000:1 deduplication ratio on, then you could fill up the 8EB volume.
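The numbers in this subthread can be sanity-checked with a few lines of Python. This assumes decimal units (1TB = 10^12 bytes), which the summary doesn't actually state.

```python
# Decimal units are an assumption; the summary does not say which it uses.
TB = 10**12
PB = 10**15
EB = 10**18

engines = 256
per_engine = 32 * TB                 # deduplicated data per storage engine
raw_capacity = engines * per_engine  # total across all engines
max_volume = 8 * EB                  # largest volume size you can specify

print(raw_capacity / PB)             # raw capacity in PB
print(max_volume // per_engine)      # ratio needed to fill 8EB from one engine
print(max_volume / raw_capacity)     # ratio needed to fill 8EB from all 256
```

So 256 engines at 32TB each is a little over 8PB, a single 32TB engine would need 250000:1 to fill an 8EB volume, and the full 8PB would need just under 1000:1.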

Re:Patent 5,813,008 by Lorens · 2010-03-27 16:58 · Score: 1

Good try, but after skimming it, does not seem to apply. Seems to be for deduplicating e-mail attachments.

Re:Let's get down to brass tacks. by Hooya · 2010-03-27 17:01 · Score: 3, Funny

very repetitive. back and forth. back and forth. oh wait... that's not what you meant. never mind.

Re:Patent 5,813,008 by snikulin · 2010-03-27 17:03 · Score: 1

SIS is frequently implemented in file systems, e-mail server software, data backup and other storage-related solutions.

It finally happened by thesymbolicfrog · 2010-03-27 17:17 · Score: 1

I stopped being able to read English. WTF does any of that mean? Is it written in moonspeak?

Off-site replication by Dishwasha · 2010-03-27 17:29 · Score: 1

One of the biggest targets for data de-duplication is efficient off-site replication, which you see in the EMC Avamar product line. This is advantageous when your WAN links are too slow for synchronous replication and a scheduled asynchronous replication would take too long. I'd like to see the SDFS storage engine be intelligent enough to snapshot the data. Then, when the next "backup/replication" occurs, it would gather up the hashes of all blocks that have changed since the snapshot was created, communicate those hashes to the off-site system, and transfer just the blocks that don't yet have a matching hash on the target system. The target system would receive a complete hash table update of the snapshot block difference from the source; both systems would then merge their snapshots and take a new snapshot to get ready for the next replication cycle.
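To make the hash-exchange step concrete, here is a toy in-memory sketch in Python. Nothing here is SDFS or Avamar code: the Target class and the replicate function are hypothetical, SHA-256 is an arbitrary choice of hash, and snapshots are left out entirely.

```python
import hashlib


def block_hash(block):
    """Hash used to identify a block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


class Target:
    """Hypothetical off-site store, keyed by block hash."""

    def __init__(self):
        self.blocks = {}

    def missing(self, hashes):
        """Return the hashes this side does not hold yet."""
        return [h for h in hashes if h not in self.blocks]

    def receive(self, blocks):
        """Store transferred blocks under their own hash."""
        for block in blocks:
            self.blocks[block_hash(block)] = block


def replicate(changed_blocks, target):
    """Send only the blocks the target lacks; return how many were sent."""
    by_hash = {block_hash(b): b for b in changed_blocks}
    wanted = target.missing(list(by_hash))        # cheap: hashes only
    target.receive([by_hash[h] for h in wanted])  # expensive: block data
    return len(wanted)
```

Only the hashes cross the WAN for blocks the far side already has, which is the whole point when the link is the bottleneck.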

Re:Patent 5,813,008 by Lorens · 2010-03-27 17:35 · Score: 1

Which claims apply? I can see no claim that does not reference "information items [...] transferred between a plurality of servers connected on a distributed network". In fact, e-mail attachment dedup is seen as prior art (Background, fourth paragraph). File dedup is simpler than that.

Re:Patent 5,813,008 by pem · 2010-03-27 17:36 · Score: 2, Interesting

A good lawyer could probably argue that this doesn't apply.

Claim 1(a) requires "dividing an information item into a common portion and a unique portion".

It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.

(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)

See also: LessFS by kb1 · 2010-03-27 17:39 · Score: 1

The LessFS project also deserves mention: http://www.lessfs.com/ . Just think of the effect of combining a deduplication system with an iSCSI shared virtual tape library like http://sites.google.com/site/linuxvtl2/

Re:See also: LessFS by jabuzz · 2010-03-27 23:03 · Score: 1

And given that it is not written in Java is likely to be much better performing.

Re:redundant if saving large amounts of data to SA by afidel · 2010-03-27 18:55 · Score: 1

Not every SAN has dedupe, for instance my HP EVA doesn't. Also, many of the low-end NetApp boxes have processors too anemic to do dedupe. Most of the low-end iSCSI boxes also lack dedupe.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.

Look at StoreBackup by bradley13 · 2010-03-27 19:08 · Score: 1

We're a bit off topic here, seeing as this has nothing to do with file systems, but being off-topic is on-topic for /.

Anyhow: StoreBackup is a great backup system that automatically detects duplicates.

--
Enjoy life! This is not a dress rehearsal.

Re:deduplication by GNUALMAFUERTE · 2010-03-27 19:11 · Score: 2, Funny

So, Blade Runner was about de-duplication?

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?

Re:New use for an old algorithm? by SharpFang · 2010-03-27 21:04 · Score: 1

it seems so, but the ordering was always: physical, partition, filesystem, compression (sometimes fs integrating compression) and compression applied to relatively small chunks (blocks).

Now you have compression layer above partition layer, which means two identical files on two different partitions will occupy space of one physically.
So, say, your LAMP server takes up 4GB generic system plus 1GB custom data. 1TB of storage could fit 200 partition-files of such a server. Now you'll fit 995 of them, and it will work faster, as the commonly used parts of the FS will be read and buffered once for all instances.
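A rough check of that arithmetic in Python, assuming 1TB = 1000GB and that the 4GB generic system dedups perfectly across every instance (both are idealizations):

```python
total_gb = 1000   # 1TB of storage, decimal units assumed
system_gb = 4     # generic system, identical on every server
custom_gb = 1     # data unique to each server

# Without dedup every server stores its own copy of the system.
without_dedup = total_gb // (system_gb + custom_gb)

# With perfect dedup the system is stored once, then 1GB per server.
with_dedup = (total_gb - system_gb) // custom_gb

print(without_dedup, with_dedup)
```

Under those assumptions the result is 200 servers without dedup and 996 with it, which is in line with the figure quoted above.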

--
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2

Great idea by fearlezz · 2010-03-27 21:28 · Score: 1

Too bad it's just another new filesystem. I would have preferred integration into (some future version of) EXTn or BTRFS.
Not only would that mean it gets more widely available, it also means you don't have to miss all the nice functions of these filesystems. You may even be able to use it out of the box.

--
.sig: No such file or directory

Re:deduplication by nacturation · 2010-03-27 22:10 · Score: 4, Funny

What kind of lame recursive acronym is "deduplication"? I'm flummoxed in any attempt to decipher it.

Deduplication Eases Disk Utilization Purposefully Linking Information Common Among Trusted Independent Operating Nodes

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.

See Also LESSFS by sharper56 · 2010-03-27 22:37 · Score: 4, Interesting

Another nice OpenSource FS De-Dup project to look into is LESSFS.

Block-level de-dup and good speed. Also offers per block encryption and compression.

I'm using it to back up VMs. 2TB of raw VMs plus 60 days of changes store down to 300GB. Write to de-dup FS is > 50MB/s.

Re:See Also LESSFS by phoenix_rizzen · 2010-03-28 14:18 · Score: 2, Interesting

ZFS also offers block-level dedupe support since ZFSv21. You can run it via FUSE on Linux, or natively on OpenSolaris. Hopefully, it'll also be available in FreeBSD 9.0 if not sooner (FreeBSD 7.3/8.0 have ZFSv14).
Since ZFS already checksums every block that hits the disk, dedupe is almost free, as those checksums are re-used for finding/tracking duplicate blocks.

Protected Space by nurb432 · 2010-03-28 01:44 · Score: 1

This is why some vendors protect some duplicated VM data ( like the OS ).

And sure, stock dedup is not the be-all and end-all, but it goes a long way toward that goal, and the gains are more than worth the risks.

--
---- Booth was a patriot ----

Re:deduplication by binaryspiral · 2010-03-28 02:39 · Score: 1

I prefer SIS (single instance storage) or ASIS (Advanced SIS)

This has saved me loads of time by MarkH · 2010-03-28 04:13 · Score: 1

I had 3 backups of home data of about 300gbytes each.

Each one was almost but not quite the same due to some rather poor backup policies on my part.

I was able to dedup per backup to get them small enough to combine and dedup the combo.

Left with one pure 150gbytes combo. Rsync is amazing

Re:User Land? Come on! by s1acker · 2010-03-28 06:31 · Score: 1

... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?

doesn't Lustre use a ZFS over FUSE implementation on Linux nodes?

anyhow, there are decent alternatives for what ZFS provides (nowhere near as comprehensive as ZFS, but workable at least). Afaik, there is nothing that provides a deduplicated FS - and if this is able to get 150MB/s then that's a good start.

People are getting off-track by azrider · 2010-03-28 11:46 · Score: 1

This is not virtualization. It is exactly what it says. The product referenced in the article is a separate file system.

The way de-duplication works is the system maintains a hash table for the file system (usually block level). When it detects that two files have a block in common, it sets a flag that says "this block is common to both of these files".

The entry is essentially an inode entry (linked list) and a reference count.
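As a toy illustration of the hash table plus reference count, here is an in-memory sketch in Python. It is not how SDFS or any real product stores data; the DedupStore class, the 4096-byte default block size, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Hypothetical block-level dedup store with reference counts."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}     # digest -> block bytes, stored once
        self.refcount = {}   # digest -> number of references to the block
        self.files = {}      # file name -> ordered list of digests

    def write(self, name, data):
        """Split data into fixed-size blocks and store only new ones."""
        if name in self.files:
            self.delete(name)
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        """Reassemble a file from its block list."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def delete(self, name):
        """Drop a file; free blocks nothing else references."""
        for digest in self.files.pop(name):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.blocks[digest]
```

Two files sharing a block end up pointing at one stored copy with a count of 2, and the block only goes away when the last file referencing it is deleted.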

The effort is more commonly used in virtual tape systems, because you will normally have multiple generations of the same tape file. It is also the way that zones (under Solaris) and virtual systems (under AIX) work, since there is generally a certain amount of static data shared between zones.

It does however have implications for common data between web server instances and/or web+(s)ftp instances. If you should need to restore data to a web server instance where dedup is active, the restore is much faster when you only have to actually write a subset of the data back.

It would be well worth it (if you should have a test system) to experiment with the tech. After all, the product is free.

--
And ye shall know the truth, and the truth shall make you free.
John 8:32(King James Version)

Re:People are getting off-track by fm6 · 2010-03-28 16:18 · Score: 1

Dude, you really need to slow down. Not only did you reply to the wrong post, nobody made the statement you're disputing. If you meant to reply to the post I think you meant to reply to, (go up a couple levels from your own post), go back and read it more carefully. You'll find that it says that deduping is important to people who do a lot of virtualization. And it is; if you have 1,000 VMs running Windows Server, you don't want a thousand copies of all the OS files.

It's called "rsnapshot" by Walles · 2010-03-29 00:14 · Score: 1

There's a program that automates what you describe, and it's called "RSnapshot":
http://rsnapshot.org/

If you have a system that isn't always up you want something like this to launch it:
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=27;filename=run-rsnapshot;att=1;bug=523923

--
Installed the Bubblemon yet?

Re:It's called "rsnapshot" by thePowerOfGrayskull · 2010-03-29 04:48 · Score: 1

I'll give it a try. Most of my work is on a Windows box, but the systems I back up to are Linux. Currently I'm switching over to SVN for a lot of this - though it's far from ideal.

Re:deduplication by Dr.+Zed · 2010-03-30 19:26 · Score: 1

de- duplication

Re:deduplication by h00manist · 2010-04-04 05:58 · Score: 1

So, Blade Runner was about de-duplication?

It was an early form of it. At the time there was no distinction between 'clones' and 'duplicates'. They mistakenly eliminated clones, while the duplicates escaped - they couldn't be told apart from originals. These days clones have their roles and rights better defined, and can usually survive if they commit no illegal operations in the known social memory. Duplicates however are usually found by the Trusted Computing(c) DRM de-duplication techniques. They are rumored to sometimes destroy the originals mistakenly, along with their hosts, though that's safely prevented and handled by the Public Relations (c) BRNWSH technologies. So we have never heard of any such cases.

--
Build your own energy sources from scratch. http://otherpower.com/

for memory by h00manist · 2010-04-04 06:27 · Score: 1

vmware does share memory pages. KSM appears to have that now too, haven't read much about it - unix-linux uses this very well in multiuser, especially in LTSP, where users running the same program share the memory. I don't know if windows terminal server does it nowadays - it didn't when I used it, several versions ago.

--
Build your own energy sources from scratch. http://otherpower.com/
