ZFS Gets Built-In Deduplication
elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
Duplicate slashdot articles will be links back to the original one?
...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.
File systems are one area where computer technology is lagging, comparatively speaking, so good to see innovation such as this.
The Mothership
I wrote two first posts, but I guess /. is on ZFS now.
Before we get all excited and look all silly, can somebody confirm with Netcraft first?
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
Surely with high amounts of data (that zfs is supposed to be able to handle), a hash collision may occur? I'm sure a block is > 256bits. Do they just expect this never to happen?
Although I suppose they could just be using it as a way to narrow down candidates for deduplication... doing a final bit for bit check before deciding the data is the same.
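For anyone wondering what "narrow down candidates, then check bit for bit" would look like, here is a minimal sketch in Python. It is not ZFS code: the `DedupStore` class, its `verify` flag and the in-memory dicts are all hypothetical, and the only thing taken from the summary is the use of SHA-256 as the block key.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store keyed by SHA-256 digest (hypothetical, not ZFS)."""

    def __init__(self, verify=False):
        self.verify = verify
        self.blocks = {}    # digest -> block bytes (one physical copy per digest)
        self.refcount = {}  # digest -> number of logical references

    def write(self, block: bytes) -> bytes:
        digest = hashlib.sha256(block).digest()
        existing = self.blocks.get(digest)
        if existing is None:
            # First time this content is seen: store one physical copy.
            self.blocks[digest] = block
            self.refcount[digest] = 1
            return digest
        if self.verify and existing != block:
            # Same digest, different bytes: a genuine hash collision.
            raise ValueError("SHA-256 collision detected")
        # Duplicate content: only bump the reference count.
        self.refcount[digest] += 1
        return digest

    def read(self, digest: bytes) -> bytes:
        return self.blocks[digest]


store = DedupStore(verify=True)
first = store.write(b"x" * 4096)
second = store.write(b"x" * 4096)   # duplicate, no new physical block
third = store.write(b"y" * 4096)    # new content, new physical block
print(len(store.blocks), store.refcount[first])
```

With `verify=False` the hash alone decides whether two blocks are the same, which is the case the question above worries about. With `verify=True` every dedup hit costs an extra read and compare of the existing block.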
"Infecting minds with my own memetic virus, one post at a time." Ultimape
Are there any other filesystems with that feature? If not, I'm very strongly considering writing my own.
I'm wondering how long it's going to take for them to do something with ZFS that actually makes me slow down my overwhelming ZFS fanboyism.
I just love these guys.
My virtual machine NFS server is going to have to get this as soon as FBSD imports it, and I'll no longer have to worry about having backup software (like BackupPC, good stuff btw) that does this.
I don't use high end SANs but it would seem to me that they are rapidly losing any particular advantage to a Solaris or FBSD file server.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Dee dupe de dupe!
Drey dupe de drupes!
Dey dook dour dobbs!
Dey took Lou Dobbs!
Dey drook our jobs!
They took our jobs!
Signed,
Slashdot editors
... and then they built the supercollider.
ZFS, from what I can tell, kicks ass. I've played around with it in virtual machines, taking drives off line, recreating them, adding drives, etc.
When I search NewEgg I also search OpenSolaris' compatibility list.
The two areas where Linux is playing catch-up are filesystems (like this) and sound (OSS, Pulse, Alsa, oh my!). And before you go pointing out the btrfs project, this has been in servers for years. It's tried in an enterprise environment. Your file system is still in beta with a huge "Don't use this for important stuff" warning.
Considering what's going on between NetApp and Sun currently, I wonder what they'll think of this?
-yb
Imagine the amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics. It could even do this quietly in the background.
I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.
But since the copy operation de-compressed files on the fly we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copy and delete files beginning with the smallest, and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.
De-duplication is pretty much the same thing, compression by recording and eliminating duplicates. But any minor automated update of some files runs the risk of changing them such that what was a duplicate, must now be stored separately.
This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").
For archival stuff or OS components (executables, and source code etc) which virtually never change this would be great.
But there is a hell to pay somewhere down the road.
Sig Battery depleted. Reverting to safe mode.
... strategically populate the available space with duplicates of commonly read blocks, for increased fault tolerance and performance?
Where did I hear that one?
IANAL but write like a drunk one.
The amount of resources it reportedly takes makes this not so practical.
What would one want deduplication for? The cost of disk storage has two big elements - speed (latency & throughput) and backup.
It does not seem that this technology would help much in the speed department, it might actually hurt. Managing copy on write has several potential costs. It may help backup if the backup program knows the fine details of deduplication, but that means that old backup software will have to be replaced.
It reminds me of the compressed file system I used to have on my old SLS Linux PC, which had a small disk (1992 if memory serves me right). It was dog slow to run X11 on it. I have not seen a compressed file system since; there was no need. Disk storage grows much faster than my need for data.
http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.aspx
No need to mod me up.
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
Use open source, get cutting edge things.
Like cutting-edge CAD packages, games, financial management and office suites? Good thing we had you to tell us that open source will solve our every problem just by virtue of it being open source. I'm sure every print shop is going to dump Photoshop for GIMP, every finance firm will dump Excel for OpenOffice Calc and every engineering firm will dump AutoCAD for... what exactly?
Maybe, just maybe open source isn't the answer for everything after all...
I Heard ISPs Were Doing This With Broadband.
Simply duplicate your advertised pipe across 100 subscribers.
If they want to access it at the same time, just shift stuff around.
If they want to access it at the same time, and you don't have room to shift stuff around, just impose caps and bill them progressively out the ass.
There are enough tales of woe in the discussion groups about ZFS file systems that have melted down on people that I would not start shorting the midrange storage companies' stock just yet. I myself have an 18TB ZFS filesystem on an X4540 and it was brought to a standstill a few weeks ago by one dead SATA disk. Didn't lose any data, and it might be buggy hardware and drivers, but still, Sun support had no explanation. That should not happen!
I'm still a ZFS fanboy though - for about $1 per GB how can you lose. The host is a backup / virtual tape library server so it's not super high availability, and it's hella fast. No problem stuffing data into it at 2 X 1000baseT wire speed.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
You are losing reliability. Some hashes will collide on some computer somewhere.
The idea is that if you assume that blocks on HD are random then odds of hitting Hash collision are tiny.
But data is not random - humans and programs make it non-random!
Here is an example:
What are the odds that 256 people going across the street will all be men?
That would be 2 ^ -256 - that will never happen.
But guess what? Imagine that you see a parade and 300 marines are marching by...
It just happened.
Do you want to bet your server data on that?
Don't mistake in-filesystem deduplication and snapshots for a backup system. It's most certainly not backup and if you treat it as such you will eventually be very sorry. A SAN with ZFS, snapshots, and deduplication features is at best an archive, which is distinct in form and purpose from a backup. Still very useful, though. Ideally you have both archive and backup systems. To get a feel for the difference, consider that an archive is for when a user says, "I overwrote a file last week sometime. Can you recover the version before I made this change or saved over this file?" Whereas a backup is for recovering an entire system after a catastrophic failure (like a SAN dying). Very distinct things. Both are useful.
I get strange looks when I tell people that a Time Capsule is not a backup. Nor is a single Time Machine external disk. Now 2, 3 or even 4 external disks could constitute a backup (and as a bonus with Time Machine an archive also).
First, why would you want it built into a hard drive? Your deduplication ratio would then be limited to what you can store on one drive. The drive would have no way to reference blocks on other drives in the same system. Doing it in software allows you reference (in this case) all data within the entire zpool. That could be petabytes of storage (theoretically it could be far more, but that's probably the realistic limits today due to hardware/performance constraints).
As for your "hell to pay later", that's not true for two reasons. First, there is no "modify in place". All data is allocated from new blocks; that's how a copy-on-write filesystem works. If it's "updated" you'd be allocating new blocks. If you're concerned with filling a pool up completely, you can put quotas in place to prevent it.
Second, if you "run out of space", you just add new drives to the raid group and continue on your merry way. You can grow a zpool on the fly.
Just store one 0 and one 1. Then just store references to each in place of the bits.
Winkey shortcut mapping for 64bit windows. WinKeyPlus
Any filesystem implementing copy-on-write, data dedupe, and/or compression already carries the risk of exhausting oversubscribed storage when compression ratios or data uniqueness turn out differently than anticipated. That's why you have to be pretty explicit with NetApp filers implementing these features that you accept the risk of exhausting allocations if you use them to the point of advertising more storage capacity than you actually have.
You don't even need a fancy filesystem to expose yourself to this today:
$ dd if=/dev/zero of=bigfile bs=1M seek=8191 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00426769 s, 246 MB/s
jbjohnso@wirbelwind:~$ ls -lh bigfile
8.0G 2009-11-02 20:06 bigfile
~$ du -sh bigfile
1.0M bigfile
This possibility has been around a long time and the world hasn't melted. Essentially, if someone is using these features, they should be well aware of the risks incurred.
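The same sparse-file effect as the dd session above can be reproduced from Python. This is only a sketch under a few assumptions: a POSIX system where `os.stat` exposes `st_blocks` in 512-byte units, a filesystem that supports holes, and a hypothetical helper `make_sparse`. It uses 64 MiB rather than 8G so it stays harmless on filesystems that do not support sparse files.

```python
import os
import tempfile


def make_sparse(path: str, size: int) -> None:
    """Create a file of `size` bytes by seeking past a hole and writing one byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")


SIZE = 64 * 1024 * 1024  # 64 MiB apparent size (assumption; the dd example used 8G)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bigfile")
    make_sparse(path, SIZE)
    st = os.stat(path)
    apparent = st.st_size            # what `ls -l` reports
    allocated = st.st_blocks * 512   # what `du` reports (512-byte units on Linux)

print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

On a filesystem with hole support, `allocated` should come out far smaller than `apparent`. Filling in the hole later is what can exhaust a disk that looked like it had room.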
XML is like violence. If it doesn't solve the problem, use more.
I tried this on my RAID-1 system and it got converted to RAID-0.
Pssht! Not good enough, Sun. I require bit-level. I won't be satisfied until I can create a zpool wherein all my data are deduped down to one 1 and one 0.
And the modification of a duplicate block will generate not only a copy-on-write fault but also a law suit by whoever owns the COW patent.
I'd argue that file systems should know about and support three types of files:
That's a useful way to look at files. Almost all files are "unit" files; they're written once and are never changed; they're only replaced. A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another. Those are the programs that have to manage files very carefully, and those programs are usually written to be aware of concurrency and caching issues.
Unix and Linux have the right modes defined. File systems just need to use them properly.
newegg has these on special for 169.00, but the reviews stink. Wait for the SATA version.
> You can tell ZFS to do full byte comparisons rather than relying on the hash if you want full security against hash duplicates:
I once did a similar project with web content caching that replaced some data with a hash of said data, with a way to get to the actual data. All sorts of people were worried about hash conflicts, etc. People are always worried about collisions.
It took a lot of convincing that that risk is lower than a nuclear strike on the data center(s).
What finally did convince my teammates was that 2^256 (~10^77) is, by some estimates, close to the number of elementary particles in the visible universe (within a few orders of magnitude at least).
So assuming the hash function is good (there's no evidence to the contrary), we'd have to try almost as many inputs as there are particles in the universe. The chances of hitting duplicates are so astronomically small that doing byte comparisons is almost certainly useless, and just a check-mark feature for those types who worry about these things. AFAIK there are no known SHA256 duplicates.
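To put rough numbers on this, here is a small sketch of the standard birthday-bound approximation, assuming the hash behaves like a uniformly random function. The function name `collision_probability` and the example pool size of 2^70 blocks are mine, picked only for illustration. Note that the birthday bound puts the coin-flip point nearer 2^128 blocks than 2^256, which is still far out of reach.

```python
import math


def collision_probability(num_blocks: int, hash_bits: int = 256) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)).

    Assumes an ideal hash whose outputs are uniformly distributed.
    """
    exponent = -num_blocks * (num_blocks - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


# Hypothetical pool holding 2^70 unique blocks.
print(collision_probability(2 ** 70))
# Around 2^128 blocks the chance of at least one collision approaches a coin flip.
print(collision_probability(2 ** 128))
```

None of this says anything about a hash function that turns out to be broken. It only covers accidental collisions under the ideal-hash assumption.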
At first, BTRFS started out as an also-ran, trying to duplicate a bunch of ZFS features for Linux (where licensing wasn't compatible to incorporate ZFS into Linux). But then BTRFS took a number of things that were overly rigid about ZFS (shrinking volumes, block sizes, and some other stuff), and made it better, including totally unifying how data and metadata are stored. I'm sure there are a number of ways in which ZFS is still better (RAIDZ), but putting aside some of the enterprise features that most of us don't need, BTRFS is turning out to be more flexible, more expandable, more efficient, and better supported.
So ... any plans on using ZFS on slashdot to help de-duplicate stories?
It's a filesystem. It stores files. efficiently. Uhhh.... that's cool and all
Quit jizzing. Realize the practical benefit to society, meditate on it, and then go back to that righteous ftp client you were writing.
If a hash were a replacement for data. that's all we'd need....goedelize the universe? Sometimes I just want to scream, or weep, or shoot everybody....or just drop to my knees and beg them to think - just a little tiny insignificant bit - think. Maybe it'll add up. Probably not, but it's the best I can do.
I bet EMC is happy they just out-bid NetApp to the tune of $2.4 billion, for basically the same technology that Jeff Bonwick is giving away for free.
ZFS is so sexy, I want a nekkid picture of it.
De dupe has been around for a while and has some advantages and quite a few negatives... First off, I'd be interested to see how many patent trolls this might stumble over. But de-dup has always gone hand in hand with backups and golden images. EMC, HDS and co never did a good job supporting golden images, but other storage vendors have done well with it (3PAR, Compellent, EqualLogic).
For the uninitiated, golden images usually consist of building a machine on a SAN, and then using that one image to power many machines (i.e. the same blocks on disk). It then usually just stores deltas from the golden image for each machine... it's got its advantages and disadvantages much like de-dup.
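A toy sketch of the golden image idea, in Python: one shared, read-only list of blocks plus a per-machine dict of changed blocks. The class names `GoldenImage` and `Clone` are hypothetical and no particular SAN vendor's implementation is implied.

```python
class GoldenImage:
    """Shared, read-only base image: one list of blocks used by every clone."""

    def __init__(self, blocks):
        self.blocks = tuple(blocks)  # tuple so clones cannot modify the base


class Clone:
    """One machine's view: the golden image plus that machine's own deltas."""

    def __init__(self, golden: GoldenImage):
        self.golden = golden
        self.delta = {}  # block index -> this machine's private copy

    def read(self, index: int) -> bytes:
        # Prefer the machine's own changed block, else fall back to the base.
        return self.delta.get(index, self.golden.blocks[index])

    def write(self, index: int, data: bytes) -> None:
        # Writes never touch the golden image; they land in the delta.
        self.delta[index] = data


golden = GoldenImage([b"boot", b"kernel", b"config"])
vm1, vm2 = Clone(golden), Clone(golden)
vm1.write(2, b"config-vm1")
print(vm1.read(2), vm2.read(2), len(vm1.delta), len(vm2.delta))
```

Every clone that has not touched a block reads it from the same place, which is the space saving and also the hot-spot risk.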
Now, the reasons for its use are simple: "pay less for storage", which sounds dumb in this day and age (with 1TB drives costing virtually nothing), but the reality is that in the SAN world 1TB drives cost a fortune, and wherever you use de-dup or golden images you usually use the fastest (and smallest) disk you can get your hands on. (If you don't understand why this is, see the backblaze article from a little while ago - ultimately, putting more space in a bit of SAN storage kit is freaking expensive.) In the enterprise world, it's almost impossible to step away from SAN storage (unless you're google or backblaze).
The big problem with de-dup (and why it's primarily used for backups, and primarily only disk-based backups) is how it affects the storage. If you suddenly have one hot spot, even on fast disk, the storage starts grinding to a halt (even when considering caching) because lots of things start accessing the same blocks on the disk. This is not a problem for backups because it's usually a once-written, rarely-read scenario. On file servers and databases, it's a performance killer (something akin to raid5/6 in software). But de-dup is fantastic for archival storage!... De-dup and performance often tend to be a self-fulfilling prophecy though, simply because data that is duplicated is often duplicated because it's heavily accessed. Take email as a good example. Joe sends out an email with an attachment of some form (perhaps it's a document template, but it really doesn't matter so long as he's sending it to a large number of people), all those people save the attachment and probably make some edits. This introduces the next load of pain, fragmentation. All those deltas from the original now need to be saved "somewhere else" and meanwhile all these people are accessing not just the de-dup'ed blocks but the fragmented changes (consider the kernel source for linux as well, tonnes of branches of code that would possibly get de-duped and fragmented). Databases are another great example. Often in tablespaces there is quite a load of block-aligned duplicate data, often this is the nature of how databases store data. Sometimes this data can be quite critical to their function and to have a database slamming the same blocks (again with small fragmented changes) is pain personified.
Still, i wonder how many patents sun are likely to trip up on... I see this being non-fun as there are many people who make serious cash from de-dup at various levels....
What if that single block goes bad?
- Zav - Imagine a Beowulf cluster of insensitive clods...
I recently wrote an article about my thoughts on filesystems and operating systems by way of a fictional reference OS mentioning ZFS in a positive light for reasons including the dedupe feature mentioned in today's article:
IRON/Cloud — the outline of what a modern OS should be
I link back to the (yes, slashdot) article wherein I first learned about ZFS, and a rundown of the features I like about ZFS.
But no, I checked and our article texts do not hash to the same value, so I do not believe we would be stored at the same location on disk. ;D
People willing to trade their freedom of expression for temporary entertainment deserve neither and will lose both.
Downloading a chunk with SHA.DEADBEAF...
oh ... looky here... DEADBEAF is already on disk.
Done.
As someone who's got a HUGE amount of data currently in ZFS (and a lot of it is redundant!) I can't wait to get my hands on this! I figure on my backup server alone it's going to save me tens of TBs worth of space.
I just wish there were more details on what release of OpenSolaris or Solaris this is going to be in, or patch sets that'll include this!
Yes Francis, the world has gone crazy.
A lot of times these days I use rsync to do hard linked backups, which works mostly well but has some shortcomings. For example, backups across multiple machines don't have their duplicate files hardlinked, and files that are mostly similar can't be hard linked, such as files that grow like log files. More specifically we have some database files that grow with yearly detail information and everything before the newly added records is identical, resulting in gigs of used up space every day during backups when maybe a few megs has changed.
Initially I liked the way BackupPC handled the situation by pooling and compressing all the files, and duplicate files from different backups were automatically linked together. So I wrote a little script that primarily duplicated the functionality of hardlinking duplicate files together regardless of file stat, running on top of fusecompress to get the compression too. The problem mostly is the time it takes to crawl thousands and thousands of files and relink them. On top of that, rsync will not use those duplicate files for hardlinks in the next backup if the file stat info doesn't match, like mtime/owner/etc, which means the next backup contains fresh new copies of files that have to be re-hardlinked by crawling the files again. Plus you don't get elimination of partial file redundancy.
So I looked around some more for a system that would allow you to compress out redundant blocks, and the closest thing I could find is squashfs, but it's read-only. Which sucks because we need to purge daily local backups occasionally to make more room for newer backups. We keep the last 6 months of daily backups available on a server, and do daily offsite backups from that. So once a month we delete the oldest month's backups from the local backup server, and using squashfs you'd have to recreate the whole squash archive, which would suck for a terabyte archive with millions of files in it.
At this point I knew what features I wanted but couldn't find anything that did it yet, so I went ahead and wrote a fuse daemon in python that handles block-level deduplication and compression at the same time. I'm still playing around with it and testing different storage ideas, it's available in git if anyone wants to take a look, you can get it by doing:
git clone http://git.hoopajoo.net/projects/fusearchive.git fusearchive
(note the above command might be mangled because of the auto-linking in slashdot, there should be no [hoopajoo.net] in the actual clone command)
Currently it uses a storage directory with 2 sub directories, store/ and tree/. Inside tree/ are files that contain a hash that identifies the block list for the file contents. This way 2 identical files will only consume the size of a hash on disk + inodes. The hash points to the block that contains the file data block list, which is also a list of hashes of the data. This way any files that have identical blocks (on a block boundary) will have the redundant blocks only take up the size of the hash. Blocks are currently 5M, which can be tuned, and the blocks are compressed using zlib. So a bunch of small files get the benefit of compression and entire-file deduplication, while large growing files will at most use up an extra block of data + the hash info for the rest of the file. So far this seems to be working pretty well; the biggest issue I have is tracking block references so we can free a block when it's no longer referenced by any files. It works fine currently, but since each block contains its own reference counter a crash could make the ref counts incorrect, and unfortunately I can't think of a better, more atomic way to handle that. The other big drawback is speed: it's about 1/3 the speed of native file copying, and from profiling the code 80-90% of the time seems to be spent passing fuse messages in the main fuse-python library, with a little time being taken up by zlib and actual file writes.
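For readers who want the layout above in concrete form, here is a rough in-memory sketch. It is not the fusearchive code: the `Archive` class, the dicts standing in for store/ and tree/, and the choice of SHA-256 are all assumptions made for illustration. Only the two-level hash list, the zlib compression, the 5M default block size and the per-block reference counts come from the description.

```python
import hashlib
import zlib

BLOCK_SIZE = 5 * 1024 * 1024  # 5M default from the description; tunable


class Archive:
    """Toy model of the store/ + tree/ layout (hypothetical, in memory only)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.store = {}  # hash -> [refcount, zlib-compressed bytes]
        self.tree = {}   # file path -> hash of that file's block list

    def _put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        entry = self.store.get(key)
        if entry is None:
            self.store[key] = [1, zlib.compress(data)]
        else:
            entry[0] += 1  # duplicate block: only the counter grows
        return key

    def _get(self, key: str) -> bytes:
        return zlib.decompress(self.store[key][1])

    def _release(self, key: str) -> None:
        entry = self.store[key]
        entry[0] -= 1
        if entry[0] == 0:
            del self.store[key]  # no file references this block any more

    def write_file(self, path: str, data: bytes) -> None:
        if path in self.tree:
            self.delete_file(path)
        size = self.block_size
        hashes = [self._put(data[i:i + size]) for i in range(0, len(data), size)]
        # The block list is itself stored as a block, so identical files share it.
        self.tree[path] = self._put("\n".join(hashes).encode())

    def _block_hashes(self, path: str) -> list:
        listing = self._get(self.tree[path]).decode()
        return listing.split("\n") if listing else []

    def read_file(self, path: str) -> bytes:
        return b"".join(self._get(h) for h in self._block_hashes(path))

    def delete_file(self, path: str) -> None:
        hashes = self._block_hashes(path)
        list_key = self.tree.pop(path)
        for h in hashes:
            self._release(h)
        self._release(list_key)


archive = Archive(block_size=4)  # tiny blocks just for the demo
archive.write_file("mon.log", b"aaaabbbbcccc")
archive.write_file("tue.log", b"aaaabbbbccccdddd")  # grown copy of mon.log
print(len(archive.store))  # physical blocks held, including the two block lists
```

The sketch has the same weakness the post mentions: the reference counts are updated in several separate steps, so a crash between them would leave them wrong.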
If I could get s
Free Online Woodworking Resources Directory
One of the things I like about my Mac is the lack of cutting edges.
Yep.. Both Macs and duplo blocks are safe like that, and aimed at the same demographic.
Reiser has a method for eliminating unwanted bits, but there
is a bug that chroots you inside a jail.
>>> The probability of a hash collision for a 256 bit hash (or even a 128
>>> bit one) is negligible.
Which means idiots will assume that it never happens. In other
news, real estate prices never go down and o-rings on space
shuttles never leak.
>> I run Linux, where's my ZFS?
Upgrade to FreeBSD.
> Log files Log files can only be appended to.
See OpenBSD.
> Managed files Managed files are random-access files managed by a
> database or archive program.
Such a limited view. I have lots of random access files I
maintain with emacs.
It's unlikely you'll have a collision considering it's a 256-bit hash
Probability and actuality are 2 different things. Just because the probability is low doesn't mean it won't happen with the first 2 blocks encountered. I don't see how this (using a hash) can work given that the results are not guaranteed.