ZFS Gets Built-In Deduplication
elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
Duplicate slashdot articles will be links back to the original one?
Surely with high amounts of data (that zfs is supposed to be able to handle), a hash collision may occur? I'm sure a block is > 256bits. Do they just expect this never to happen?
Although I suppose they could just be using it as a way to narrow down candidates for deduplication... doing a final bit for bit check before deciding the data is the same.
"Infecting minds with my own memetic virus, one post at a time." Ultimape
Are there any other filesystems with that feature? If not, I'm very strongly considering writing my own.
I'm wondering how long its going to take for them to do something with ZFS that actually makes me slow down my overwhelming ZFS fanboyism.
I just love these guys.
My virtual machine NFS server is going to have to get this as soon as FBSD imports it, and I'll no longer have to worry about having backup software (like BackupPC, good stuff btw) that does this.
I don't use high end SANs but it would seem to me that they are rapidly losing any particular advantage to a Solaris or FBSD file server.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Use open source, get cutting edge things.
# cat
Damn, my RAM is full of llamas.
ZFS, from what I can tell, kicks ass. I've played around with it in virtual machines, taking drives off line, recreating them, adding drives, etc.
When I search NewEgg I also search OpenSolaris' compatibility list.
The two areas that Linux is playing catchup is Filesystems (like this) and Sound (OSS, Pulse, Alsa Oh My!). And before you go pointing out the btrfs project, this has been in servers for years. It's tried in an enterprise environment. Your file system is still in beta with a huge "Don't use this for important stuff" warning.
Imagine he amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics. It could even do this quietly in the background.
I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.
But since the copy operation de-compressed files on the fly we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copy and delete files beginning with the smallest, and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.
De-duplication is pretty much the same thing, compression by recording and eliminating duplicates. But any minor automated update of some files runs the risk of changing them such that what was a duplicate, must now be stored separately.
This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").
For archival stuff or OS components (executables, and source code etc) which virtually never change this would be great.
But there is a hell to pay somewhere down the road.
Sig Battery depleted. Reverting to safe mode.
Use open source, get cutting edge things.
The last time I tried to build an Intel box for Linux work, I lost my grip on the cheap generic case, and sustained a cut that sent me to the emergency room. One of the things I like about my Mac is the lack of cutting edges.
Shoulda gone with a blade server, then you wouldn't have had to worry about the emergency room.
Not sure when you tried building it, but I build cheap computers for friends / family, at least 2 or 3 computers a year. Almost a decade ago... maybe really only 8 years ago, all cheapo generic cases stopped having razor sharp edges. I used to get cuts all the time, but cheap cases, at least in the realm of having sharp edges, haven't been an issue in a long time. (I purchase all my cheapo cases from newegg these days)
If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in doing the deduplication process, and drive space at consumer-level sizes is dirt-cheap, so it's only really worth doing this you have a lot of block-level duplicate data. That might be the case if e.g. you have 30 VMs on the same machine each with a separate install of the same OS, but is unlikely to be the case on a normal Mac laptop.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Any filesystem implementing copy-on-write at all, data dedupe, and/or compression is already a strategy where the risk of exhausting oversubscribed storage due to unanticipated compression ratios or uniqueness is a risk. It's a reason why you have to be pretty explicit to NetApp filers implementing these features that you are accepting the risk of exhausting allocations if you actually make use of these features to the point of advertising more storage capacity than you actually have.
You don't even need a fancy filesystem to expose yourself to this today:
$ dd if=/dev/zero of=bigfile bs=1M seek=8191 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00426769 s, 246 MB/s
jbjohnso@wirbelwind:~$ ls -lh bigfile
8.0G 2009-11-02 20:06 bigfile
~$ du -sh bigfile
1.0M bigfile
This possibility has been around a long file and the world hasn't melted. Essentially, if someone is using these features, they should be well aware of the risks incurred.
XML is like violence. If it doesn't solve the problem, use more.
Umm, dia, nethack, perl, emacs?
I'd argue that file systems should know about and support three types of files:
That's a useful way to look at files. Almost all files are "unit" files; they're written once and are never changed; they're only replaced. A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another. Those are the programs that have to manage files very carefully, and those programs are usually written to be aware of concurrency and caching issues.
Unix and Linux have the right modes defined. File systems just need to use them properly.
But if it breaks, or doesn't work, or you've hit a deadline on a project and can't deliver because Wine or the application broke, who are you going to call for support exactly? Not the people who made the software. Are you going to email the Wine mailing list and then, when they fail to deliver a timely solution for free, tell the client that open source is to blame?
At least when I buy software, or make purchasing decisions from a business standpoint, knowing that the company will stand behind the product and our implementation of it is more important than that trying to pursue some ideal about information and it's anthropomorphized desire to be free.
At first, BTRFS started out as an also-ran, trying to duplicate a bunch of ZFS features for Linux (where licensing wasn't compatible to incorporate ZFS into Linux). But then BTRFS took a number of things that were overly rigid about ZFS (shrinking volumes, block sizes, and some other stuff), and made it better, including totally unifying how data and metadata are stored. I'm sure there are a number of ways in which ZFS is still better (RAIDZ), but putting aside some of the enterprise features that most of us don't need, BTRFS is turning out to be more flexible, more expandable, more efficient, and better supported.
Use open source, get cutting edge things.
I run Linux, where's my ZFS? No, FUSE doesn't count.
Time Machine has by far the easiest to use interface of any backup solution that is at least as powerful. And because it does file-level deduplication using hardlinks, the backups themselves are standard directory trees, browsable in every way that the rest of the filesystem is. If the block-level deduplication is part of the backup software and not the filesystem, the the archives will be opaque files that usually can only be manipulated by the backup software itself. This means the user has little or no choice between UIs for restoring from the archive, and it usually prevents the archives from being indexed by something like Spotlight.
By adding one small filesystem feature (hardlinks to directories), Apple made it possible to trivially implement a good incremental backup system. (The under-the-hood parts of Time Machine could be implemented in a fairly short shell script run as a cron job.) They then proceeded to put the slickest UI ever around their backup system, but still left it open for other programs. If Apple added block-level deduplication to their filesystem, they wouldn't even have to touch the Time Machine code and it would become the best personal backup software in history.
My company used a X4500 and we discovered the bug that caused Sun to make the X4540 - the Marvell SATA chipset in the X4500 had a serious bug in firmware that was exacerbated by the Solaris X86 Marvell chipset driver.
Under heavy small block random IO intermingled with heavy sequential large block IO, the box would kernel panic and hang - only a power cycle would reset the box.
Sun ended up refunding us the cost of the servers and providing us exceptionally large incentives to purchase Sun StorageTek storage.
It wouldn't surprise me if the X4540 would have similar issues because they were rushing to replace the X4500 to try and minimize the possibility about bad PR over the X4500 being amazingly unstable.
This is why I'll be waiting for FreeBSD to support this because they will probably have better SATA chipset drivers and the chances of the system hanging because the Solaris kernel drivers for the SATA chipset (nevermind that it's a SATA chipset that Sun put into their own board).
If a hash were a replacement for data. that's all we'd need....goedelize the universe? Sometimes I just want to scream, or weep, or shoot everybody....or just drop to my knees and beg them to think - just a little tiny insignificant bit - think. Maybe it'll add up. Probably not, but it's the best I can do.
De dupe has been around for a while and has some advantages and quite a few negatives... First off, i'd be interested to see how many patent trolls this might stumble over. But de-dup has always gone hand in hand with backups and golden images. EMC, HDS and co never did a good job supporting golden images, but other storage have done well with it (3par, compellent, equalogic).
For the uninitiated, golden images usually consist of building a machine on a SAN, and then using that one image to power many machines (i.e. the same blocks on disk). It then usually just stores deltas from the golden image for each machine... its got its advantages and disadvantages much like de-dup.
Now, the reasons for its use are simple "pay less for storage" which sounds dumb in this day and age (with 1tb drives costing virtually nothing), but the reality is in the SAN world 1tb drives cost a fortune and wherever you use de-dup or golden images, you usually use the fastest (and smallest) disk you can get your hands on. (if you dont understand why this is, see the backblaze article from a little while ago - ultimately, putting more space in a bit of SAN storage kit is freeking expensive). In the enterprise world, its almost impossible to step away from SAN storage (unless your google or backblaze).
The big problem with de-dup (and why its primarily used for backups, and primarily only disk-based backups) is how it effects the storage. If you suddenly have one hot spot, even on fast disk, the storage starts grinding to a halt (even when considering caching) because lots of things start accessing the same blocks on the disk. This is not a problem for backups because its usually a once-written, rarely-read scenario. On file servers and databases, its a performance killer (something akin to raid5/6 in software). But de-dup is fantastic for archival storage!... De-dup and performance often tend to be a self-fulfilling prophecy though, simply because data that is duplicated is often duplicated cause its heavily accessed. Take email as a good example. Joe sends out an email with an attachment of some form (perhaps its a document template, but it really doesn't matter so long as he's sending it to a large number of people), all those people save the attachment and probably make some edits. This introduces the next load of pain, fragmentation. All those delta's from the original now need to be saved "somewhere else" and meanwhile all these people are accessing not just the de-dup'ed blocks but the fragmented changes (consider the kernel source for linux as well, tonnes of branches of code that would possible get de-duped and fragmented). Databases are another great example. Often in tablespaces there is quite a load of block-alligned duplicate data, often this is the nature of how databases store data. Sometimes this data can be quite critical to their function and to have a database slamming the same blocks (again with small fragmented changes) is pain personified.
Still, i wonder how many patents sun are likely to trip up on... I see this being non-fun as there are many people who make serious cash from de-dup at various levels....
The same people you call when your proprietary system breaks and you discover that the official tech support people can't find their posterior with both hands and a map. Most cities have a number of grief councilors ready to support you in your time of need. If it was really critical, try the suicide hotline.