ZFS Gets Built-In Deduplication
elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
I'd argue that file systems should know about and support three types of files:
That's a useful way to look at files. Almost all files are "unit" files; they're written once and are never changed; they're only replaced. A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another. Those are the programs that have to manage files very carefully, and those programs are usually written to be aware of concurrency and caching issues.
Unix and Linux have the right modes defined. File systems just need to use them properly.