Data Deduplication Comparative Review
snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."
I think it depends on which scheme you're talking about.
Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).
De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.
What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.
Does that answer the question?
Check out my sci-fi/humor trilogy at PatriotsBooks.