New 25x Data Compression?
modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.
1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.
2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.
3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.
Mmmmmmh, salt.
So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.
http://alternatives.rzero.com/
You hit the nail right on the head. No compression can ever make a statement that it can compress anything by ANY set value, unless the value your talking about is zero :) This would imply that you could compress the output of a compression process and compress it 25 times more. Then take that output and comress it 25 times more. Then take that output... See where I'm going? You could say that MOST files of DATATYPE_X will compress UP TO 25x, but there will always be the exception to the rule. There is no such thing as a free lunch. You can't have infinite compression... but it'd sure be a lot cooler if ya did :)
That's called the law of large numbers.
Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
Given a large enough set of backups and enough time, the potential size savings is enormous.
Veritas should really be implementing this themselves, though.
And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON