New 25x Data Compression?

Posted by ryuzaki0 on Wednesday April 5, 2006 @08:23AM from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete has been released yet, so take it with the requisite grain of salt.

Sad truths about data compression. by k.a.f. · 2006-04-05 08:38 · Score: 5, Informative

1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.

2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.

3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.
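
The counting argument behind point 1 can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the 100-bit input size is an arbitrary assumption.

```python
from fractions import Fraction


def strings_of_length(n_bits: int) -> int:
    """Number of distinct bit strings that are exactly n_bits long."""
    return 2 ** n_bits


def strings_up_to(max_bits: int) -> int:
    """Number of distinct bit strings of length 0..max_bits inclusive."""
    return 2 ** (max_bits + 1) - 1


def max_fraction_compressible(n_bits: int, factor: int) -> Fraction:
    """Upper bound on the share of n_bits-long inputs that any lossless
    compressor can shrink to n_bits // factor bits or fewer.

    Lossless means distinct inputs need distinct outputs, so the number
    of short outputs caps the number of inputs that can reach them.
    """
    short_outputs = strings_up_to(n_bits // factor)
    all_inputs = strings_of_length(n_bits)
    return Fraction(min(short_outputs, all_inputs), all_inputs)


# Hypothetical example: 100-bit inputs and the claimed factor of 25.
bound = max_fraction_compressible(100, 25)
print(f"at most {float(bound):.3e} of all 100-bit inputs can shrink 25x")
```

Even for tiny inputs the bound is vanishingly small, which is why a fixed ratio can only ever hold for particular kinds of data, never for every stream.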

Mmmmmmh, salt.
Re:What kind of data? by tverbeek · 2006-04-05 09:00 · Score: 5, Informative

I just fed Diligent Technology some bogus personal data and downloaded their brochure, and as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set. So your initial full backup will be compressed at mathematically-possible-in-this-universe ratios, and your subsequent incremental backups - which only store the changes compared to the previous backup - will (with typical data scenarios) be much smaller. It's incremental backups on the byte level, basically.
So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.
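
A rough back-of-the-envelope sketch of how that redefinition produces big numbers, in Python. Every figure here is an assumption picked for illustration (a 2x ratio on ordinary data, 2% of the data changing between backups), not something taken from the brochure, and `effective_ratio` is a made-up name.

```python
def effective_ratio(full_size: float, base_ratio: float,
                    change_rate: float, n_backups: int) -> float:
    """Ratio of 'data protected' to 'disk actually used' after n_backups.

    full_size    size of one full backup, in any unit
    base_ratio   ordinary compression ratio on the data (assumed)
    change_rate  fraction of the data that differs between backups (assumed)
    """
    # What the marketing counts: every backup as if it were a full copy.
    logical = full_size * n_backups
    # What lands on disk: one compressed full copy, then compressed deltas.
    stored = full_size / base_ratio
    stored += (n_backups - 1) * full_size * change_rate / base_ratio
    return logical / stored


for n in (1, 10, 30):
    print(n, "backups ->", round(effective_ratio(1.0, 2.0, 0.02, n), 1), "x")
```

With a single backup the ratio is just the ordinary one; it only climbs past 25x once many near-identical backups are counted against the same stored base.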

--
http://alternatives.rzero.com/
Re:What kind of data? by fyndor · 2006-04-05 09:16 · Score: 5, Informative

You hit the nail right on the head. No compression scheme can ever claim that it can compress anything by ANY set value, unless the value you're talking about is zero :) Otherwise you could take the output of a compression process and compress it 25 times more. Then take that output and compress it 25 times more. Then take that output... See where I'm going? You could say that MOST files of DATATYPE_X will compress UP TO 25x, but there will always be the exception to the rule. There is no such thing as a free lunch. You can't have infinite compression... but it'd sure be a lot cooler if ya did :)
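
The "compress the output again" thought experiment is easy to try. Below is a small Python sketch using the standard `zlib` module as a stand-in compressor (an assumption; it has nothing to do with the product in the story). The helper names and the sample inputs are invented for the example.

```python
import random
import zlib
from typing import List


def recompress_sizes(data: bytes, rounds: int) -> List[int]:
    """Compress data, then compress the result, and so on.

    Returns the size before any compression followed by the size
    after each round.
    """
    sizes = [len(data)]
    for _ in range(rounds):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Deterministic noise, standing in for already-compressed data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# Highly repetitive text: the first round helps a lot, later rounds do not.
text = b"the quick brown fox jumps over the lazy dog\n" * 2000
print("repetitive text:", recompress_sizes(text, 6))

# Noise: there is nothing to exploit, so the output is no smaller.
print("noise:          ", recompress_sizes(pseudo_random_bytes(4096), 3))
```

The sizes bottom out after a round or two and then creep back up as each pass adds its own header overhead, which is the free-lunch problem in practice.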
Well that's not surprising. by Ayanami+Rei · 2006-04-05 12:15 · Score: 5, Informative

That's called the law of large numbers.
Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file has changed only slightly between backups. They use a coarser-grained whole-file approach (which is very reliable, though, and already stores only one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by using rolling checksums, hash tables and other tricks to get binary deltas of large files, and transmitting only those changes.
Given a large enough set of backups and enough time, the potential size savings is enormous.
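
For the curious, here is a much-simplified Python sketch of the binary-delta idea in the spirit of rsync: a cheap rolling checksum finds candidate blocks, and a strong hash confirms them. It is not rsync's actual code and not the vendor's method; the block size, the use of SHA-1 and all function names are assumptions made for the example.

```python
import hashlib
import random

BLOCK = 64          # block size in bytes (arbitrary choice for the demo)
MOD = 1 << 16


def weak_checksum(block: bytes):
    """Cheap two-part checksum that can be updated one byte at a time."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(weak, out_byte: int, in_byte: int, size: int):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a, b = weak
    a = (a - out_byte + in_byte) % MOD
    b = (b - size * out_byte + a) % MOD
    return a, b


def make_delta(old: bytes, new: bytes, block: int = BLOCK):
    """Describe new as ("copy", offset_in_old) and ("data", bytes) ops."""
    index = {}
    for off in range(0, len(old) - block + 1, block):
        chunk = old[off:off + block]
        entry = (hashlib.sha1(chunk).digest(), off)
        index.setdefault(weak_checksum(chunk), []).append(entry)

    ops, literal, pos, weak = [], bytearray(), 0, None
    while pos + block <= len(new):
        if weak is None:
            weak = weak_checksum(new[pos:pos + block])
        match = None
        if weak in index:
            # Weak hit: confirm with the strong hash before trusting it.
            strong = hashlib.sha1(new[pos:pos + block]).digest()
            match = next((o for s, o in index[weak] if s == strong), None)
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += block
            weak = None            # recompute at the new position
        else:
            literal.append(new[pos])
            if pos + block < len(new):
                weak = roll(weak, new[pos], new[pos + block], block)
            pos += 1
    literal.extend(new[pos:])      # tail shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(old: bytes, ops, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the old file plus the delta ops."""
    out = bytearray()
    for kind, value in ops:
        out += old[value:value + block] if kind == "copy" else value
    return bytes(out)


def sample_data(n: int, seed: int = 0) -> bytes:
    """Deterministic pseudo-random bytes used as a stand-in 'large file'."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


old = sample_data(4096)
new = old[:1000] + b"INSERTED" + old[1000:]
delta = make_delta(old, new)
literal_bytes = sum(len(v) for k, v in delta if k == "data")
print(len(new), "bytes in new file,", literal_bytes, "stored as literal data")
```

A whole-file comparison would treat `new` as an entirely different file, while the delta only carries the few dozen bytes around the insertion plus references to blocks that are already stored.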

Veritas should really be implementing this themselves, though.

And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON