New 25x Data Compression?

Posted by ryuzaki0 on Wednesday April 5, 2006 @08:23AM from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.

11 of 438 comments

Sad truths about data compression. by k.a.f. · 2006-04-05 08:38 · Score: 5, Informative

1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.

2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.

3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.
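Point 1 is a counting argument, and it can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names and the 100-bit example are made up for this purpose and come from nowhere in the article.

```python
def count_inputs(n_bits: int) -> int:
    """Number of distinct bit strings that are exactly n_bits long."""
    return 2 ** n_bits


def count_outputs_up_to(max_bits: int) -> int:
    """Number of distinct bit strings of length 0..max_bits (inclusive)."""
    # 2^0 + 2^1 + ... + 2^max_bits = 2^(max_bits + 1) - 1
    return 2 ** (max_bits + 1) - 1


# Hypothetical example: 100-bit inputs, and a claimed 25x ratio,
# so every output would have to fit in 100 // 25 = 4 bits.
n = 100
inputs = count_inputs(n)
outputs = count_outputs_up_to(n // 25)

# A lossless compressor must map different inputs to different outputs,
# so at most `outputs` of the `inputs` strings can reach the claimed ratio.
print(f"{outputs} possible outputs for {inputs} possible inputs")
```

Only 31 of the 2^100 possible inputs could ever hit 25x here. The same count shows that at least one input of any length cannot shrink at all, because there are always fewer shorter strings than inputs.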

Mmmmmmh, salt.
Visit the Diligent WebSite and learn.... by sherpajohn · 2006-04-05 08:54 · Score: 4, Informative

....I mean jeez. They are not in the file compression business, they are in the "data protection" business, specifically disk-based backup. They make NO claim regarding "data compression"; the 25X claim is explicitly about the disk space required to back up data. What they say is that their solution can cut the disk space needed for backups by a factor of 25. It may involve some new compression algorithms, but it appears to rest mostly on never backing up the same data more than once.

--

Going on means going far
Going far means returning
Re:100X - 1000X by irritating+environme · 2006-04-05 09:00 · Score: 3, Informative

This is completely false. There are fundamental mathematical limits to how far data can be compressed losslessly. In fact, each compression format usually adds overhead to the file to store the mapping data needed to decode/decompress it. That overhead plus the compressed data is usually smaller than the original file, until you have run the compressor once or twice. After that the file doesn't compress at all, and the overhead actually increases the overall file size.
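This is easy to try with a stock compressor. A minimal sketch using Python's `zlib`; the sample data (random words from a tiny made-up vocabulary) and the helper name `repeated_compress` are assumptions chosen only for illustration.

```python
import random
import zlib


def repeated_compress(data: bytes, rounds: int) -> list[int]:
    """Compress data with zlib `rounds` times in a row; return the size after each pass."""
    sizes = [len(data)]
    for _ in range(rounds):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


# Hypothetical sample: 20,000 words drawn at random from a tiny vocabulary.
rng = random.Random(0)
vocab = ["backup", "disk", "tape", "delta", "hash", "block", "file", "salt"]
sample = " ".join(rng.choice(vocab) for _ in range(20000)).encode("ascii")

sizes = repeated_compress(sample, 4)
for n, size in enumerate(sizes):
    print(f"after {n} passes: {size} bytes")
```

The first pass should shrink this sample a lot. Later passes typically stop helping, and each one tends to add a few bytes of container overhead instead.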

--

Hey, I'm just your average shit and piss factory.
Re:What kind of data? by tverbeek · 2006-04-05 09:00 · Score: 5, Informative

I just fed Diligent Technology some bogus personal data and downloaded their brochure, and as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set. So your initial full backup will be compressed at mathematically-possible-in-this-universe ratios, and your subsequent incremental backups - which only store the changes compared to the previous backup - will (with typical data scenarios) be much smaller. It's incremental backups on the byte level, basically.
So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.
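A back-of-the-envelope sketch of how that redefinition produces big numbers. Every figure below (1000 GB data set, 30 retained backups, 5% change per backup, 2x ordinary compression) is a made-up assumption, not something taken from the brochure.

```python
def effective_ratio(full_size_gb: float, n_backups: int,
                    change_rate: float, local_ratio: float = 2.0) -> float:
    """Logical data protected divided by disk actually used.

    Assumes the first backup is stored in full and every later backup
    stores only the changed fraction; both are then compressed by
    `local_ratio` (ordinary, single-stream compression).
    """
    logical = full_size_gb * n_backups
    first = full_size_gb / local_ratio
    deltas = (n_backups - 1) * full_size_gb * change_rate / local_ratio
    return logical / (first + deltas)


# One backup: just the ordinary compression ratio.
print(effective_ratio(1000, 1, 0.05))

# Thirty retained backups with 5% change each: roughly 24.5x "compression".
print(round(effective_ratio(1000, 30, 0.05), 1))
```

Under these assumptions the headline ratio grows with the number of backups kept, while any single data set is still only compressed 2x.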

--
http://alternatives.rzero.com/
Re:Heard this before - OWS by CAR912 · 2006-04-05 09:10 · Score: 2, Informative

This seems good, otherwise Google for "ows compression OR compress OR compressor", and according to this, OWS stands for the author's initials.

--
- Move "Sig". For great justice!
Re:What kind of data? by fyndor · 2006-04-05 09:16 · Score: 5, Informative

You hit the nail right on the head. No compression scheme can ever claim to compress anything by ANY set value, unless the value you're talking about is zero :) Otherwise you could take the output of a compression process and compress it 25 times more. Then take that output and compress it 25 times more. Then take that output... See where I'm going? You could say that MOST files of DATATYPE_X will compress UP TO 25x, but there will always be the exception to the rule. There is no such thing as a free lunch. You can't have infinite compression... but it'd sure be a lot cooler if ya did :)
Might work for typical back-up by porttikivi · 2006-04-05 09:53 · Score: 2, Informative

The article talks about backup. The idea could be that instead of managing incremental backups, you just optimize compression of data that is similar to old data. That way you can do "full" backups, but actually save only an incremental backup's worth of data.

See http://en.wikipedia.org/wiki/Venti for similar ideas in a system that easily achieves 25x compression for typical archival storage. When a file has been changed only those 512 kbyte blocks that are really new are saved, other blocks are just mapped by their SHA1 hashes to existing blocks. So files with small changes, very similar files and files sharing common parts will all compress very nicely. In a multi-user system the files of different users tend to also have lots of similar parts: same emails, same office documents with perhaps minor changes, same reference material / tools / libraries as personal copies etc.
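The hash-addressed block idea can be sketched in a few lines. This is not Venti's actual code or API; the `BlockStore` class, its method names, and the small 4096-byte block size are assumptions made up for illustration.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: each unique block is kept once, keyed by its SHA-1."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> list[str]:
        """Split data into fixed-size blocks, store the new ones, return the list of hashes."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            key = hashlib.sha1(block).hexdigest()
            self.blocks.setdefault(key, block)  # an already-known block costs nothing extra
            recipe.append(key)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Rebuild the original data from its list of block hashes."""
        return b"".join(self.blocks[key] for key in recipe)

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


store = BlockStore()
# Hypothetical file: 64 distinct blocks; version 2 changes only block 10.
v1 = b"".join(bytes([i]) * 4096 for i in range(64))
v2 = v1[:10 * 4096] + bytes([200]) * 4096 + v1[11 * 4096:]

r1 = store.put(v1)
r2 = store.put(v2)
print(len(v1) + len(v2), "logical bytes,", store.stored_bytes(), "stored bytes")
```

The second version costs one extra block rather than a whole second copy. One caveat of this simplified sketch: with fixed-size blocks, inserting a byte near the start of a file shifts every later block boundary, so none of the following blocks match any more.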

My guess is TFA refers to a re-invention of this wheel, most likely in an inferior way.

--
Anssi Porttikivi / app@iki.fi
Entirely possible by Coward+Anonymous · 2006-04-05 10:18 · Score: 2, Informative

This is entirely possible and they are not the only ones doing it, for example http://www.datadomain.com/ has been doing it for a while. The big storage vendors do it to some extent as well.
The idea is based on "de-duplication" of data and is only really practical for backups (where most data from backup to backup is identical) or central repositories of data for a large organization that has multiple similar data sets, for example, many installations of Windows that are often similar.
From my experience x25 is a bold claim for general data. I've seen small scale tests that showed x30 compression over backup sets but those implementations had performance issues.
From the description in their white-paper, despite their claims, it appears they are performing some kind of hash by definition (i.e. mapping a space to a smaller space).
Well that's not surprising. by Ayanami+Rei · 2006-04-05 12:15 · Score: 5, Informative

That's called the law of large numbers.
Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
Given a large enough set of backups and enough time, the potential size savings are enormous.
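For the curious, here is a sketch of the rolling weak checksum that makes rsync-style matching cheap: sliding the window one byte updates the checksum in constant time instead of re-summing the whole window. It is a simplification, not rsync's source; the function names are made up, and the byte-for-byte comparison stands in for the strong hash that real rsync uses to confirm a candidate match.

```python
MOD = 1 << 16


def weak_checksum(window: bytes) -> tuple[int, int]:
    """Adler-style checksum: a = sum of bytes, b = position-weighted sum, both mod 2^16."""
    n = len(window)
    a = sum(window) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(window)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide an n-byte window one position: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def find_block(data: bytes, block: bytes) -> int:
    """Return the first offset where block occurs in data, or -1 if it does not."""
    n = len(block)
    if n == 0 or n > len(data):
        return -1
    target = weak_checksum(block)
    a, b = weak_checksum(data[:n])
    for offset in range(len(data) - n + 1):
        # Cheap checksum test first; confirm with a real comparison on a hit.
        if (a, b) == target and data[offset:offset + n] == block:
            return offset
        if offset + n < len(data):
            a, b = roll(a, b, data[offset], data[offset + n], n)
    return -1


# A block from the old file is still found after two bytes were inserted in front.
old_block = b"quarterly-report-2006"
new_file = b"XX" + old_block + b" plus some appended text"
print(find_block(new_file, old_block))
```

Because the match is found at any byte offset, a small insertion only costs the inserted bytes rather than everything after them.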

Veritas should really be implementing this themselves, though.

And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Re:You geek! by Anonymous Coward · 2006-04-05 18:46 · Score: 2, Informative

Last time he was at your mom's house
it's a CVS!! by TheLoneCabbage · 2006-04-05 20:47 · Score: 3, Informative

This is a backup system, not a single file compression (although for framed data like video, email, etc.. the compression scheme is still clever).

Basically it's a CVS. If you're backing up multiple computers or user directories, you're going to see tons of repeated files; heck, they'll even have the same names. Saving the diffs is a good idea, and not at all difficult to duplicate.

For instance, what if you were doing backups for a team of animators? Their files are HUGE, but 90% of the frames will be identical between the individual systems (indeed, the frames themselves will likely be very similar to one another). You could get far more than 25x compression that way. The big downside of this idea is the memory & CPU vs. speed trade-off. You can't use this kind of system to back up to tape or DVD; it needs random-access media.

You could probably get nearly the same results by hacking rsync and diffing identical file names in different directories. Possible bonus for diffing files of similar file type.

It's a clever idea, not a radical new technology.

--
I would rather be ashes than dust!