New 25x Data Compression?

Posted by ryuzaki0 on Wednesday April 5, 2006 @08:23AM from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.

32 of 438 comments

What kind of data? by Short+Circuit · 2006-04-05 08:24 · Score: 4, Insightful

I can create a compression algorithm that compresses my 2GB of data to 1 bit. But it would be crap for any other datastream fed to it.

--
tasks(723) drafts(105) languages(484) examples(29106)
1. Re:What kind of data? by ivan256 · 2006-04-05 08:27 · Score: 5, Insightful
  
  The article says:
  
  it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.
  
  In other words, they're full of crap.
2. Re:What kind of data? by slimey_limey · 2006-04-05 08:33 · Score: 4, Insightful
  
  So it can compress its own output? Sweet....
  
  --
  ☠
3. Re:What kind of data? by swimboy · 2006-04-05 08:40 · Score: 5, Funny
  
  It can compress anything! At the demo, I saw them compress 25 oz. of snake oil so that it all fit in a 1 oz. jar!
  
  --
  Ask me how the Heisenberg Principle may or may not have saved my life.
4. Re:What kind of data? by devjoe · 2006-04-05 08:52 · Score: 5, Insightful
  
  Well, there's an idea here that might hold some truth. Note that they are marketing it to data centers, people with LOTS and LOTS of files. Because people tend to have multiple copies of the same files, they can achieve great compression by eliminating the duplicate copies in the archive -- or likewise, any files with large sections that are the same among various files.
  
  20 email accounts subscribed to the same mailing list? Store the bodies of those e-mails only once, and you save a big chunk of disk space. A bunch of people downloaded the same MP3 file? We only need one copy in the archive. As long as there are multiple copies of the same data, it can compress any type of data.
  
  The difference here is that they are taking advantage of the redundancy of files across an entire filesystem (and a HUGE one), rather than the redundancies within an individual file. (I would assume they also do the latter type of compression with a conventional algorithm.) 25x compression seems extreme, but I am sure they can achieve some extra compression here.
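
A minimal sketch of the cross-file de-duplication described in this comment. The fixed 4 KiB chunk size, the SHA-256 digests, and the names `dedup_store` and `restore` are illustrative assumptions, not anything the vendor has published.

```python
import hashlib

def dedup_store(files, chunk_size=4096):
    """Store each distinct chunk once; return (store, recipes)."""
    store = {}    # chunk digest -> chunk bytes (kept once, however often it recurs)
    recipes = {}  # file name -> ordered list of chunk digests
    for name, data in files.items():
        digests = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            digests.append(digest)
        recipes[name] = digests
    return store, recipes

def restore(store, recipe):
    """Rebuild a file from its list of chunk digests."""
    return b"".join(store[d] for d in recipe)

# Hypothetical workload: 20 mailboxes that each hold the same 8 KiB message.
message = b"".join(i.to_bytes(4, "big") for i in range(2048))
files = {f"mailbox{i}": message for i in range(20)}

store, recipes = dedup_store(files)
logical = sum(len(d) for d in files.values())    # bytes the users think they have
physical = sum(len(c) for c in store.values())   # bytes actually kept
print(logical, physical, logical / physical)
```

The ratio here comes entirely from the duplicates: a single unique file would get no benefit from this step at all.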
5. Re:What kind of data? by tverbeek · 2006-04-05 09:00 · Score: 5, Informative
  
  I just fed Diligent Technology some bogus personal data and downloaded their brochure, and as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set. So your initial full backup will be compressed at mathematically-possible-in-this-universe ratios, and your subsequent incremental backups - which only store the changes compared to the previous backup - will (with typical data scenarios) be much smaller. It's incremental backups on the byte level, basically.
  So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.
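
A rough back-of-the-envelope version of that redefinition. All the numbers are hypothetical (2:1 conventional compression, 5% of the data changing between backups, 30 retained backups), and `effective_ratio` is a made-up helper name; the brochure gives no such figures.

```python
def effective_ratio(backups, change_rate, base_ratio=2.0):
    """Logical size of all retained backups divided by bytes actually stored.

    Sizes are in units of one full backup. The first backup is stored whole,
    every later one stores only the changed fraction, and everything stored
    is assumed to shrink by base_ratio under ordinary compression.
    """
    logical = float(backups)
    physical = (1.0 + (backups - 1) * change_rate) / base_ratio
    return logical / physical

# One backup: only the conventional 2:1 ratio.
print(effective_ratio(1, 0.05))
# Thirty retained backups with 5% churn each: the headline number appears.
print(effective_ratio(30, 0.05))
```

Under those assumptions the "compression ratio" grows with the number of retained backups, which is why the figure says little about any single data set.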
  
  --
  http://alternatives.rzero.com/
6. Re:What kind of data? by fyndor · 2006-04-05 09:16 · Score: 5, Informative
  
  You hit the nail right on the head. No compression scheme can ever claim that it can compress anything by ANY set value, unless the value you're talking about is zero :) Otherwise you could compress the output of a compression process and compress it 25 times more. Then take that output and compress it 25 times more. Then take that output... See where I'm going? You could say that MOST files of DATATYPE_X will compress UP TO 25x, but there will always be the exception to the rule. There is no such thing as a free lunch. You can't have infinite compression... but it'd sure be a lot cooler if ya did :)
7. Re:What kind of data? by networkBoy · 2006-04-05 09:35 · Score: 4, Funny
  
  1.
  I can compress anything you give me by a factor of at least 1 (inclusive of my own output).
  
  "-1 pedantic", I know.
  -nB
  
  --
  whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
8. Re:What kind of data? by TheNetAvenger · 2006-04-05 10:06 · Score: 5, Funny
  
  In other words, they're full of crap.
  
  But the Slashdot Post says that it all runs on Linux. And knowing the infinite power of Linux, I believe them.
  
  In addition to being the best OS in the world, Linux is also the most secure, does everything better than every other OS, and if given the right developers it is the ONLY OS that could do something as impressive as compress data past the limits of possibility.
  
  I'm sure with the right developer, Linux could also be used to harness zero point energy, create wormholes for travel in your basement, and possibly cure most diseases... /wink
*sniff* by bryanp · 2006-04-05 08:25 · Score: 4, Insightful

*sniff* *sniff* *sniff*

I smell ... vapor.

--
"An unarmed man can only flee from evil, and evil is not overcome by fleeing from it." Col. Jeff Cooper
Limited application by Locke2005 · 2006-04-05 08:25 · Score: 4, Funny

Yes, it can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

--
I've abandoned my search for truth; now I'm just looking for some useful delusions.
1. Re:Limited application by Bull999999 · 2006-04-05 08:27 · Score: 5, Funny
  
  I, too, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
  
  --
  1f u c4n r34d th1s u r34lly n33d t0 g37 l41d
2. Re:Limited application by sprag · 2006-04-05 09:00 · Score: 4, Funny
  
  I, as well, welcome our 1/25th of original size overlords... but it only works on hot grits articles, which are highly compressible due to the large amount of petrified data.
3. Re:Limited application by tshak · 2006-04-05 09:08 · Score: 4, Funny
  
  I, wanting cheap karma, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
  
  --
  
  There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips
4. Re:Limited application by complete+loony · 2006-04-05 12:36 · Score: 4, Funny
  
  I, forgetting that funny doesn't give karma, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
  
  --
  09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Heard this before by Jordan+Catalano · 2006-04-05 08:27 · Score: 5, Interesting

Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.
1. Re:Heard this before by Orgasmatron · 2006-04-05 11:09 · Score: 4, Interesting
  
  Yup, that was OWS. You actually could delete the original file, but once it got overwritten, or if it wasn't available, you couldn't deOWS it any more.
  
  Back in the day, I figured out what was going on when I took a disk to another machine, couldn't restore the file. I then tested the disk in the machine I had made the archive on, and it worked fine. It was a good hoax. We all got a good laugh out of it.
  
  --
  See that "Preview" button?
The proof... by jforest1 · 2006-04-05 08:28 · Score: 5, Funny

It's true! It compressed my 10GB collection of ASCII PR0N into 1 meg!
/dev/zero ? by slimey_limey · 2006-04-05 08:31 · Score: 5, Funny

dd if=/dev/zero bs=1m count=1m | lzop - | gzip -f -| gzip -f - | gzip -f - | wc

gives about three kilobytes for a terabyte of data.
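
The same trick at a smaller scale, as a sketch using Python's `zlib` module instead of the `lzop`/`gzip` pipeline above. The 16 MiB size is an arbitrary stand-in for a terabyte of `/dev/zero`.

```python
import zlib

size = 16 * 1024 * 1024             # 16 MiB of zero bytes, standing in for /dev/zero
zeros = bytes(size)

once = zlib.compress(zeros, 9)      # first pass: long runs collapse into back-references
twice = zlib.compress(once, 9)      # second pass: the first output is itself very repetitive

print(size, len(once), len(twice))

# Both passes are lossless, so the original comes back bit for bit.
assert zlib.decompress(zlib.decompress(twice)) == zeros
```

This only works because the input carries almost no information; it says nothing about how the same tools behave on real data.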

--
☠
Incomplete Article Summary by bigtallmofo · 2006-04-05 08:31 · Score: 5, Funny

The summary should have read...

StorageMojo is reporting that a company named Practical Nano Cold Fusion Duke Nukem Forever at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital...

--
I'm a big tall mofo.
Dubious by pilkul · 2006-04-05 08:32 · Score: 4, Insightful

Stuff like new compression algorithms generally comes out in academic papers, which are then applied in practice by regular programmers. That's what happened with the Burrows-Wheeler algorithm at the core of bzip2. Some company concerned with mostly implementation rather than theory wouldn't come up with a revolutionary advance. The writeup is very vague, but it sounds to me like they're just using a simple LZ type algorithm, and they're only claiming 25x compression if the data is mostly the same already. Well duh.
sounds like a O(n^n^n) problem. by Ancient_Hacker · 2006-04-05 08:33 · Score: 4, Interesting
Couple "issues":
- The cost of disk space versus the cost in computer time in finding all the matching substrings. Disk space gets bigger a whole lot faster and easier than CPUs speed up, so even if this idea is economically feasible today, it can only get worse from here.
- This scheme may work just swell with some data streams, but probably pathologically awful with others. A good example: a billion empty records in a database might be compressed to a very few bytes. The system operator relaxes, and lets a log file fill up the rest of the disk. Then a bunch of database records need to be added, or the existing records need some sequential numbering added and guess what? There's no space for the new records, or to expand the existing ones. Argh.
Re:Breaking news! by nizo · 2006-04-05 08:35 · Score: 4, Funny

Maybe it is lossy compression, which would be really nice when compressing executables and old spreadsheets.

--
I Am My Own Worst Enemy
Sad truths about data compression. by k.a.f. · 2006-04-05 08:38 · Score: 5, Informative

1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.

2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.

3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.

Mmmmmmh, salt.
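
Point 1 is a counting argument, and a few lines are enough to sketch it. The helper names are made up for illustration; the only assumption is that a lossless compressor must map distinct inputs to distinct outputs.

```python
def count_inputs(n_bits):
    """Number of distinct inputs that are exactly n_bits long."""
    return 2 ** n_bits

def count_shorter_outputs(n_bits):
    """Number of distinct bit strings strictly shorter than n_bits (lengths 0..n_bits-1)."""
    return sum(2 ** k for k in range(n_bits))   # equals 2**n_bits - 1

def max_fraction_compressible(n_bits, factor):
    """Upper bound on the share of n_bits-long inputs that can shrink by `factor`."""
    limit = n_bits // factor                    # longest output that still counts
    outputs = 2 ** (limit + 1) - 1              # all strings of length 0..limit
    return outputs / 2 ** n_bits

# There is always one input too many for every file to get shorter.
print(count_inputs(8), count_shorter_outputs(8))

# For 100-bit inputs and a 25x factor, only 31 outputs exist for 2**100 inputs.
print(max_fraction_compressible(100, 25))
```

So a fixed 25x ratio is available to a vanishingly small fraction of all possible inputs, whatever the algorithm.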
I've always imagined this conversation by jfengel · 2006-04-05 08:43 · Score: 5, Funny

Developers: We've got some really good ideas for reducing backup space by using compression and incremental backups.

Marketing: How much in the best conceivable case?

Developers: Oh, I dunno, maybe 25x.

Marketing: 25x? Is that good?

Developers: Yeah, I suppose, but the cool stuff is...

Marketing: Wow! 25x! That's a really big number!

Developers: Actually, please don't quote me on that. They'll make fun of me on Slashdot if you do. Promise me.

Marketing: We promise.

Developers: Thanks. Now, let me show you where the good stuff is...

Marketing (on phone): Larry? It's me. How big can you print me up a poster that says "25x"?
This definitely works by All+Names+Have+Been · 2006-04-05 08:45 · Score: 5, Funny

I can tell you, this technology definitely works. I've seen them compress random data streams to 1/25th (even 1/30th!!) their size. This works *TODAY*. Coming out real soon now is the software that allows you to decompress your data. This is still in development.
Visit the Diligent WebSite and learn.... by sherpajohn · 2006-04-05 08:54 · Score: 4, Informative

....I mean jeez. They are not in the file compression business, they are in the "data protection" business. Specifically disk based backup. They make NO claim regarding "data compression" - the 25X claim is explicitly in regards to the disk space required to backup data. What they say is that using their solution can lead to a 25x less disk space requirement for backups. It may involve some new compression algorithms, but appears to be more based on never backing up the same data more than once.

--

Going on means going far
Going far means returning
Re:Breaking news! by Austerity+Empowers · 2006-04-05 08:54 · Score: 5, Interesting

His point is that the Shannon limit provides a mathematical upper bound for how good a lossless compression algorithm can be for arbitrary data sets. gzip gets 98% of that maximum bound, so any algorithm that claims to be 12x that is either not lossless, or not generic. Gzip etc. are all based on several related algorithms known generally as "entropy coders" (http://en.wikipedia.org/wiki/Entropy_coding).

Lossy compression and compression of particular data sets do not have to obey this. With lossy compression you can compress down as far as you can tolerate.

Coding particular sets gets some extra compression by coding some of the data in the compress/decompress utility. For example if all your files have a 1MB standard header and 1KB of data, you can omit the 1MB of header because it's always there, and just send the 1KB of data! Truly amazing compression! Of course it only works under those conditions.
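
The standard-header trick can be sketched with zlib's preset-dictionary support. The header is scaled down from 1MB to 16 KiB here because zlib only looks back 32 KiB, and the pseudo-random header bytes and the payload are made-up stand-ins.

```python
import random
import zlib

# Hypothetical shared header: 16 KiB of incompressible bytes that every file starts with.
rng = random.Random(0)
header = bytes(rng.getrandbits(8) for _ in range(16 * 1024))
payload = b"customer=42;balance=17.50\n"
record = header + payload

# Generic compression has to carry the whole header in its output.
plain = zlib.compress(record, 9)

# With the header baked into both ends as a preset dictionary, only references to it are sent.
comp = zlib.compressobj(level=9, zdict=header)
with_dict = comp.compress(record) + comp.flush()

decomp = zlib.decompressobj(zdict=header)
restored = decomp.decompress(with_dict) + decomp.flush()

print(len(record), len(plain), len(with_dict))
```

The saving is real, but the header has only moved: it now lives in the compress/decompress utility, and data without that header gains nothing.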
TFA by pcosta · 2006-04-05 09:01 · Score: 4, Insightful

If everybody stopped laughing and actually RTFA, they aren't claiming 25x compression on anything. The algorithm is targeted at data backup, i.e. very large files, and works by comparing incoming data patterns to patterns already stored. Looks like a modification of LZH that uses the compressed file as the pattern table. I'm not saying that it works or that it is a breakthrough, but they are not claiming impossible lossless compression on anything. It might actually be interesting for the application it was designed for.
MOD PARENT DOWN by gEvil+(beta) · 2006-04-05 09:35 · Score: 4, Funny

Mod parent down! Nobody needs to see goatse again...

--
This guy's the limit!
Well that's not surprising. by Ayanami+Rei · 2006-04-05 12:15 · Score: 5, Informative

That's called the law of large numbers.
Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
Given a large enough set of backups and enough time, the potential size savings is enormous.

Veritas should really be implementing this themselves, though.

And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.
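
A small sketch of the rolling-checksum idea behind rsync-style binary deltas. It is not rsync's actual code: the 16-byte block size, the modulus, SHA-256 as the strong hash, and all function names are illustrative assumptions.

```python
import hashlib

M = 1 << 16  # modulus for the weak checksum (illustrative choice)

def weak_checksum(block):
    """Two running sums over the block, in the style of rsync's weak checksum."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def strong(block):
    return hashlib.sha256(block).digest()

def make_delta(old, new, n=16):
    """Describe `new` as copies of aligned n-byte blocks of `old` plus literal bytes."""
    index = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        index.setdefault(weak_checksum(block), []).append((strong(block), off))

    ops, literal, i = [], bytearray(), 0
    a, b = weak_checksum(new[:n])
    while i + n <= len(new):
        match = None
        if (a, b) in index:                      # cheap test first, strong hash to confirm
            digest = strong(new[i:i + n])
            match = next((off for s, off in index[(a, b)] if s == digest), None)
        if match is not None:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match, n))
            i += n
            a, b = weak_checksum(new[i:i + n])
        else:
            literal.append(new[i])
            if i + n < len(new):
                a, b = roll(a, b, new[i], new[i + n], n)
            i += 1
    literal += new[i:]
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops

def apply_delta(old, ops):
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

# Hypothetical backup: yesterday's file, and today's with 8 bytes inserted in the middle.
old = b"".join(i.to_bytes(2, "big") for i in range(2048))
new = old[:1000] + b"INSERTED" + old[1000:]
delta = make_delta(old, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "literal")
print(len(new), literal_bytes)
```

A whole-file comparison would store all of `new` again; the delta stores a couple of dozen literal bytes plus a list of block references, which is where the large savings across many backup generations would come from.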

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Actually, I once tried that. by MickLinux · 2006-04-05 12:28 · Score: 5, Interesting

I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.

So then I tried it with LZW compression, and it still eventually grew in size.

The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.

What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.

Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.

So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.

Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.

And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.
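
The experiment is easy to repeat as a sketch, here with `zlib` (LZ77 plus Huffman coding) rather than the plain Huffman and LZW coders described above. The word list, the input size, and the five rounds are arbitrary assumptions; `byte_entropy` is a helper defined here, measuring the Shannon entropy of the byte histogram in bits per byte.

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of the byte histogram, in bits per byte (8.0 is a flat histogram)."""
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())

# Hypothetical input: 50,000 words drawn at random from a tiny vocabulary.
rng = random.Random(0)
words = ["backup", "delta", "storage", "compress", "entropy", "linux", "disk", "stream"]
data = " ".join(rng.choice(words) for _ in range(50000)).encode("ascii")

sizes, entropies = [len(data)], [byte_entropy(data)]
for _ in range(5):                       # feed each output back in as the next input
    data = zlib.compress(data, 9)
    sizes.append(len(data))
    entropies.append(byte_entropy(data))

for round_no, (size, h) in enumerate(zip(sizes, entropies)):
    print(round_no, size, round(h, 3))
```

The first round does nearly all the work; after that the histogram is already close to flat and further rounds have almost nothing left to remove.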

--
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's