New 25x Data Compression?

← Back to Stories (view on slashdot.org)

Posted by ryuzaki0 on Wednesday April 5, 2006 @08:23AM from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.

13 of 438 comments (clear)

Min score:

Reason:

Sort:

Heard this before by Jordan+Catalano · 2006-04-05 08:27 · Score: 5, Interesting

Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.
1. Re:Heard this before by Orgasmatron · 2006-04-05 11:09 · Score: 4, Interesting
  
  Yup, that was OWS. You actually could delete the original file, but once it got overwritten, or if it wasn't available, you couldn't deOWS it any more.
  
  Back in the day, I figured out what was going on when I took a disk to another machine, couldn't restore the file. I then tested the disk in the machine I had made the archive on, and it worked fine. It was a good hoax. We all got a good laugh out of it.
  
  --
  See that "Preview" button?
right. sure. by Doktor+Memory · 2006-04-05 08:30 · Score: 2, Interesting

Number of companies claiming a breakthrough in compression technology since the release of bzip2: too many to count.

Number of them which were anything other than complete bullshit: 0

I'm not holding my breath.

--
News for Nerds. Stuff that Matters? Like hell.
hmmn by Bizzeh · 2006-04-05 08:30 · Score: 1, Interesting

doesnt colinux have 2kb compessed files that open up to around 10gb? since they are just all null files. also, such a compression where your doing so much is gonna eat into time and cpu usage, and if 1 thing goes wrong in any of it, you loose all that data.

--
portfolio
sounds like a O(n^n^n) problem. by Ancient_Hacker · 2006-04-05 08:33 · Score: 4, Interesting
Couple "issues":
- The cost of disk space versus the cost in computer time in finding all the matching substrings. Disk space gets bigger a whole lot faster and easier than CPUs speed up, so even if this idea is economically feasible today, it can only get worse from here.
- This scheme may work just swell with some data streams, but probably pathologically awful with others. A good example: a billion empty records in a database might be compressed to a very few bytes. The system operator relaxes, and lets a log file fill up the rest of the disk. Then a bunch of database records need to be added, or the existing records need some sequential numbering added and guess what? There's no space for the new records, or to expand the existing ones. Argh.
Calgary / Canterbury corpus? by Spy+der+Mann · 2006-04-05 08:36 · Score: 3, Interesting

If they can't compress the canterbury corpus or calgary corpus beyond 3X, then it's a SCAM.
Re:Breaking news! by Austerity+Empowers · 2006-04-05 08:54 · Score: 5, Interesting

His point is that the Shannon limit provides a mathematical upper bound for how good a lossless compression algorithm can be for arbitrary data sets. gzip gets 98% of that maximum bound, so any algorithm that claims to be 12x that is either not lossless, or not generic. Gzip etc. are all based on several related algorithms known generally as "entropy coders" (http://en.wikipedia.org/wiki/Entropy_coding).

Lossy compression and compression of particular data sets do not have to obey this. With lossy compression you can compress down as far as you can tolerate.

Coding particular sets gets some extra compression by coding some of the data in the compress/decompress utility. For example if all your files have a 1MB standard header and 1KB of data, you can omit the 1MB of header because it's always there, and just send the 1KB of data! Truly amazing compression! Of course it only works under those conditions.
Re: proposed compression method by Anonymous Coward · 2006-04-05 09:47 · Score: 0, Interesting
I propose the following algorithm:

Compression:
1. Search pi until you find your data. (*)
2. Record the length and offset of the match. (**)
3. Prophet.
Decompression:
1. Use the base-16 digit extraction method to recover the stored data.
2. Profit.
Obvious variants:
- Split the data into chunks and encode a list of offsets (i.e. trade off encoding time for compression ratio).
- Search the compression stream (PI) and the data for compression ratios of a preset threshold (e.g. >=1000x); optionally attempting to span the gaps created by previously compressed data (e.g. if you compressd "cdef" in "abcdefghi", then you might try to match subsets of "abghi" next) -- it just requires a slightly more complex encoding format.
* = You'll need step 3 to find step 1 in a reasonable amount of time.
** = It would be fscking hilarious if someone were to prove that you can't always find a given chunk of data within {original-size} bytes of pi, so the offset might be bigger than the original data, and this algorithm wouldn't even be guaranteed to actually compress your data.

p.s. If I catch anyone trying to patent this, I'll refer the patent office to this post as prior art and also reveal the value that yields this hash: cb775b9b061b03e8666819ede2181d2e. Anyone that cracks this will get a chuckle. ;-)
Re:Breaking news! by evilviper · 2006-04-05 11:02 · Score: 2, Interesting

If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?

Despite the obvious answer (he's simply wrong), 7zip is somewhat "cheating" in this 3-way comparison, as it uses a much, much, much larger block-size (memory). You can set it to use hundreds of MBs of RAM, whereas gzip and bzip2 are both limited to 9KB max.
.

Off-topic Rant:
I was actually quite impressed with 7zip and it's lzma/ppmd compression methods when I first saw it compressing better than bzip2. However, once the novelty wore off, I began to realize it just takes far too-much memory. There is no possible chance of using them on an embedded system, a handheld computer, or even just a fairly old PC with less-than around 64MBs of RAM (or much higher, depending on requested block-size). It also takes a serious ammount of extra time over gzip/bzip2, while being only a trivial compression improvement in the large majority of cases. The exceptional cases are... neat... but they don't make LZMA/PPMd practical for normal use.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:What kind of data? by smerkel · 2006-04-05 11:10 · Score: 2, Interesting

About a year ago, we went through an evaluation of different data protection technologies to replace a tape based platform we were using at the time. I wanted to get away from tape if I could - I simply had a problem buying 5x more media then data I was protecting.
I came across a company called Avamar (http://www.avamar.com/). They do something similar, except they deduplicate the data on the client side, before it ever traverses the network. Needless to say, I was a bit skeptical with their claims. I was able to con them into letting me eval the platform for 3 months. As it turns out - it works as advertised.
I was able to consolidate 2 large tape libraries (L700's) into 8 x Dell 2850's - all running Redhat Enterprise, all with 6 x 300GB SCSI drives in them. We are currently protecting 7TB of data (Note: The hardware is currently 58% utilized). We process 20,000 backup jobs a month. And as an example, pulled directly from last night's activity log, we performed a full backup of a Windows box with 100GB of data on it in less than 15 minutes. (Note: The science behind this is very practical.
Actually, I once tried that. by MickLinux · 2006-04-05 12:28 · Score: 5, Interesting

I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.

So then I tried it with LZW compression, and it still eventually grew in size.

The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.

What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.

Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.

So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.

Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.

And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.

--
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Linux has something related.. RZIP by Convergence · 2006-04-05 14:10 · Score: 2, Interesting

Most compression programs uses a very limited context. gzip cannot identify and exploit redundancy if it occurs more than 32kb or 64kb apart. bzip2 uses a blocksize of 900kb, and it too cannot identify redundancy more than 900kb apart. rzip however uses a context of 900MB, so it can exploit redundancy within a file, even if it occurs hundreds of megabytes apart.
Although its not for every file, some times, this can be a huge win. In my case, backing up 60 versions of a 700kb XML file, I get 500:1 compression, 30 times better than what bzip2 gives me. Anytime you have a file where you know that it will have redundancy across more than 900kb, but less than 900mb, rzip can win big.
It sounds that this company's program is a variation of this idea, designed with backups in mind and identify redundancy across tens or hundreds of gigabytes.
Re:What kind of data? by forgetful_ca · 2006-04-05 14:28 · Score: 2, Interesting

I believe he's going to say something like % lost to overhead like the file name, filesize, index, etc...
but that would be wrong. Anything compressed by a factor of 1 wouldn't need those things. You would just spit out the original file again with no changes whatsoever. In fact, I have already encoded such a compression algorithym. (and have patented the process. oh, and um, copyrighted it. and stuff.)

cat my1file > my2file

ta da!