New 25x Data Compression?
modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.
Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.
Number of companies claiming a breakthrough in compression technology since the release of bzip2: too many to count.
Number of them which were anything other than complete bullshit: 0
I'm not holding my breath.
News for Nerds. Stuff that Matters? Like hell.
doesnt colinux have 2kb compessed files that open up to around 10gb? since they are just all null files. also, such a compression where your doing so much is gonna eat into time and cpu usage, and if 1 thing goes wrong in any of it, you loose all that data.
portfolio
If they can't compress the canterbury corpus or calgary corpus beyond 3X, then it's a SCAM.
His point is that the Shannon limit provides a mathematical upper bound for how good a lossless compression algorithm can be for arbitrary data sets. gzip gets 98% of that maximum bound, so any algorithm that claims to be 12x that is either not lossless, or not generic. Gzip etc. are all based on several related algorithms known generally as "entropy coders" (http://en.wikipedia.org/wiki/Entropy_coding).
Lossy compression and compression of particular data sets do not have to obey this. With lossy compression you can compress down as far as you can tolerate.
Coding particular sets gets some extra compression by coding some of the data in the compress/decompress utility. For example if all your files have a 1MB standard header and 1KB of data, you can omit the 1MB of header because it's always there, and just send the 1KB of data! Truly amazing compression! Of course it only works under those conditions.
Compression:
Decompression:
Obvious variants:
* = You'll need step 3 to find step 1 in a reasonable amount of time.
** = It would be fscking hilarious if someone were to prove that you can't always find a given chunk of data within {original-size} bytes of pi, so the offset might be bigger than the original data, and this algorithm wouldn't even be guaranteed to actually compress your data.
p.s. If I catch anyone trying to patent this, I'll refer the patent office to this post as prior art and also reveal the value that yields this hash: cb775b9b061b03e8666819ede2181d2e. Anyone that cracks this will get a chuckle.
Despite the obvious answer (he's simply wrong), 7zip is somewhat "cheating" in this 3-way comparison, as it uses a much, much, much larger block-size (memory). You can set it to use hundreds of MBs of RAM, whereas gzip and bzip2 are both limited to 9KB max.
.
Off-topic Rant:
I was actually quite impressed with 7zip and it's lzma/ppmd compression methods when I first saw it compressing better than bzip2. However, once the novelty wore off, I began to realize it just takes far too-much memory. There is no possible chance of using them on an embedded system, a handheld computer, or even just a fairly old PC with less-than around 64MBs of RAM (or much higher, depending on requested block-size). It also takes a serious ammount of extra time over gzip/bzip2, while being only a trivial compression improvement in the large majority of cases. The exceptional cases are... neat... but they don't make LZMA/PPMd practical for normal use.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
I came across a company called Avamar (http://www.avamar.com/). They do something similar, except they deduplicate the data on the client side, before it ever traverses the network. Needless to say, I was a bit skeptical with their claims. I was able to con them into letting me eval the platform for 3 months. As it turns out - it works as advertised.
I was able to consolidate 2 large tape libraries (L700's) into 8 x Dell 2850's - all running Redhat Enterprise, all with 6 x 300GB SCSI drives in them. We are currently protecting 7TB of data (Note: The hardware is currently 58% utilized). We process 20,000 backup jobs a month. And as an example, pulled directly from last night's activity log, we performed a full backup of a Windows box with 100GB of data on it in less than 15 minutes. (Note: The science behind this is very practical.
I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.
So then I tried it with LZW compression, and it still eventually grew in size.
The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.
What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.
Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.
So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.
Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.
And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Although its not for every file, some times, this can be a huge win. In my case, backing up 60 versions of a 700kb XML file, I get 500:1 compression, 30 times better than what bzip2 gives me. Anytime you have a file where you know that it will have redundancy across more than 900kb, but less than 900mb, rzip can win big.
It sounds that this company's program is a variation of this idea, designed with backups in mind and identify redundancy across tens or hundreds of gigabytes.
I believe he's going to say something like % lost to overhead like the file name, filesize, index, etc...
but that would be wrong. Anything compressed by a factor of 1 wouldn't need those things. You would just spit out the original file again with no changes whatsoever. In fact, I have already encoded such a compression algorithym. (and have patented the process. oh, and um, copyrighted it. and stuff.)
cat my1file > my2file
ta da!