New 25x Data Compression?
modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.
I can create a compression algorithm that compresses my 2GB of data to 1 bit. But it would be crap for any other datastream fed to it.
tasks(723) drafts(105) languages(484) examples(29106)
Company breaks Shannon Limit. Debunking at 11!
Seriously though. Gzip can compress down to 98%... if your data is mostly redundant. The chance that they're doing this on the random data they claim in the article is nil.
*sniff* *sniff* *sniff*
... vapor.
I smell
"An unarmed man can only flee from evil, and evil is not overcome by fleeing from it." Col. Jeff Cooper
Yes, it can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressable due to the large amount of redundant data.
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.
It's true! It compressed my 10GB collection of ASCII PR0N into 1 meg!
Number of companies claiming a breakthrough in compression technology since the release of bzip2: too many to count.
Number of them which were anything other than complete bullshit: 0
I'm not holding my breath.
News for Nerds. Stuff that Matters? Like hell.
dd if=/dev/zero bs=1m count=1m | lzop - | gzip -f -| gzip -f - | gzip -f - | wc
gives about three kilobytes for a terabyte of data.
☠
The summary should have read...
StorageMojo is reporting that a company named Practical Nano Cold Fusion Duke Nukem Forever at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital...
I'm a big tall mofo.
Stuff like new compression algorithms generally comes out in academic papers, which are then applied in practice by regular programmers. That's what happened with the Burrows-Wheeler algorithm at the core of bzip2. Some company concerned with mostly implementation rather than theory wouldn't come up with a revolutionary advance. The writeup is very vague, but it sounds to me like they're just using a simple LZ type algorithm, and they're only claiming 25x compression if the data is mostly the same already. Well duh.
Posted by ScuttleMonkey on Wed Apr 05, '06 03:23 PM
from the make-sure-to-give-it-to-more-than-just-the- corporate-monkies dept.
You would think that an editor called Scuttle Monkey would know that the correct plural of "Monkey" is "Monkeys", not "Monkies".
"Monkies" would be the plural of "Monkie", which I guess is what you'd call a baby Monk Seal, or if you knew him really well, a resident of a Monastery. "Hey, Monkie, nice robe!"
Of course, if you were talking to Michael Nesmith, the singular form would be "Monkee". But that's neither here nor there.
Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
If they can't compress the canterbury corpus or calgary corpus beyond 3X, then it's a SCAM.
1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.
2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.
3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.
Mmmmmmh, salt.
Yup, let me just add to others saying that 25x compression is impossible for arbitrary data. It's just an indexing problem, if you have a 2 kbyte files (2^12288 possible permutation) it is impossible to map all to the (2000/25=) 82 byte files (2^656 possible permutations). Good thing the article talks about what data this applies to...(sarcasm)
A cow-orker asked if it could be used on its own ouput.
Lacking <sarcasm> tags,
Developers: We've got some really good ideas for reducing backup space by using compression and incremental backups.
Marketing: How much in the best conceivable case?
Developers: Oh, I dunno, maybe 25x.
Marketing: 25x? Is that good?
Developers: Yeah, I suppose, but the cool stuff is...
Marketing: Wow! 25x! That's a really big number!
Developers: Actually, please don't quote me on that. They'll make fun of me on Slashdot if you do. Promise me.
Marketing: We promise.
Developers: Thanks. Now, let me show you where the good stuff is...
Marketing (on phone): Larry? It's me. How big can you print me up a poster that says "25x"?
I can tell you, this technology definitely works. I've seen them compress random data streams to 1/25th (even 1/30th!!) their size. This works *TODAY*. Coming out real soon now is the software that allows you to decompress your data. This is still in development.
....I mean jeez. They are not in the file compression business, they are in the "data protection" business. Specifically disk based backup. They make NO cliam regarding "data compression" - the 25X claim is explicitly in regards to the disk space required to backup data. What they say is that using their solution can lead to a 25x less disk space requirement for backups. It may involve some new compression algorithms, but appears to be more based on never backing up the same data more than once.
Going on means going far
Going far means returning
This is completely false. There are fundamental mathematical limits to the amount you can compress data in a lossless format. In fact, each compression format ususally has overhead on the file to store the mapping data to decode/decompress it. That overhead+the compressed file is usually less than the original file, until you run the compressor once or twice. Then the file doesn't compress at all, and the compression record overhead actually increases the overall file size.
Hey, I'm just your average shit and piss factory.
If everybody stopped laughing and actually RTFA, they aren't claiming 25x compression on anything. The algorithm is targeted at data backup, i.e. very large files and works by comparing incoming data patterns to patterns already stored. Looks like a modification of LZH that uses the compressed file as the pattern table. I'm not saying that it works or that is a breakthrough, but they are not claiming impossible lossless compression on anything. It might actually be interesting for the application it was designed for.
If you're wondering why this is pure bullshit, this might help.
Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.
Take two different messages. Compress each. When you decompress them, you have to get two different messages back, right? So you need two different messages in compressed form. If your compressed message uses the same symbolic representation as the uncompressed message--and, since we're talking ones and zeros here with computers, that's exactly the case--then it should quickly be apparent that, for any given length message, there're so many possible permutations of symbols to create a message...and you need exactly that same number of permutations in compressed form to be able to re-create any possible message.
Compression is handy because we tend to restrict ourselves to a tiny subset of the possible number of messages. If you have a huge library but only ever touch a small handful of books, you only need to carry around the first drawer of the first card cabinet. You can even pretend that the other umpteen hundred drawers don't even exist.
It's the same with text. You only need six bytes to store most of the frequently-used characters in text, but we sometimes use a lot more than just the standard characters so they get written on disk using eight bytes each. English doesn't even use every permutation of two-letter words, let alone twenty-letter ones, so there's a lot of wasted space there. You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text will make that kind of a look-up table optimized for written English at the expense of other kinds of data. ``The'' would be in the first drawer of the cabinet, but ``uyazxavzfnnzranghrrt'' wouldn't be listed at all. If you actually wrote ``uyazxavzfnnzranghrrt'' in your document, the compression algorithm would fall back to storing it in its uncompressed form.
Also, don't overlook the overhead of the data of the algorithm itself. If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good. It's sometimes helpful to think of it in terms of the smallest self-contained program that could create the desired output. An infinite number of threes is easy; just divide 1 by three. Pi is a bit more complex, but only just. The complete works of Shakespeare is going to have a lot more overhead for a pretty short message. And ``uyazxavzfnnzranghrrt'' might even have so much overhead for such a short message that ``compression'' just makes it bigger.
Cheers,
b&
All but God can prove this sentence true.
This seems good, otherwise Google for "ows compression OR compress OR compressor", and according to this, OWS stands for the author's initials.
- Move "Sig". For great justice!
Mod parent down! Nobody needs to see goatse again...
This guy's the limit!
The article talks about backup. The idea could be, that instead of managing incremental backups you just optimize compression of data that is similar to old data. In that way you can do "full" backups, but actually save only incremental backup worth of data.
See http://en.wikipedia.org/wiki/Venti for similar ideas in a system that easily achives 25x compression for typical archival storage. When a file has been changed only those 512 kbyte blocks that are really new are saved, other blocks are just mapped by their SHA1 hashes to existing blocks. So files with small changes, very similar files and files sharing common parts will all compress very nicely. In a multi-user system the files of different users tend to also have lots of similar parts: same emails, same office documents with perhaps minor changes, same reference material / tools / libraries as personal copies etc.
My guess is TFA refers to a re-invention of this wheel, most likely in an inferior way.
Anssi Porttikivi / app@iki.fi
This is entirely possible and they are not the only ones doing it, for example http://www.datadomain.com/ has been doing it for a while. The big storage vendors do it to some extent as well.
The idea is based on "de-duplication" of data and is only really practical for backups (where most data from backup to backup is identical) or central repositories of data for a large organization that has multiple similar data sets, for example, many installations of Windows that are often similar.
From my experience x25 is a bold claim for general data. I've seen small scale tests that showed x30 compression over backup sets but those implementations had performance issues.
From the description in their white-paper, despite their claims, it appears they are performing some kind of hash by definition (e.g. mapping a space to a smaller space).
That's called the law of large numbers.
Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
Given a large enough set of backups and enough time, the potential size savings is enormous.
Veritas should really be implementing this themselves, though.
And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.
So then I tried it with LZW compression, and it still eventually grew in size.
The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.
What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.
Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.
So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.
Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.
And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Guess what? It is IMPOSSIBLE to create a generic compression algorithm. Gzip operates by doing exactly what you mention: operating on a particular set of data: that being data with some exploitable redundancy. There are plenty of files that will get bigger when you give them to Gzip.
Entropy coders work by making assumptions about the probability distribution of the data they recieve. They assume they are working on a set of data in which certain types of data are more likely than others, so they store those more compactly, but as a result they HAVE to store others less compactly. No matter how you slice it, you can not store more than 2^n unique strings in n bits. The only gains you can make are by assuming that you aren't going to be dealing with all possible strings, and compacting the ones that you care about.
That may have actually been what you meant, but I really didn't want anyone reading that to get the impression that there was something magical about entropy that made it a different approach than narrowing the set of data you are storing. The two are fundamentally the same thing.
Although its not for every file, some times, this can be a huge win. In my case, backing up 60 versions of a 700kb XML file, I get 500:1 compression, 30 times better than what bzip2 gives me. Anytime you have a file where you know that it will have redundancy across more than 900kb, but less than 900mb, rzip can win big.
It sounds that this company's program is a variation of this idea, designed with backups in mind and identify redundancy across tens or hundreds of gigabytes.
News media around the world carried the "news" of the Raelians cloning a little girl. The vast majority of intelligent people knew it was crap, most average peopel assumed it was crap, the news media all said to take it with a grain of salt and that they could secure no no proof. News is news, whether it's news of a real advance, or news that a potentially reliable source is making astounding claims. Only through the analysis of these claims can knowledge grow.
jX [ Make everything as simple as possible, but no simpler. - Einstein ]
Sheesh...when did you last get laid?
Obligatory Soundbite Catchphrase
This is a back up system, not a single file compression (although for framed data like video, email, etc.. the compression scheme is still clever).
Basically it's a CVS, if your backing up multiple computers, or user directories your going to see tons of repeate files, heck they'll even be the same name. Saving the diffs is a good idea. And not at all dificult to duplicate.
For instance what if you were doing back up for a team of animators. Their files are HUGE, but 90% of the frames will be identical between the individual systems. (indeed the frames between one another will likely be very similar) You could get far more than 25x compression that way. The big downside of this idea is the memmory & CPU vs Speed trade off. You can't use this kind of system to back up to a tape or DVD system, it needs to be random access media.
You could probably get nearly the same results by hacking rsync and diffing identical file names in different directories. Possible bonus for diffing files of similar file type.
It's a clever idea, not a radical new technology.
I would rather be ashes than dust!
This is my personal White Paper on lossless compression. Note this is no joke thread like that newb who posted he can get his to 1 bit. I affect random binary data. It achieves approximately, per cycle, a 81% remaining size of the origional file. I theorize the end limit to the size is in a range of 10 bytes to 10 kb. It will be different for every file type. Note this is an EXCEL filed that has been RARed. It was to big to upload normally. It is MEMORY intensive. I would prefer not to do it in excel except I lack the proper software to replace that crappy program. Here is the link to my website: http://www.security1.free2host.net/Compress.php
I am ready for the big jump in life, who will jump with me?