New 25x Data Compression?
modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.
I can create a compression algorithm that compresses my 2GB of data to 1 bit. But it would be crap for any other datastream fed to it.
Company breaks Shannon Limit. Debunking at 11!
Seriously though. Gzip can shrink a file by 98%... if your data is mostly redundant. The chance that they're doing this on the random data they claim in the article is nil.
I smell
*sniff* *sniff* *sniff*
... vapor.
"An unarmed man can only flee from evil, and evil is not overcome by fleeing from it." Col. Jeff Cooper
Yes, it can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.
Wow, imagine the Beowulf cluster that WON'T be needed to store this!
Any technology distinguishable from magic is insufficiently advanced.
It's true! It compressed my 10GB collection of ASCII PR0N into 1 meg!
Or at least with 1/25th of a grain of salt.
An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
Number of companies claiming a breakthrough in compression technology since the release of bzip2: too many to count.
Number of them which were anything other than complete bullshit: 0
I'm not holding my breath.
News for Nerds. Stuff that Matters? Like hell.
Doesn't coLinux have 2 KB compressed files that open up to around 10 GB, since they are just all null files? Also, a compression scheme that does this much work is going to eat into time and CPU usage, and if one thing goes wrong in any of it, you lose all that data.
portfolio
They say it will work on anything? Sorry, I don't think so. I can take 2 gigs of straight 0's and compress it into a file with a table, and it would only be maybe a kilobyte in size. But, given technology and greed today, I doubt we're breaking the Shannon Limit anytime soon.
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
dd if=/dev/zero bs=1m count=1m | lzop - | gzip -f -| gzip -f - | gzip -f - | wc
gives about three kilobytes for a terabyte of data.
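For anyone who wants to see the same effect without pushing a terabyte through a pipe, here is a rough Python sketch. It uses the standard zlib module on one megabyte of zeros and one megabyte of random bytes; the sizes and the helper name `ratio` are just illustrative choices, not anything from the article.

```python
import os
import zlib

def ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))

zeros = bytes(1_000_000)        # one megabyte of 0x00 bytes
noise = os.urandom(1_000_000)   # one megabyte of random bytes

print(f"zeros: {ratio(zeros):.0f}:1")   # very large ratio
print(f"noise: {ratio(noise):.3f}:1")   # about 1:1, no real gain
```

The zeros shrink enormously and the random bytes do not shrink at all, which is the whole point: the ratio says more about the input than about the compressor.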
☠
Don't really mind them presenting this; with a little luck they may even get funded. I recall a case in Holland where our "Internet guru" Roel Pieper invested massively in a compression patent that would supposedly allow a movie the size of Star Wars Ep. 3 to be compressed onto a single 1.44 MB floppy disk. Of course this played out a few years ago, and strangely enough the algorithm never surfaced. Mr. Pieper still believes in the idea.
I'm not claiming that this story is bull, I'm only saying that they're absolutely right to present it at this stage.
25 times what? A 25th of the original file? Does it matter if it's already compressed or is it the same on anything? How does bzip stack up on a text file, yo?
The summary should have read...
StorageMojo is reporting that a company named Practical Nano Cold Fusion Duke Nukem Forever at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital...
I'm a big tall mofo.
Stuff like new compression algorithms generally comes out in academic papers, which are then applied in practice by regular programmers. That's what happened with the Burrows-Wheeler algorithm at the core of bzip2. Some company concerned with mostly implementation rather than theory wouldn't come up with a revolutionary advance. The writeup is very vague, but it sounds to me like they're just using a simple LZ type algorithm, and they're only claiming 25x compression if the data is mostly the same already. Well duh.
compress that :)
The Raven
Yeah, I've worked with this before. It's just lossy data compression. It eliminates the data you won't care about a couple of weeks from now. If you used it on a collection of your boss's memos, for example, the compression ratio is 100%. Kind of like how lossy audio compression eliminates the parts your ear doesn't care about.
Posted by ScuttleMonkey on Wed Apr 05, '06 03:23 PM
from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.
You would think that an editor called Scuttle Monkey would know that the correct plural of "Monkey" is "Monkeys", not "Monkies".
"Monkies" would be the plural of "Monkie", which I guess is what you'd call a baby Monk Seal, or if you knew him really well, a resident of a Monastery. "Hey, Monkie, nice robe!"
Of course, if you were talking to Michael Nesmith, the singular form would be "Monkee". But that's neither here nor there.
Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
Seriously. I hear that they are going to use it with Duke Nukem Forever to fit all the map and texture data onto only 22 DVDs.
Obviously nothing concrete or released yet so take with the requisite grain of salt.
Actually, I'd say take the news of this "breakthrough" with a Salt Lick.
I hope it's true, but I'm not holding my breath.
If "disco" means "I learn" in Latin, does "discothèque" mean "I learn technology"?
If they can't compress the Canterbury Corpus or Calgary Corpus beyond 3X, then it's a SCAM.
Sounds like an application where you want some speed, which kind of rules out PPM*/PPMd + arithmetic coding -- which is among the best general compressors we have today. As if it needed another good debunking.
1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.
2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.
3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.
Mmmmmmh, salt.
Did they even get 1.0 working yet?
It sounds like the backup volume in this system is essentially a .zip file that you keep stuffing data into. If some copy of the data you're stuffing in is already there, you don't need to store it again. 25x is believable if you're backing up the same data over and over again, I guess.
I can take 2 gigs of straight 0's and compress it into a file with a table, and it would only be maybe a kilobyte in size.
Without putting much thought into it, I can even do that. 2 gigs of straight 0's with a real-world algorithm pretty easily compresses down to 12 bytes, far fewer than the kilobyte you quote. You could store it in just: 2000000000x0
Use an abbreviation for 2 billion or other byte-saving tricks and you could compress it down even more.
I suspect such smoke and mirrors is something similar to what this company has done to achieve their reported compression results.
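The "2000000000x0" trick above is run-length encoding. A minimal Python sketch of the idea follows; the function names are made up for illustration and the (count, value) pair format is just one possible representation.

```python
from itertools import groupby

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of identical bytes into a (count, value) pair."""
    return [(sum(1 for _ in run), value) for value, run in groupby(data)]

def rle_decode(pairs: list[tuple[int, int]]) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)

zeros = bytes(2_000_000)
print(rle_encode(zeros))             # [(2000000, 0)]

mixed = bytes(range(256))            # no two neighbouring bytes are equal
print(len(rle_encode(mixed)))        # 256 pairs, so the "compressed" form is bigger
```

One pair describes two million zeros, but input without runs comes out larger than it went in.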
I'm a big tall mofo.
de-duplication and calculating and storing only the changes between similar byte streams is apparently the key
Maybe you want to take a gander at RLE
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
Yup, let me just add to others saying that 25x compression is impossible for arbitrary data. It's just an indexing problem: there are 2^16384 possible 2-kbyte files, and it is impossible to map them all onto (2048/25 ≈) 82-byte files, of which there are only 2^656. Good thing the article talks about what data this applies to...(sarcasm)
A cow-orker asked if it could be used on its own output.
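The counting argument can be checked directly with Python's arbitrary-precision integers. This is only a sketch of the pigeonhole count, taking "2 kbyte" as 2048 bytes and generously allowing every output length from 0 up to 82 bytes; `files_up_to` is a name invented here.

```python
def files_up_to(n_bytes: int) -> int:
    """Number of distinct byte strings of length 0..n_bytes inclusive."""
    return (256 ** (n_bytes + 1) - 1) // 255

inputs = 256 ** 2048          # every possible 2048-byte file
outputs = files_up_to(82)     # every possible file of 82 bytes or fewer

print(outputs < inputs)       # True: far too few outputs to go around
# Gap between the two counts, measured in bits:
print(inputs.bit_length() - outputs.bit_length())
```

Since a lossless compressor must send different inputs to different outputs, almost none of the 2048-byte files can land in the small set.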
Lacking <sarcasm> tags,
Developers: We've got some really good ideas for reducing backup space by using compression and incremental backups.
Marketing: How much in the best conceivable case?
Developers: Oh, I dunno, maybe 25x.
Marketing: 25x? Is that good?
Developers: Yeah, I suppose, but the cool stuff is...
Marketing: Wow! 25x! That's a really big number!
Developers: Actually, please don't quote me on that. They'll make fun of me on Slashdot if you do. Promise me.
Marketing: We promise.
Developers: Thanks. Now, let me show you where the good stuff is...
Marketing (on phone): Larry? It's me. How big can you print me up a poster that says "25x"?
RTFA.
of course you can do this. Look at datadomain.com.
they expect 20-80x compression because they're marketing themselves as backup to disk (doing repetitive full backups). you get the same patterns over and over again.
and whoever posted the RLE wikipedia article, thank you for understanding the solution.
and no, everything isn't going to compress 25x, but everything will compress some. There are repeated bitstreams in everything. a 64bit string has a finite number of patterns. I don't know how small they chunk it up, but it's believable.
"We are not tolerant people. We prefer drastically effective solutions"
Further, since the software operates on byte-streams, it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.
It can compress anything!1111 Even already compressed mp3s and encrypted data, both of which have a high degree of data entropy, and are essentially incompressible!
Magical compression for everyone!!
I can tell you, this technology definitely works. I've seen them compress random data streams to 1/25th (even 1/30th!!) their size. This works *TODAY*. Coming out real soon now is the software that allows you to decompress your data. This is still in development.
I'm not sure if you mean to imply that bzip2 is actually good at compression, but just in case you do: it isn't. It is slow and its compression ratios are poor. Some of the good common programs are 7zip and (win)rar. A benchmark can be found, for example, at http://www.maximumcompression.com/data/summary_mf.php.
I can't remember what the full name was, but its initials were OWS.
A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key
Yes, for that and every other compression system.
Sigh, this is nothing more than a non-redundant store. Very similar to stuff already offered by a number of vendors, even Microsoft. The "fast way to know what's already on disk" is just to store hashes of the data in an index. Move along, nothing to see here......
sigs are a waste of space
....I mean jeez. They are not in the file compression business, they are in the "data protection" business. Specifically disk based backup. They make NO claim regarding "data compression" - the 25X claim is explicitly in regards to the disk space required to backup data. What they say is that using their solution can lead to a 25x less disk space requirement for backups. It may involve some new compression algorithms, but appears to be more based on never backing up the same data more than once.
Going on means going far
Going far means returning
>The system operator relaxes, and lets a log file fill up the rest of the disk.
If your logs are on the same partition (let alone _disk_) as your database files, you deserve this kind of fate.
Sitting Walrus Blog
This is completely false. There are fundamental mathematical limits to the amount you can compress data in a lossless format. In fact, each compression format usually has overhead on the file to store the mapping data to decode/decompress it. That overhead+the compressed file is usually less than the original file, until you run the compressor once or twice. Then the file doesn't compress at all, and the compression record overhead actually increases the overall file size.
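The "run the compressor once or twice" effect is easy to try. The sketch below, assuming nothing beyond Python's standard zlib and random modules, compresses some made-up word soup three times in a row and records the size after each pass; the word list and seed are arbitrary.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["backup", "tape", "disk", "delta", "hash", "block", "file", "stream",
         "index", "server", "client", "data", "copy", "store", "restore", "log"]
text = " ".join(rng.choice(words) for _ in range(50_000)).encode()

sizes = [len(text)]
blob = text
for _ in range(3):
    blob = zlib.compress(blob, 9)   # feed the previous output back in
    sizes.append(len(blob))

for n, size in enumerate(sizes):
    print(f"after {n} passes: {size} bytes")
```

On input like this the first pass should do nearly all the work, and later passes should gain next to nothing or add a few bytes of overhead. (Extremely repetitive input, such as a stream of zeros, is the exception: there a second pass can still find patterns in the first pass's output.)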
Hey, I'm just your average shit and piss factory.
"diligent technologies": 0 patents.
If you were blocking sigs, you wouldn't have to read this.
If everybody stopped laughing and actually RTFA, they aren't claiming 25x compression on anything. The algorithm is targeted at data backup, i.e. very large files, and works by comparing incoming data patterns to patterns already stored. Looks like a modification of LZH that uses the compressed file as the pattern table. I'm not saying that it works or that it is a breakthrough, but they are not claiming impossible lossless compression on anything. It might actually be interesting for the application it was designed for.
If you're wondering why this is pure bullshit, this might help.
Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.
Take two different messages. Compress each. When you decompress them, you have to get two different messages back, right? So you need two different messages in compressed form. If your compressed message uses the same symbolic representation as the uncompressed message--and, since we're talking ones and zeros here with computers, that's exactly the case--then it should quickly be apparent that, for any given length message, there're so many possible permutations of symbols to create a message...and you need exactly that same number of permutations in compressed form to be able to re-create any possible message.
Compression is handy because we tend to restrict ourselves to a tiny subset of the possible number of messages. If you have a huge library but only ever touch a small handful of books, you only need to carry around the first drawer of the first card cabinet. You can even pretend that the other umpteen hundred drawers don't even exist.
It's the same with text. You only need six bits to store most of the frequently-used characters in text, but we sometimes use a lot more than just the standard characters so they get written on disk using eight bits each. English doesn't even use every permutation of two-letter words, let alone twenty-letter ones, so there's a lot of wasted space there. You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text will make that kind of a look-up table optimized for written English at the expense of other kinds of data. ``The'' would be in the first drawer of the cabinet, but ``uyazxavzfnnzranghrrt'' wouldn't be listed at all. If you actually wrote ``uyazxavzfnnzranghrrt'' in your document, the compression algorithm would fall back to storing it in its uncompressed form.
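The bit counts above are just "how many bits does it take to number N items". A tiny Python check, with `bits_needed` as a made-up helper name:

```python
def bits_needed(n_items: int) -> int:
    """Smallest number of bits that can give each of n_items its own index."""
    return max(1, (n_items - 1).bit_length())

print(bits_needed(64))        # 6 bits cover 64 common characters
print(bits_needed(256))       # 8 bits cover every byte value
print(bits_needed(2 ** 18))   # 18 bits cover 262,144 dictionary slots
```

So an 18-bit index per word is enough for a dictionary of up to about a quarter of a million entries, which is the saving a text-tuned look-up table is after.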
Also, don't overlook the overhead of the data of the algorithm itself. If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good. It's sometimes helpful to think of it in terms of the smallest self-contained program that could create the desired output. An infinite number of threes is easy; just divide 1 by three. Pi is a bit more complex, but only just. The complete works of Shakespeare is going to have a lot more overhead for a pretty short message. And ``uyazxavzfnnzranghrrt'' might even have so much overhead for such a short message that ``compression'' just makes it bigger.
Cheers,
b&
All but God can prove this sentence true.
I remember years ago there was this horrible "joke" program. It claimed to compress files down to some amazingly small sizes. You could "compress" the file, then erase it, and "expand" the compressed file and it seemed to work just fine! It was done by recording the sectors on disk that a file occupied. So yeah, you can delete it and "restore" it... but try emailing that compressed file? Or expanding it a week later!
The description of the process sounds pretty good, but then again, so too do the medicinal properties of snake oil.
There's a very simple way to get much better compression - simply store the SHA-256 hash of every file instead. My average file size is about 126 Kbyte, so that's a 4000:1 compression.
OK, OK, you still have to store a full version of each file (or a traditionally compressed version). So for a single PC it doesn't make sense. But for an enterprise there are thousands of copies of those Windows OS files, tens or hundreds of those Powerpoint presentations, scatter-gun emails, etc - so why not just store them just once, and replace with the SHA-256 hash for every other version?
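A minimal sketch of that store-once idea in Python, using hashlib's SHA-256. The class and method names are hypothetical, and a real product would keep the index on disk rather than in a dict.

```python
import hashlib

class DedupStore:
    """Toy file-level de-duplicating store: identical contents are kept once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}   # SHA-256 hex digest -> contents
        self.names: dict[str, str] = {}     # file name -> digest

    def put(self, name: str, content: bytes) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)   # store only if unseen
        self.names[name] = digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())

store = DedupStore()
presentation = b"quarterly numbers " * 1000
for user in range(200):
    store.put(f"user{user}/slides.ppt", presentation)

print(store.stored_bytes() == len(presentation))   # True: 200 names, one copy
```

Two hundred users holding the same file cost one copy plus two hundred small index entries.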
Andrew Yeomans
I call this the "del" compression algorithm
Why are you letting these clowns ruin our country?
It's certainly possible (for some types of data) to perform LOSSY compression down to 25:1 - but this system is a backup system...you don't want lossy compression in a backup system!! So let's assume these guys are talking lossless compression.
The best current compression algorithms for English text come close to 10:1 lossless compression - so there is hope that their system could do that well.
Even simple run-length encoding will manage spectacular compression ratios well over 100:1 on images that are diagrams...but they typically manage zero compression at all for most photographs.
Most notably, if your files have ALREADY been compressed it is unlikely in the extreme that any lossless scheme will encode them further.
There is mathematical PROOF that you can't losslessly compress a generalized stream of random numbers at all.
So examining this claim, we have to deduce that - yes, a well implemented scheme using basic known technology would be able to get into the 25:1 range for SOME files. However, we know for sure that it won't get close to 25:1 for files full of essentially random numbers - notably, files already compressed by some other scheme.
We know then that this cannot be a bold, sweeping claim like "No matter what - you'll get 25:1 compression" - that's simply not possible - and you can prove it using math. So if that *IS* what they are saying - then we must yell "BULLSHIT".
However, if instead they are claiming "We can compress a typical PC user's file system by 25:1" - then maybe so. In a community of PC users, there will be lots of copies of the same files on lots of PC's - there will be lots of easy-to-compress text files and images of simple diagrams and such. If every PC has a copy of WORD installed on it - then large compression ratios are possible by merely noting this fact. Perhaps that's enough to overcome the likely 1:1 non-compression of that guy's copy of the first billion digits of PI, all of those ZIP and JPG files that are already well compressed. MAYBE we believe their claim for "typical" situations. However, there are no programs out there that can RELIABLY get better lossless compression than 10:1 for text or better than 2:1 for photos. There has to be an awful lot of easy to compress stuff to counteract the effect of a bunch of large photos and ZIP files. One 1MB JPG file has to be accompanied by 50MB of stuff that can be 50:1 compressed in order to average out to 25:1 overall. So their 'magic' compression scheme would have to be able to compress easy-to-compress files by a factor of maybe 100:1 or more in order to allow room for all of those JPG's and ZIP's.
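The averaging in that last step can be worked through in a few lines of Python. This is only a sketch of the arithmetic; `overall_ratio` is an invented helper and sizes are in megabytes.

```python
def overall_ratio(parts):
    """parts: list of (original_size, compression_ratio) pairs."""
    original = sum(size for size, _ in parts)
    stored = sum(size / ratio for size, ratio in parts)
    return original / stored

# 1 MB of incompressible JPG plus 50 MB of data that compresses 50:1
print(overall_ratio([(1, 1), (50, 50)]))     # 25.5

# Same 50 MB of easy data, but with 10 MB of JPGs alongside it
print(overall_ratio([(10, 1), (50, 50)]))
```

The incompressible part dominates the stored size, so every extra megabyte of JPG or ZIP drags the overall figure down quickly.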
That's a tall order indeed. I think that even for a typical PC's hard drive, this claim is BS.
What's for sure is that they are being a little dishonest by not qualifying their claim in some way.
www.sjbaker.org
This seems good, otherwise Google for "ows compression OR compress OR compressor", and according to this, OWS stands for the author's initials.
- Move "Sig". For great justice!
... you get 625x compression. Woohoo!!
You forgot:
6. Watch me pimp my shitty porn sites.
Way to go asswipe, heap mockery upon your potential customers, that'll get those hit counters moving!
Pardon my ignorance, but what does the Shannon Limit have to do with compression? From Wikipedia, the Shannon Limit describes the maximum bandwidth on a channel, comparing signal to noise, but nothing about compression.
1. Content-addressed storage
2. Stores diffs instead of full files where possible
(1) removes duplicate entries, which is good for repeated backups to the same place. (2) is good for storing similar files, *if* you can find them. It sounds almost like their storage is addressed with some form of non-crypto hash that only changes slightly between similar files ("no disk I/O", so they must be able to match things without actually looking at them). Overall, it sounds like what they do is very similar to git packs, except they claim to be able to do it without lots of I/O, and that claim is what suggests the specialized hashes. If this is the case, it'd be good for never-deleted nightly backups to the same disk and for systems with lots of similar files. It would get very good "compression" in those cases compared to dated .tbz files, but it wouldn't be (as) significantly better than other tools designed for that kind of usage.
Of course, if used to just compress your 100GB movies folder it still wouldn't be able to do much of anything. Implying that it would, as TFA does, sounds totally bogus. I doubt it would be more than 2-3x better than any other compressor designed for what it's being used for. (Supposedly git packs are really good like this, because of the "similar files" thing.)
So, basically it's compressed incremental backups, since almost every tape drive compresses its data stream before writing it to tape, and almost every backup software offers incremental backups where only what's changed since the last backup gets backed up...
What's the biggie?
Tar/gzipped rcs repository... so original.
That's nothing. I can compress a 1-terabyte truly-random one-time pad to one bit. So I can sell you two amazing products: Unbreakable encryption and unbeatable compression.
(I'd tell you whether the bit is "1" or "0", but then I'd have to kill you.)
"storing only the changes between similar byte streams"
"as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set."
Right, so, this claim is no big deal. This is called delta compression and it has been around for a long time. Online games use this method to compress updates sent to clients based on the previous updates received. So instead of sending kilobytes of info each update, the server sends, oh, about 25x less data. I believe it was Quake III that first used general delta compression for online games.
This is not a novel technique... which means they will get awarded a US patent and start suing willy-nilly.
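For the curious, delta compression in its simplest form looks something like the Python sketch below. It assumes both snapshots have the same length and records only the byte positions that changed; this is an illustration of the general idea, not the format Quake III or any backup product actually uses.

```python
def make_delta(old: bytes, new: bytes) -> list[tuple[int, int]]:
    """List (position, new_byte) for every byte that differs."""
    if len(old) != len(new):
        raise ValueError("this toy delta needs equal-length snapshots")
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]

def apply_delta(old: bytes, delta: list[tuple[int, int]]) -> bytes:
    """Rebuild the new snapshot from the old one plus the delta."""
    out = bytearray(old)
    for position, value in delta:
        out[position] = value
    return bytes(out)

previous = bytes(10_000)                 # last state the client acknowledged
current = bytearray(previous)
current[42] = 7                          # only two bytes changed
current[9_000] = 255
delta = make_delta(previous, bytes(current))

print(delta)                             # [(42, 7), (9000, 255)]
print(apply_delta(previous, delta) == bytes(current))   # True
```

The receiver needs the previous state to make sense of the delta, which is exactly the dependency on earlier backups that others in this thread point out.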
Your humor toggle is broken. How the hell did you get modded up?
Shouldn't that be 25/ compression?
"1. Reduces backup storage capacity 25X or more!"
So, after you buy their product, you'll have less backup storage capacity than you had before.
There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
How is that "humor"? What is the punch line?
Lord High Crapflooder The Right Honourable Vlad Craig Esther McDavenpherson III
Destroyer of Mercatur.Net
As Confucius said: It's easy to achieve high compression ratios by piping to /dev/null. Recovering the data back is the tricky thing.
Or it might have been someone else.
Get your own free personal location tracker
From what I can see, all they're doing is running differential backups. As long as you have the preceding files, you can get it all back. I've done this with database backups, both by using differentials and by manually creating a diff-style file.
On to the next article.
"Sometimes a woman is a kind of religion, she can save your soul & set you free from all your sins" - Bad Examples
This technique has been used by e-mail servers for a great many years. You scan an incoming email to a given individual to see if it is already stored (ie, sent to another customer); if so, you don't save the whole email, only the different 'To' and 'Date' lines. It's simple, trivial, and very effective. It also isn't new.
Mod parent down! Nobody needs to see goatse again...
This guy's the limit!
All I see in the replies is mathematical Shannon limits and how this is snake oil. It's not about compressing my 650MB ISO image down to 5MB. Say it with me people, de-duplication. This works especially well in the backup to disk space. Think about it, I'm doing incremental backups every day and full backups on the weekends. The vast majority of my second, third, and nth full backups are comprised of data I've already stored. Why store it again? Perhaps compression is the wrong word for it, but essentially you're storing many times what you could store without a de-duplication appliance.
This applies best to backup-to-disk scenarios but it's not limited to them. Another example, say an email with a 10MB Word/OpenOffice document attachment goes out to the whole company. 200 people save the attachment to their H: drive (sorry, /home/user around here). That's 2GB of space. With the method employed here I store the document once and then only store pointers to it. Your effective compression ratio is 200:1.
A step further, this can be applied at the block level rather than the file level. One of the 200 people above could change 1MB of the document. I only need to store that 1MB of changed data.
This stuff works, and it uses methods that have been around a long time. Don't be so quick to yell bullshit without understanding what's going on underneath the covers.
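The block-level step described above can be sketched in Python as a chunk store keyed by hash. The 4096-byte block size and the function names are assumptions made for this example; fixed-size blocks are the simplest variant, and they only line up when an edit does not shift the bytes after it.

```python
import hashlib

CHUNK = 4096  # hypothetical block size, chosen only for this example

def store(data: bytes, chunks: dict[str, bytes]) -> list[str]:
    """Split data into blocks, keep each unseen block, return the list of keys."""
    recipe = []
    for start in range(0, len(data), CHUNK):
        piece = data[start:start + CHUNK]
        key = hashlib.sha256(piece).hexdigest()
        chunks.setdefault(key, piece)    # blocks already present cost nothing
        recipe.append(key)
    return recipe

def restore(recipe: list[str], chunks: dict[str, bytes]) -> bytes:
    """Reassemble a file from its list of block keys."""
    return b"".join(chunks[key] for key in recipe)

chunks: dict[str, bytes] = {}
# 100 distinct 4096-byte blocks standing in for the shared document
original = b"".join(i.to_bytes(4, "big") * 1024 for i in range(100))
edited = bytearray(original)
edited[50 * CHUNK:50 * CHUNK + 10] = b"X" * 10   # one user changes one block

recipe_a = store(original, chunks)
recipe_b = store(bytes(edited), chunks)

print(len(chunks))                                  # 101 blocks stored, not 200
print(restore(recipe_b, chunks) == bytes(edited))   # True
```

The second copy of the document adds a single new block to the store; everything else is a pointer to data that is already there.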
I use a pair of similar devices at work, and they do get the job done for what they're designed to do--backups. If you're not saving the same data over and over again, then you'll never see any better compression than what you'd get with gzip.
The drives are typically set up over NFS or CIFS as a disk based storage addition to your backup software (think NetBackup or Networker here). Our environment is currently seeing about 15x compression over our retention period. Increasing that retention period would increase our ratio.
If you're thinking about getting one of these, keep in mind that an initial full backup of your environment will have to fit into the native storage on the device. The savings are seen when you do your next full backup, and the next, and the next. But if you're trying to fit 5TB native onto a 3TB storage device, you'll never even get off the ground.
This is old news. Capacity Optimized Storage has been out and in use for several years now.
I just got through evaluating de-duplication products for the company I work for, and contrary to what the article states, this technology is indeed being used today. Three main companies are vendors of de-duplication compression: Diligent, Data Domain, and Avamar. After looking at this stuff for 3 months, I can tell you it ISN'T smoke and mirrors and the technology DOES work. Some other posters have stated correctly however -- the compression factor increases based on how often the devices see the same data. For example, you back up a SQL server to the device, you might get a more reasonable 4:1 compression ratio, but on subsequent backups you get 98% compression on that same data.
Diligent is a fibre-attached gateway that will use whatever LUN is presented to it in a SAN. Its weakness is that the powerhouse features such as replication/open file system seem to be vaporware, limiting its usefulness for any serious applications.
Avamar is a rip-n-replace soup-to-nuts replacement for your existing backup infrastructure. That is its strength and its weakness. No company with a Veritas Netbackup infrastructure that it has built over the course of a decade is going to tear it out overnight to replace it. However it is a VERY cool product as it does all the compression on the client end so backing up a 20gb server might take 6 hours the first time but subsequent times the same server will back up in like 6 MINUTES.
Data Domain basically makes hardware appliances that do the same type of compression on the hardware end, they make devices that work with existing infrastructure. Basically you target the Data Domain box instead of your tape library. They have the advantage of basically being a 'snap-in' replacement for tape. The disadvantage is that unlike Avamar you are still piping ALL the data off of the server to the device where the compression happens. They have a good replication system, and seem to be a neck ahead of the other vendors in the COS horse race.
COS is great technology and it solves a lot of backup headache.
Make it a malt liquor. I want to be as clever and handsome as possible.
I do this all the time. I have a 23 megabyte tar file on my computer. Boss calls and says he needs it right away. I drag it to my USB flash memory drive. Copies over in a FLASH! I rush to work, stumble in to the boss's office, plug in my memory stick, and voila! Windows had dragged a 1.2kb shortcut .lnk file instead of the 23 megabyte original. Much grumbling.
Now THAT'S compression!
Network Redundancy Administration here.
When I raised a pig for agricultural purposes, it could compress 2.5 pounds of feed into 1 pound of weight gain.
It's too bad the pig was sold; raised for FFA, but assimilated to a 4H member.
Only problem with all that compression is the mice were always in the feed to spread virus, and the pig had to be de-wormed every 3 months or so (IIRC).
No different than what we at NRA deal with on house-calls; just a bunch of lazy pigs that want their OS cleaned and their Computer assembly smelling like lemon.
your friend,
Network Redundancy Administration, dude!
Gregory-Thomas
without prejudice
Check out Data Domain for a similar product. There are other people doing this stuff.
Look, there is a nearly trivial theorem that says you can't put more than N pigeons into N pigeon holes with no more than one pigeon per hole. And from this it can be deduced that there is no algorithm that is guaranteed to compress any N-bit stream into one with fewer than N bits. But a useful compression algorithm doesn't need to compress every single bitstream. It just needs to be able to compress the kinds of streams that come up in real life. This is a tiny fraction of the total number of streams that could possibly appear. So the standard no-go theorem does nothing whatsoever to prove that there isn't a useful 25x compression algorithm.
Having said that, this article is pure BS simply because it implies the existence of an algorithm that does an amazing job at characterizing the kinds of strings that might come up in real life. I don't believe that anyone can do that job as well as this story implies. And that's why I don't believe it, not because of some oh-so-smart-but-ultimately-useless theorem that people are bandying around to show how clever they are.
"The White House is not an intelligence-gathering agency," -- Scott McClellan, Whitehouse spokesman.
They are doing CVS-style backups. For instance, instead of storing the system state 100 times, you store one system state and 100 diffs against it. Of course, some compression is applied to the base state and the diffs as well. It also looks like they compress across multiple machines. So they are just applying compression at a scale and in a location where it isn't normally done: you normally don't compress across multiple backup generations or across multiple workstations. With 30 backups of 25 developer workstations, the dataset has so much redundancy that I'd be surprised if the compression ratio were only 25x. Here's a good question: how much do multiple backups help after that compressor? Perhaps they help if you need to get back to a specific state to undo things that happened after a certain backup. There is also the problem that if ONE set goes bad, *ALL* backups on all workstations go bad. The good news is that they probably have some redundant, RAID 1 style system below this compression layer. And taking tape backups of the compressed dataset every now and then would make it reasonable to have ALL the data on 100+ workstations on tape at the end of every day the backups are run, depending on how much data differs between workstations and how many changes happen on each workstation.
Emacs is good operating system, but it has one flaw: Its text editor could be better.
Currently the sourceforge site is down, but LZIP allows you to specify an arbitrary compression level that you want, and the algorithm picks through the data set until it reaches it. Further discussion is here.
I want to delete my account but Slashdot doesn't allow it.
You think I'm making this up, don't you.
Jim
As always, all IMO. Insert "I think" everywhere grammatically possible.
clearly fud hype. its not possible. its been proven.
Compression:
Decompression:
Obvious variants:
* = You'll need step 3 to find step 1 in a reasonable amount of time.
** = It would be fscking hilarious if someone were to prove that you can't always find a given chunk of data within {original-size} bytes of pi, so the offset might be bigger than the original data, and this algorithm wouldn't even be guaranteed to actually compress your data.
p.s. If I catch anyone trying to patent this, I'll refer the patent office to this post as prior art and also reveal the value that yields this hash: cb775b9b061b03e8666819ede2181d2e. Anyone that cracks this will get a chuckle.
The Canterbury Corpus Compression Test primarily measures the compression ratio, not the speed.
They are still in business if they have an algorithm that, when receiving repeat data over long periods of time, is faster than the two obvious approaches, e.g.:
a) uncompressing all related backups.tar.bz2 received so far, appending the new backup.tar, then compressing to new backups.tar.bz2
b) uncompressing a vcs like subversion.tar.bz2, committing the update, then compressing it to new subversion.tar.bz2 again.
Also, the methods above only work well on data that is somewhat sorted between users submitting it to backup.
I'm still trying to figure out what people mean by 'social skills' here.
Sure, I don't see 25:1 happening for arbitrary data types, but in the corporate market there is a lot of redundancy if you're clever enough to be able to identify it, especially for corporations that are large Microsoft Office + Microsoft Outlook users (which is to say "most large corporations".) A lot of the documentation is the same file attachments getting sent around to multiple people, often kept in Exchange mail servers as opposed to individual desktops, or documents that substantially re-use previous documents. Depending on how granular you want to be and how entrenched in the more bloated Microsoft formats you want your code to be, you may be able to find most of your document already in storage, as long as you've got indexing capabilities to look for it. Maybe you just look for hashes of whole documents, or maybe you look for documents with similar names and internal tags and start comparing pages.
Video compression is well-known to use this kind of approach - you've got an initial frame with reasonably-high resolution picture, then you track the changes, usually by some model that breaks the picture down into objects that move a bit. ... And then there's music compression "It's the same old country song with the same three chords in G, she's left him and she ain't coming back, except there's this little 6-note riff at the end of the chorus when he says she took his dawg with her too."
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
The article talks about backup. The idea could be that, instead of managing incremental backups, you just optimize compression of data that is similar to old data. That way you can do "full" backups but actually save only an incremental backup's worth of data.
See http://en.wikipedia.org/wiki/Venti for similar ideas in a system that easily achieves 25x compression for typical archival storage. When a file has been changed only those 512 kbyte blocks that are really new are saved, other blocks are just mapped by their SHA1 hashes to existing blocks. So files with small changes, very similar files and files sharing common parts will all compress very nicely. In a multi-user system the files of different users tend to also have lots of similar parts: same emails, same office documents with perhaps minor changes, same reference material / tools / libraries as personal copies etc.
My guess is TFA refers to a re-invention of this wheel, most likely in an inferior way.
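To make the hash-addressed block idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the BlockStore class and its method names are invented for this example and are not Venti's actual interface, the 4096-byte default block size is arbitrary, and blocks are cut at fixed offsets, so an insertion that shifts the data would defeat it.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: each distinct block is kept once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # SHA-1 hex digest -> block bytes

    def put(self, data):
        """Store data and return its recipe (list of block hashes)."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            key = hashlib.sha1(block).hexdigest()
            self.blocks.setdefault(key, block)  # no-op if already stored
            recipe.append(key)
        return recipe

    def get(self, recipe):
        """Reassemble the original bytes from a recipe."""
        return b"".join(self.blocks[key] for key in recipe)

    def stored_bytes(self):
        """Bytes of block payload actually kept."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = BlockStore()
    v1 = b"".join(bytes([i]) * 4096 for i in range(10))
    v2 = v1[:3 * 4096] + bytes([200]) * 4096 + v1[4 * 4096:]  # one block edited
    store.put(v1)
    store.put(v2)
    print(len(v1) + len(v2), "bytes written,", store.stored_bytes(), "bytes kept")
```

Two nearly identical files cost one file plus the changed blocks, plus the recipes, which are not counted in stored_bytes here.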
Anssi Porttikivi / app@iki.fi
One thing I've always wondered about with compression is if Quantum Computers ever become a reality. Now people always say a quantum computer can try every permutation for a bitset simultaneously cracking encryption ciphers instantly. So my question is if you had a large piece of data (such as an image taken by a voyager probe), could you generate a bunch of different checksums/hashes using algorithms and transmit those to a computer. Then on the computer try every possibility until you find the possibilities which all generate the transmitted checksums. Finally then run some AI on possible photos and rule out the obviously garbage ones until you find the correct one. (Figuring random data won't show a meaningful picture). So could that work if the all-powerful quantum computer becomes a reality?
Correct, the two best compression programs in the world are PAQ8H (command-line) and WinRK 3.0.3 (GUI). But both are very, very slow. Compressing 300 MB takes over 6 hours on an AMD 2800+.
I'm not here to defend Diligent Technologies. Their claim of 25x is well worded marketing crap. That being said, they make no claims towards 25x compression. That was done by the author Robin at the StorageMojo. Diligent claims to enable an effective capacity increase of disk systems by 25 times or more. A very weak claim when you look into the specifics, but not at all the claim of 25x compression being spread by StorageMojo. This is more of an example of a lie being spread by someone who did not check their facts.
If you look at the comments at that site, someone has already pointed this out. Robin's weak reply was:
"Well, I'm looking at a document from them that says "Reduce Required Backup Storage Capacity by 25X With 100% Data Integrity." Whether that is better compression, or better backup, I'll leave to others to decide. But if they can really do it, even if it is only 10x in practice, it is still huge compared to existing technology."
This is just another example of the bad side of the blogosphere. This is starting to piss me off.
This is entirely possible and they are not the only ones doing it, for example http://www.datadomain.com/ has been doing it for a while. The big storage vendors do it to some extent as well.
The idea is based on "de-duplication" of data and is only really practical for backups (where most data from backup to backup is identical) or central repositories of data for a large organization that has multiple similar data sets, for example, many installations of Windows that are often similar.
From my experience x25 is a bold claim for general data. I've seen small scale tests that showed x30 compression over backup sets but those implementations had performance issues.
From the description in their white-paper, despite their claims, it appears they are performing some kind of hash by definition (e.g. mapping a space to a smaller space).
The tradeoff is always performance when creating either compression or redundancy, as well as reliability.
Usually with more advanced compression comes more information dependency, lessening the chances of recovering a partial archive, a partial file, or dealing with any damage. This could make it bad for tape storage or anything that can have small portions damaged (CDs, etc.).
Additionally there's performance. Of course we can all compress and compress based on dictionary sizes and algorithms specific to any application, but as those dictionary sizes get bigger, a huge amount of memory and processor power is needed. Think about an algorithm that depends on 50% of a file... Huge calculations, and having to load the whole file into memory in order to work on only a small part of it.
We keep pushing the data transfer envelope, so why are we caring so much about packing compression? Think of the past 10 years: 10Mbit, 100Mbit, 1Gbit, 10Gbit have all had their day and many gone. Circuit speeds are increasing as well as wireless transmission speeds. Disc subsystems have enough trouble keeping up with Ethernet! You can't compress data as fast as you need to send it out, nor even read it from the disk!
Compression was super in a day of floppy disks and 9600bps modems, but it hasn't evolved much. ZIP and RAR are still what they were. Other formats from the early 90's have mostly disappeared (LHA, ARC, etc.) as, despite better compression, they just aren't needed.
Want proof? You can download your latest movie in XVid (700MB) or DVD-R (4.3GB) with similar quality- why are so many people downloading the DVD? Bandwidth at the consumer level is cheap and abundant. Broadband is everywhere. [note: I know the argument is that this compression is lossy and file compression isn't-- but the point is that bandwidth makes pulling extra data something of non-concern, compared to the user's processing time and interest].
-M
when you see the word 'Linux', drink!
So back in 1998 I started work for a company that had an interest in video streaming. There was some company that claimed to have a system that could broadcast streaming video and audio in realtime over a 28Kbps modem link with no visible degradation of quality and no lag. All built around hype such as "we've acquired the brightest signal processing engineer in the business who has made an astounding breakthrough."
So a colleague and I flew up to San Jose to research the product and attend a by-invitation-only demonstration. There were two black box endpoints set up across a serial modem. One end point was just an NTSC stream digitized by their system and provided to the serial link. The other end point supposedly uncompressed the stream and built an NTSC stream.
To test lag I asked them to yank the video cable from the digitizer and sure enough the receiving end instantly tracked the change.
Everybody at the demonstration was under full draconian non-disclosure agreements. I asked for them to take the cover off the boxes so that we could verify that there did indeed exist some sort of reasonable computation processing going on and not some sort of standard RF video transceiver link hidden in the boxes. They said absolutely not on the basis of trade secrets.
Of course we went home and never thought about their company again. Funny, I never saw any news about any adoption of their "fantastic, revolutionary" product.
Another story from the "If it's too good to be true..." department.
I will never live for sake of another man, nor ask another man to live for mine.
Didn't you know? "A combination of de-duplication and calculating and storing only the changes between similar byte streams" IS a breakthrough. No previous compression algorithm ever did that sort of thing before...
Procrastination -- because good things come to those who wait.
That is all.
I [may] disapprove of what you say, but I will defend to the death your right to say it.
>> I'm sure with the right developer, Linux could also be used to harness zero point energy, create wormholes for travel in your basement, and possibly cure most diseases... /wink
...
:-)
Well, you're just looking for karma there, but actually
If you think that those things you mention are impossible (or even just unlikely), then you haven't been paying attention to the way in which science continually reinvents the meaning of "impossible".
Here's a rather more likely worldview: nothing is impossible (and I do mean *nothing*, even logical impossibilities), given enough time. Even logical impossibilities just need a re-examining from a different angle. (Don't forget Godel -- the rules change when you examine a logical system from outside its domain.)
The vanquishing of "impossible" is actually a consequence of the fact that we cannot observe the structure of reality or nature directly, but only her behaviour, ie. the way she responds to our stimuli. As a result, the "impossibilities" that our primitive theories conjure up tend to evaporate over time, as new expanded theories replace them. And this will never end it seems. There is no reason to believe that any particular observation is a fundamental one.
So, pretty much nothing is impossible.
That 25x compression is based on disk storage on archived media. Which makes sense: you back up a database, how much of it actually changes? I didn't see what their method is, but it is possible, and for backup purposes.
Good example: I have 20 webservers. The OS is the same on each server, but the configuration is different (see where I'm going?). The software is smart enough to see that all the servers are the same and only the configurations are different, so it doesn't have to back up the same directories in their entirety over and over.
But for a long time, I've always thought the method of key'ed compression would lead to a better than 2x compression rate. I remember RLE and Bignum ascii compression that would lead entire ansi sites to less than 2k compressed, perfect for modems. All using Keys.
But that's a preset key, not a CPU-crunching processed key like bzip.
I think what they're saying here is that if you're backing up an entire hosting site, or an entire company set of documents, information, etc, that you will find a lot of redundant content. Then you add normal streaming compression on top of that.
So I can believe the 25x (as a generous/marketing figure) in this specific use case. It wouldn't work at all for compressing single files for distribution elsewhere because it requires that you have all the other documents as context.
This would be very annoying to do on the fly as well (what if your 'base' document that 12000 other documents are similar to changes?), but again is well suited for backup or read-only media.
Obviously nothing concrete or released yet so take with the requisite grain of salt.
Come on, editors. There are people who believe the world is flat and stars are little candles in the air who are shaking their heads in disbelief over this article.
No Digg!!!
You are in a maze of twisty little passages, all alike.
You are absolutely right. If I compress my disk into a simple .tar and transfer it daily by rsync, it's more than 1/25th compression. Not too much change every day, most of it is static data.
You cannot (losslessly) compress data beyond its entropy, H = -sum_i p(i) log2 p(i), where p(i) is the probability of the ith symbol and H is the rate in bits per input symbol. From this, we know that we cannot compress equiprobable random bits *at all*, and that a highly 'deterministic' data stream can be compressed to 0 outbits/inbit (in the limit as input stream size goes to infinity).
The amount you can theoretically compress depends on the input data.
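The entropy formula above is easy to try out. The sketch below is a minimal Python version under one big assumption: it treats each byte as an independent symbol and estimates p(i) from the byte counts of the input itself (a zeroth-order model), so it says nothing about redundancy between bytes or between files. The function name is made up for the example.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data):
    """Zeroth-order Shannon entropy of a byte string, in bits per byte."""
    total = len(data)
    counts = Counter(data)
    # H = -sum_i p(i) * log2 p(i), with p(i) estimated as count / total
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    samples = {
        "one repeated symbol": b"a" * 1000,
        "two equiprobable symbols": b"ab" * 500,
        "all 256 byte values equally often": bytes(range(256)) * 4,
    }
    for label, data in samples.items():
        print(label, entropy_bits_per_symbol(data))
```

Under this model the entropy times the number of symbols is a floor for any coder that looks at one byte at a time; coders that model longer contexts can go below the zeroth-order figure, which is why it is only a rough guide.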
Looking at the company's website, I can't see any mention of patents -- either issued or pending. If they really don't have any patents, I don't think they're going to get very far: Compression is one of the most over-patented fields around.
There aren't many details about how their product operates, but unless they've been extremely careful they probably infringe either the rsync patents (Pyne) or the blocklet patent (Williams).
Tarsnap: Online backups for the truly paranoid
I was gonna make some kinda 4th dimension joke about it using time to achieve its compression ratio, rather than just compressing the amount of space used... but it sounds like that's actually pretty much true!
:-p
Oh well, saves my brain power trying to word the joke
The revolution will not be televised... but it will have a page on Wikipedia
This is the third time that I can recall that a Slashdot editor has accepted this same hoax.
--
Before, Saddam got Iraq oil profits & paid part to kill Iraqis. Now a few Americans share Iraq oil profits, & U.S. citizens pay to kill Iraqis. Improvement?
Here is a chunk of data that cannot be compressed with your algorithm:
Hilarious, eh?
I mean, come on! Can we please stop with the stupid /. articles and get on with nerd news? The past week's been ridiculous.
“Our opponent is an alien starship packed with nuclear bombs. We have a protractor.” — Neal Stepnenso
Now, if this were true in any way, shape or form, I personally know several people who would have access to BILLIONS of dollars in development for video applications and television broadcasting applications, not to mention feature film distribution and/or production.
Currently, uncompressed High Definition video requires enormous storage, as well as massive bandwidth to play in realtime.
With a 25:1 data compression scheme (no image degradation), any laptop would be able to store hours of High Definition video.
If this can truly compress anything, then encoded video shouldn't be a problem. Which means a 25mbit video (DV video format) could be downloaded as a 1mbit stream...
25x compression would allow lossless compression of 4k video to be stored on regular miniDV tapes...
Now, having said that, I think I can say with a lot of certainty that the entire story is BULLSHIT, therefore none of what I just wrote will happen anyway, so...
-- This sig for rent.
There have been many posts criticizing this as vaporware, and only 2 posts explaining why it doesn't have to be.
The problem is more in the summary article (both the slashdot summary and the linked article) than in the feasibility of the technology. Rather than compressing a dataset 25:1, the company reduces the amount of space needed to back up a dataset by eliminating some redundancy.
Repeat: not data compression, but a backup technique. That's why it's not for home users.
It bothers me how many modpoints the trolls have gotten.
That's called the law of large numbers.
Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
Given a large enough set of backups and enough time, the potential size savings is enormous.
Veritas should really be implementing this themselves, though.
And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.
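For the curious, the rolling-hash trick mentioned above can be sketched in a few lines of Python. This is a simplified weak checksum in the spirit of the one rsync uses, not rsync's actual code: the function names and the modulus are assumptions for the example, and a real tool would confirm candidate matches with a strong hash instead of comparing bytes directly.

```python
M = 1 << 16  # modulus for both halves of the checksum (assumed for this sketch)


def weak_checksum(block):
    """Return (a, b): a is the byte sum, b is a position-weighted byte sum."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def find_block(haystack, block):
    """Offsets in haystack where block occurs, found by rolling the checksum."""
    n = len(block)
    if n == 0 or len(haystack) < n:
        return []
    target = weak_checksum(block)
    a, b = weak_checksum(haystack[:n])
    hits = []
    for start in range(len(haystack) - n + 1):
        # Cheap checksum test first, exact comparison only on a candidate.
        if (a, b) == target and haystack[start:start + n] == block:
            hits.append(start)
        if start + n < len(haystack):
            a, b = roll(a, b, haystack[start], haystack[start + n], n)
    return hits


if __name__ == "__main__":
    old_block = b"unchanged config section"
    new_file = b"# new header\n" + old_block + b"\n# new footer\n"
    print(find_block(new_file, old_block))
```

The point of the rolling update is that a block from the old backup can be found at any byte offset in the new file without rehashing every window from scratch.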
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
OK, doing the "check the incoming stream against what we already have and only store the difference" thing might just work. But it could be dangerous. Imagine losing just *one* of those differences along the way that you'll need to reconstruct the "uncompressed" file. Think RAID striping without parity or redundancy -- lose one drive and...
Their spiel here (I've visited their booth at SNW as well) is that it is primarily used as part of a virtual tape solution. Their software sits on a Linux box (they recommend a quad-Opteron with lots of RAM), emulates a tape library, then passes data to your backend SAN storage.
The compression they use for compression/data de-duplication seems to be in a similar vein to stuff used by Data Domain and other WAFS type solutions, just on a higher-bandwidth model.
If I recall correctly, Diligent is made up of some spinoff guys from EMC. (correct me if I'm wrong)
I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.
So then I tried it with LZW compression, and it still eventually grew in size.
The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.
What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.
Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.
So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.
Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.
And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.
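The experiment described above is easy to repeat. Here is a small Python sketch that feeds a compressor its own output a few times and records the sizes. It uses zlib (LZ77 plus Huffman coding) as a stand-in for the hand-written Huffman coder in the post, and the sample text is made up, so the numbers will not match the percentages quoted there.

```python
import zlib


def recompress_sizes(data, rounds=5):
    """Compress data, then compress the result, and so on; return all sizes."""
    sizes = [len(data)]
    for _ in range(rounds):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


if __name__ == "__main__":
    text = b"she's left him and she ain't coming back, " * 2000
    for round_no, size in enumerate(recompress_sizes(text)):
        print(f"after {round_no} rounds: {size} bytes")
```

I'd expect the first round to do nearly all the work and later rounds to stall or grow, since each round also adds its own header and checksum, but run it on your own data rather than taking that on faith.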
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Keep compressing till you have a couple of bytes?
Yes, you provided a base 10 counterexample: the string 1234 occurs at position 13,807 counting from the first digit after the decimal point. But then again, we all already knew that such examples exist, because we've all been to the base-10 pi-searching webpages; I could have saved you the trouble and listed some in my original post, but I was thinking base 2, not base 10 (Hint: the base 10 search space is less dense than the base 2 search space).
Anyway, (1) the counterexamples in base 10 do not prove that the numbers can't be compressed in say base 2 , and (2) even if you were to disprove it in base 2, all we have to do is throw in a few more common transcendental numbers and encode which number we're indexing. To fully disprove the method, you'd have to show that there are counterexamples in an arbitrary number of transcendental numbers.
For example, can you also find an example that doesn't appear in the first 1234 digits of e, e^pi, and ln2? That would only add 2 bits. If we're willing to accept more overhead or include a delimiter, I could create a gigantic table of transcendental numbers to be used. Then just encode the smallest value of log2(index-of-transcendental-number) + delimiter + log2(offset-in-the-chosen-number).
p.s. Besides, nobody really cares about compressing small files. They're only important when there are a lot of them, and then you just tar 'em together.
$ echo 1234 | wc -c
5
$ echo 1234 | gzip --best | wc -c
25
$ echo 1234 | bzip2 --best | wc -c
42
Using the 1234 counterexample we might be tempted to throw out the baby with the bathwater and say that gzip and bzip2 are horrible compression algorithms. hehe.
Guess what? It is IMPOSSIBLE to create a generic compression algorithm. Gzip operates by doing exactly what you mention: operating on a particular set of data: that being data with some exploitable redundancy. There are plenty of files that will get bigger when you give them to Gzip.
Entropy coders work by making assumptions about the probability distribution of the data they receive. They assume they are working on a set of data in which certain types of data are more likely than others, so they store those more compactly, but as a result they HAVE to store others less compactly. No matter how you slice it, you can not store more than 2^n unique strings in n bits. The only gains you can make are by assuming that you aren't going to be dealing with all possible strings, and compacting the ones that you care about.
That may have actually been what you meant, but I really didn't want anyone reading that to get the impression that there was something magical about entropy that made it a different approach than narrowing the set of data you are storing. The two are fundamentally the same thing.
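A quick way to see both halves of that trade-off is to hand the same compressor one input it is built for and one it is not. The Python sketch below uses zlib, which implements the same DEFLATE method as gzip but with a smaller wrapper; the helper names and sample inputs are invented for the example, and the "random" bytes come from Python's seeded pseudo-random generator rather than a true random source.

```python
import random
import zlib


def pseudo_random_bytes(n, seed=0):
    """n bytes from a seeded PRNG: no redundancy that DEFLATE can exploit."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def ratio(data):
    """Compressed size divided by original size (above 1.0 means it grew)."""
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    print("redundant text:", ratio(b"slashdot " * 5000))
    print("random bytes:  ", ratio(pseudo_random_bytes(65536)))
    print("tiny input:    ", ratio(b"1234\n"))
```

The redundant input shrinks dramatically while the other two come out larger than they went in, which is the "HAVE to store others less compactly" part in action.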
What is the current max?
Imagine storing a terabyte of data on a single disk, and it all runs on Linux
Why can't the same concept be used to compress on Mac, BSD, Windows, and Solaris?
Sure, everyone knows there's no way to mash an arbitrary file down 25x. There's a trivial proof for that. But in this case it sounds like they're talking about 25x compression across multiple files. That is, if you store two identical files, then the second is a pointer to the first. If you have a bunch of jpegs, then you cat them all together into a new file (while keeping the originals around), the new file is super small. At least that's how I read the article.
To rephrase what I was getting into at the end: Small numbers are obvious exceptions. You could have said 2 and made the same point: "the string 2 occurs at position 6 counting from the first digit after the decimal point." ;-)
:)
The effective size of the "useful" search space is related to n/lgn. So I'm basically saying it's possible to conceive that there is some point after which there are no counterexamples, or where the counterexamples are so sparse that the probability of finding a matching counterexample in more than one transcendental number is unlikely: n/lgn eventually starts growing fast as n gets large, but when n is small, lgn is large compared to n, so the search space is unnecessarily restricted.
If I set my level of significance to 64 digits, then you're going to have a heck of a time finding a counterexample >=64 digits long unless you can mathematically prove that such counterexamples exist. But why stop at 64 digits? That's not even worth compressing. Let's talk 2^20 or more digits!
To further compound this, you'll also have to prove that the intersection of significant counterexamples in k different search spaces is non-empty; however, it's not even clear that you could hope to prove such with a 10-digit level of significance (before you run out and quote 0123456789 which occurs first at 17387594880, let me remind you that I mean base 2 and I've raised the ante and required that you show it for at least 4 transcendentals).
In short, this is not a simple problem to prove/disprove.
This is a virtual tape library vendor. If you backup your database (email server, home directories, etc) 25 times to their system and the data doesn't change much their software will find identical blocks and eliminate the redundancy creating a 25x compression. There is another vendor or two that do the same thing.
Although it's not for every file, sometimes this can be a huge win. In my case, backing up 60 versions of a 700kb XML file, I get 500:1 compression, 30 times better than what bzip2 gives me. Anytime you have a file where you know that it will have redundancy across more than 900kb, but less than 900mb, rzip can win big.
It sounds like this company's program is a variation of this idea, designed with backups in mind to identify redundancy across tens or hundreds of gigabytes.
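The effect of match distance is simple to demonstrate with Python's standard library. This sketch does not use rzip or bzip2: it swaps in zlib (whose DEFLATE window is 32 KB) as the short-range compressor and lzma (whose default dictionary is several megabytes) as the long-range one. The test data is a made-up worst case, a block of pseudo-random bytes followed by an exact copy of itself 100,000 bytes later.

```python
import lzma
import random
import zlib


def repeated_blob(block_len=100_000, seed=1):
    """An incompressible block followed by an exact copy of itself."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_len))
    return block + block


if __name__ == "__main__":
    blob = repeated_blob()
    print("original:   ", len(blob))
    print("zlib (32 KB window):     ", len(zlib.compress(blob, 9)))
    print("lzma (large dictionary): ", len(lzma.compress(blob)))
```

The short-window compressor cannot see that the second half repeats the first, so it saves nothing, while the long-range one stores the copy almost for free. Scale the same idea up to gigabytes and you have the backup case.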
I love how people make "claims" of stuff like this, and then there's never anything done already. It's like when you find a page that says:
And then you scroll down and it shows the most recent news posting was made in November of 1998, the only code is a semi-operational bootloader, and nothing else has been written yet. Believe it or not, there's a ton of open source vaporware out there with fancy web descriptions like the one simulated above.
... running in Windows Whatever on my Intel Mac concurrently with OSX.
Don't we go through this every year or so?
- There is no such thing as a universal compression algorithm.
- Compression algorithms are specific to the data they are compressing.
Anyone claiming to be able to compress everything by a uniform amount is lying, period.
Just to make sure that people don't get the wrong idea here from your post, the snake oil that you refer to wasn't the fractal compression scheme by Iterated Function Systems.
That company didn't survive, but their compression system works fine. I did some projects with it back at university.
In a nutshell, it involved finding recursive algorithms to generate output (and storing just the coefficients of the equations), so the more fractal self-similarity that there was in a scene, the better the compression achieved. An image containing (for example) leaves or trees with their highly self-similar structure would achieve absolutely enormous compression, and in general, most things in nature would do reasonably well.
As always, the amount of compression was highly dependent on the input, but there's nothing unusual in that.
That's not entirely true though. You can compress random data, but only given two assumptions:
The data has a non-uniform probability distribution.
You know that distribution.
The trick behind designing compression algorithms is coming up with intuitions about the probability distributions of useful classes of real data, and then coming up with computationally tractable ways of exploiting them.
..or at least built a system in which identical blocks of data would only ever be stored...once.
Only 25%? Last year I figured out how to compress 1,000,000TB down to a floppy; They are so behind the times.
Tip: Compression is fast, but decompression is very slow.
And yes, I have mailed myself my notes as a form of prior art.
- d
That's great! But I'm wondering, if it can compress ANYTHING 25x, if I feed it with 1000110100111011101011101 will it give me a 1 or a 0? [/sarcasm]
You just got troll'd!
Yeah, but imagine a beowulf cluster of these..
Conceivably, if you had enough time on your hands, you could get almost any size file down to just a few dozen bytes
Actually, if you let it run recursively for about 257 years, you'll eventually shrink it to a couple of bytes, meaning that basically a fourth of all the files of the world can be compressed to this : 01
I truly hope you were trying to be sarcastic, though.
You just got troll'd!
I don't know about these guys but the idea of 25x compression is not in itself a problem. Depends on your definitions, data, and time and computing resources. For example wavelet based "fractal" compression IIRC gave a 400:1 ratio for certain features, plus actually generating data to make zoomed in photos look realistic. Fractals and other functions can also be used to compress data losslessly if you have a hairy enough library and computer, from what I remember. But when they start talking about MP3s etc then it starts sounding like BS. And the post? What does "a terabyte on a disk" mean anyway? ++ to more slashdot meaningless posts.
Sheesh...when did you last get laid?
Obligatory Soundbite Catchphrase
There is no bijective mapping from any finite set to a smaller finite set. QED. The only way to create a good compression scheme is to restrict the domain of "likely" strings; exploit relative frequencies. Although it's creative and amusing, your compression scheme cannot work in general, and I'd expect it to actually inflate file sizes significantly in general.
There is a field of science called information theory. It studies "information content" and related things like data transmission and error-correcting codes.
If I have a 10Mbyte file, it usually contains way less than 80 million bits of information. So, compression programs like "zip" and "gzip" can make the file smaller.
The theoretical limit however is the actual information content. Suppose an information theorist analyses your file and concludes that your file contains 40 million bits; then gzip or any other compression program will have a hard time compressing the file beyond that (unless the compression program "cheats" and compresses the file as "Rogers 10Mb file #1", and has the original file elsewhere).
Now, in practise I have a 440Mb spam-archive which compresses to 108Mb. This is only a factor of 4. If you realize that most spams are delivered tens of times, it must be possible to do a lot better. So if someone claims to be able to compress my spam mailbox a lot better, I can believe them.
Information content in mp3's and images is near 100%. If anybody claims to be able to compress more than 20% out of one of these, they are full of crap on theoretical grounds.
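To put a rough number on "information content", here is a minimal Python sketch. It assumes a zero-order model (every byte treated as an independent symbol), which is only one crude estimate and not the true limit; the sample "spam" message and the function name `entropy_bits_per_byte` are made up for illustration.

```python
import math
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Zero-order Shannon entropy: treats bytes as independent symbols."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


# Hypothetical sample: one short "spam" message delivered many times.
sample = b"Buy cheap pills now! Limited offer.\n" * 2000

estimate = entropy_bits_per_byte(sample) * len(sample) / 8  # in bytes
packed = len(zlib.compress(sample, 9))

print(f"raw size:            {len(sample)} bytes")
print(f"zero-order estimate: {estimate:.0f} bytes")
print(f"zlib output:         {packed} bytes")
```

zlib lands far below the zero-order estimate here because the redundancy is in the repetition of whole messages, which a per-byte model cannot see. That is the same reason a spam archive full of repeated deliveries should compress much better than 4x.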
Or just use the pigeon-hole principle...
Easy as pie. Suppose that there is an algorithm that can (reversibly) compress any string of n bits to a string of n-1 bits. There are 2^n strings of n bits. There are 2^(n-1) strings of n-1 bits. No function from a set of 2^n elements to a set of 2^(n-1) elements is injective, hence none is bijective. Contradiction. If you really must, proceed by induction to prove cases where the algorithm maps from sets of cardinality 2^n to sets of cardinality 2^(n-j).
The same reasoning explains why you can't make a constant size cryptographic hash function that never repeats itself.
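The counting argument can be checked by brute force for small n. This is only an illustrative sketch: `drop_last_bit` is a made-up stand-in for "any compressor that always saves one bit", and `find_collision` is a hypothetical helper name.

```python
from itertools import product


def all_bitstrings(length: int) -> list:
    """Every bit string of the given length, in lexicographic order."""
    return ["".join(bits) for bits in product("01", repeat=length)]


def find_collision(compress, n: int):
    """Return two distinct n-bit inputs with the same output, or None."""
    seen = {}
    for s in all_bitstrings(n):
        out = compress(s)
        if out in seen:
            return seen[out], s
        seen[out] = s
    return None


# Hypothetical "compressor" that always saves exactly one bit.
def drop_last_bit(s: str) -> str:
    return s[:-1]


n = 4
print(len(all_bitstrings(n)), "inputs,", len(all_bitstrings(n - 1)), "possible outputs")
print("collision:", find_collision(drop_last_bit, n))
```

Swap in any other function that maps n-bit strings to (n-1)-bit strings and a collision still turns up, since 16 inputs cannot fit into 8 outputs without sharing.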
After all, I am strangely colored.
Both de-duplication and diffing at the file system level are useful. If done intelligently, they could probably save lots of space on a standard Linux or Windows file system. Of course, they are nothing new; the reason they aren't in the file systems of today is mostly that it's hard to implement them efficiently enough; right now, file system authors are still struggling with just keeping their various tables and data structures consistent.
By the way, if you want a de-duplicating data backup solution, there are a bunch of them around; faubackup is a simple example.
given a sufficiently large N?
This is a back up system, not a single file compression (although for framed data like video, email, etc.. the compression scheme is still clever).
Basically it's a CVS: if you're backing up multiple computers or user directories, you're going to see tons of repeated files; heck, they'll even have the same name. Saving the diffs is a good idea, and not at all difficult to duplicate.
For instance what if you were doing back up for a team of animators. Their files are HUGE, but 90% of the frames will be identical between the individual systems. (indeed the frames between one another will likely be very similar) You could get far more than 25x compression that way. The big downside of this idea is the memory & CPU vs. speed trade off. You can't use this kind of system to back up to a tape or DVD system, it needs to be random access media.
You could probably get nearly the same results by hacking rsync and diffing identical file names in different directories. Possible bonus for diffing files of similar file type.
It's a clever idea, not a radical new technology.
I would rather be ashes than dust!
If you tracked deltas within files, you could look to xdelta as a filesystem, or possibly CVS.
If you were just tracking changed files, you could look to Plan 9 filesystem or Dirvish.
What might be up: Picture backing up a number of fairly similar machines (say, a group of Windows machines built from a common image), & noting duplicated files, only saving each once. You could count the space saved by a link as compression. If you have a homogeneous sample, you save lots of space & claim ridiculous compression.
Wow... storing a terabyte of data on a single disk !! All you need is a terabyte disk - ground breaking.
Try this: write a program to output to a file the integers from 1 to a million using some universal code. Then try compressing the file using (eg) gzip. I bet that comes close to what you refer to as the mathematical limit. But as you can see it's nowhere near. The program itself is the optimum compression.
So really, it all depends on how much structure is inherent in the data and how easy it is to detect that structure.
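The experiment is easy to try. A rough Python sketch follows, with two assumptions that are mine and not the poster's: plain decimal text (one integer per line) stands in for the "universal code", and zlib stands in for gzip (both use DEFLATE).

```python
import zlib

# Assumption: decimal text, one integer per line, in place of a universal code.
data = "\n".join(str(i) for i in range(1, 1_000_001)).encode("ascii")
packed = zlib.compress(data, 9)

# The generating program itself, written out as source text.
program = b'print("\\n".join(str(i) for i in range(1, 1_000_001)))'

print(f"raw file:     {len(data):>9} bytes")
print(f"zlib level 9: {len(packed):>9} bytes")
print(f"program text: {len(program):>9} bytes")
```

The general-purpose compressor shrinks the file, but the few dozen bytes of program text that regenerate it are far smaller, because the program captures structure that zlib's sliding window does not look for.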
Great... but I don't think home-users would need this.. most have a lot of space left on their hard disks even after storing everything they like... however, media companies might find this very useful... -- http://www.kudige.blogspot.com/
I can easily imagine having a terabyte on a single disk. In fact, LaCie, Hitachi and Seagate already sell such, among others. Disks are cheap and getting cheaper. Flash memory is more expensive, but getting cheaper even faster. I'm waiting for when the savings in mechanical breakdowns, power, heat and space makes flash memory more economical for petabyte storage than tapes and harddisks. Mechanical storage is for wusses:)
-Lars
As the input gets large, there are exponentially more pigeon holes than pigeons. Read on if that doesn't make sense. I'll explain! :)
We're essentially dealing with a pseudorandom sequence where all inputs are equally likely, and we can define an arbitrary origin within the sequence to create a new pseudorandom seed. The chance of seeing some given binary pattern is 1/2^length; this is why you can find small counterexamples: the length of the number is close to the number itself, so the probability of finding the number within the pseudorandom sequence up to 2^length from the offset is rather low (hit or miss). However, this changes when N/lgN gets large.
Let's suppose I consult /dev/random and get a string of 2^31 bits (256 MegaBytes if encoded RAW). If I get *really* lucky and find that it exactly matches the 0 offset, then I can encode the length in only 32 bits (4 bytes), and I'd have 64 million:1 compression. However, if I don't get so lucky then I'll have to search a bit for that string. If I find its start at 2^31 bits, not to worry: I encode its offset in 32 bits and its length in 32 bits, and my compression ratio is 32million:1. Using the pigeonhole principle, I can search until the 2^(2^31)- 2^31 - 8th bit and still maintain a 1:1 compression ratio. Let me rephrase that: I can search roughly 2^2147483647 potential pigeon holes before I have to put more than one pigeon in a hole.
Note: the pigeons are only 2^31 bits large in this example, so we could just say 2^2147483616 holes and ignore potential matches that aren't aligned to 2^31 bits for now. Since the random probability of the transcendental pseudorandom generator (pi) matching a string of 2^31 bits is roughly 1:2^2147483648, and there are only 2^2147483616 holes, one could argue that there could be up to 32 pigeons per hole -- or rather that there's only a 3% chance that we'd see a compression ratio better than 1:1. If we consider the unaligned matches, naive counting of the first 32 offsets pushes our statistical odds back to ~100% (note: 100% in this case does not mean it's guaranteed to happen; rather that on average it's likely to happen N times for N numbers).
In short, I'd argue that it's statistically almost guaranteed that you could get at least a 1:1 compression ratio, and it's statistically "likely" that you could get a compression ratio on the order of 1,000,000x for any random string over 256Megabytes. And if pi:0 doesn't give you a good enough compression ratio for your given input, pick a different transcendental or pick some other origin within pi and report back.
p.s. Remember the first post where I said finding the answer would require a prophet? There is no way to search pi through 2^2147483647 digits without an oracle. [HINT: You might find all the written works created by humans, plus all the works ever created by monkeys throwing feces at a typewriter before you find the value you're looking for.]
There are ways to really compress any type of data though.
Take this number 141592653589793238462643383279. I can compress it very well:
[the first 30 digits of pi after the comma]
Ok, not so much compression there. But let's keep in mind that pi is irrational, so its digits never settle into a repeating pattern. If pi is also normal (widely believed, though unproven), then ANY finite sequence can be found in pi. That means that the ISO from Vista is hidden somewhere in that sequence. The problem is knowing where.
Say that I have a databank here at Compression-U Inc. This mighty database holds the number pi up to bazzillion digits. Via a very easy and quick algorithm I can find certain sequences in that number. Now let me find the ISO for Vista for you.
[4.2gb of digits, starting from digit 2^383715-1 of pi after the comma]
There you go. I might even include the formula of how to calculate pi with it, and still retain amazing compression.
So in short, it _can_ be done, it's just very time intensive.
Well-encrypted data should look absolutely random. In that case, there are no patterns, hence compression algorithms won't be able to compress anything. Try to compress an encrypted file, and you will get something greater in size [the overhead used by the compression tool].
Note: applies to _well_-encrypted data
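A quick way to see this for yourself, sketched in Python. The assumption here is that `os.urandom` output is a fair stand-in for well-encrypted data (both should be indistinguishable from random bytes); no actual encryption is performed.

```python
import os
import zlib

random_like = os.urandom(1_000_000)          # stand-in for well-encrypted data
redundant = b"all work and no play " * 50_000

for label, blob in (("random", random_like), ("redundant", redundant)):
    packed = zlib.compress(blob, 9)
    print(f"{label:>9}: {len(blob)} -> {len(packed)} bytes")
```

The random input comes out slightly larger than it went in (header, checksum and block overhead), while the repetitive input collapses to a tiny fraction of its size.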
The saddest poem
This is my personal White Paper on lossless compression. Note this is no joke thread like that newb who posted he can get his to 1 bit. I affect random binary data. It achieves approximately, per cycle, an 81% remaining size of the original file. I theorize the end limit to the size is in a range of 10 bytes to 10 kb. It will be different for every file type. Note this is an EXCEL file that has been RARed. It was too big to upload normally. It is MEMORY intensive. I would prefer not to do it in excel except I lack the proper software to replace that crappy program. Here is the link to my website: http://www.security1.free2host.net/Compress.php
I am ready for the big jump in life, who will jump with me?
Of course the amount of 'compression' you get is firmly under YMMV.
It looks like somebody discovered the immense data storage capability of ProTracker modules and PostScripts...
http://outcampaign.org/
What these guys are doing is not compression. It's commonality factoring. No piece of data is ever stored more than once. Typically, this is done by a hashing algorithm that starts at a high level and indexes everything down to a discrete block size of a few K.
Each block gets an index checksum, then each file, each subdirectory, each parent directory, and so forth until the entire disk volume has a cumulative hash. Then, it's very easy to determine a) what has changed (and where), and b) what has been seen before.
When a backup starts, the client compares the volume hash signature to that on the backup system. If it matches - nothing has changed. Backup over. If it doesn't, then you walk the indexes to find out exactly what has changed, and then only prepare to send those dirs, files, or discrete blocks - whatever's the smallest object that expresses the delta. When those objects are queued to send to the repository, the client first generates a hash of the object and asks the repository if it's seen it before. If not, it sends the index and the data. If so, it sends nothing, since the repository's already got that particular chunk stored somewhere. There's some re-hashing and index reverification on the other end to make sure that all is consistent.
Therefore, each backup appears to be a "full" backup, not a file-level diff, since the entire image is comprised of a map of every object in the volume. In reality, each backup is a set of pointers into a hashed data store (commonly called a CAS, or Content Addressed Store) from which it is reconstituted as needed.
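A toy version of the block-level part of this can be sketched in a few lines of Python. This is not any vendor's implementation: the fixed 4096-byte block size, the use of SHA-256, and the names `BlockStore`, `put` and `get` are all assumptions for illustration, and the hierarchy of file, directory and volume hashes described above is left out.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: "a few K" per block


class BlockStore:
    """Toy content-addressed store: each unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes

    def put(self, data: bytes) -> list:
        """Store data and return its recipe: the ordered list of block hashes."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)  # no-op if the block was seen before
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild the original data from its recipe of block hashes."""
        return b"".join(self.blocks[key] for key in recipe)

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


store = BlockStore()
os_image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))  # 8 distinct shared blocks
backup_a = store.put(os_image + b"a" * BLOCK_SIZE)  # desktop A: shared image + own data
backup_b = store.put(os_image + b"b" * BLOCK_SIZE)  # desktop B: shared image + own data

logical = 2 * (len(os_image) + BLOCK_SIZE)
print(f"logical: {logical} bytes, stored: {store.stored_bytes()} bytes")
```

Each backup is just a list of pointers, so both look like "full" backups on restore, yet the shared image is stored only once. With fixed-size blocks, inserting a single byte near the start of a file shifts every later block, which is one reason real products tend to be more elaborate than this.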
Having tested and deployed one of these types of systems, I can say that a) it's great for desktops, where most of the data between boxes is identical - the OS, the core apps, etc, and only the user data and localization is different, and b) it's awful for pre-compressed data like streaming audio, video, JPEGs, PDFs, etc. Since compressed data is entropic data, there can be no commonality within the file or versions of it, unless the file itself is identical and present from multiple sources. Change one byte of a file and recompress it, and all the blocks are unique.
However, this is not new. Google Avamar Technologies and Arsenal Digital. BTW - this tech is pretty good for remote backups over low-bandwidth links, since it vastly reduces the amount of data that needs to traverse the wire.
At 3 A.M. you can see people's auras; at five you can see their contrails...
. . . so I might be able to clear up some confusion. The word 'compression' is probably not the right choice. 'De-duplication' is probably a better word. Try this: "ProtecTIER can achieve a 25:1 de-duplication ratio." That sounds more accurate to me. Currently it works as a virtual tape engine. Take 10+ TB of disk and attach to a Linux server (x86_64 only). ProtecTIER makes that disk look like a tape library and tape drives filled with tape cartridges for use by an enterprise backup system like Veritas NetBackup, IBM/Tivoli TSM, Legato NetWorker, etc. Most large companies today use a pretty similar backup strategy: Fulls once a week, incrementals the other days; weekly fulls are kept for 2-8 weeks, 'monthly' fulls are kept 2-6 months, daily incrementals are kept for 7-21 days. Depending on the retentions chosen, that's 10-30 or more copies of the same data, plus the maybe 5-10% that actually changed. ProtecTIER gets the 25:1 ratio by eliminating the redundant copies.
The algorithm is pretty elegant, actually. It holds a meta data index in RAM. As data comes in (at rates up to 200MB/s) it looks for a similar data set already stored. It reads the old data in, does a diff against the new data, stores the unique data untouched and uses pointers to refer to the duplicate data. With this method even if the system is completely wrong about which existing data set to match with, the data will be safely stored (with a low de-duplication ratio in this instance).
Yes, the product works as advertised. If you don't have several terabytes of data to protect in an enterprise environment, it's probably not for you. But, if you do have a large environment and are tired of dealing with tape, this product rocks.
Try this: write a program to output to a file the integers from 1 to a million using some universal code. Then try compressing the file using (eg) gzip. I bet that comes close to what you refer to as the mathematical limit. But as you can see it's nowhere near. The program itself is the optimum compression.
Some search strings for you to try: "Claude Shannon" "information theory" "Shannon limit" "Lempel-Ziv compression"
So really, it all depends on how much structure is inherent in the data and how easy it is to detect that structure.
Yes, it depends on the characteristics of the data, but no a priori knowledge of possible structures is assumed. You can come up with all kinds of ways to generate randomness, but your compression algorithm would need a lot of overhead to be able to utilize all of them. Real data is also unlikely to perfectly match any given type of specific randomness, so now you'll have to add complexity to the algorithm if you want to make use of these structures; you'll need to figure out how to optimally correlate data segments with known structures or slight offsets from these structures. At some point, you'll use more data for the overhead that describes each segment's structure type than you had in source data.
The situation you describe works for very specific cases, but it isn't particularly useful in reality.
OK, my bad.
Diligent is not using the term compression AFAIK, but neither are they really deploying this approach yet outside of initial testbeds. Data Domain has been selling a product like this for years, has hundreds of happy customers using it and more than a thousand units in the field. And we came up with a brand, Global Compression(TM), in 2003 to mean the combination of finding long sequences and storing them uniquely across many TBs of stored data (see below) + traditional LZ-style compression.
We sell our system only as a target for backup data, which is extremely redundant. On a first full, we tend to see 2x-4x compression effect. Subsequent file incrementals, 6x-8x. Subsequent fulls, 50x-60x. Aggregate compression effect across a couple months of retention tends toward 20x in a weekly full / daily incremental policy. Exchange or Oracle fulls-daily can be 50x, short retention can be 10x. Mileage varies especially by backup policy, but also (within the 2x factor) by data type. And as mentioned in the postings, the challenge is to get it to go fast; our implementation does this. Early alternatives, such as the Venti filesystem in Plan 9, don't.
Should it be called compression? In lieu of a better term, at least compression is descriptive to a user -- the effect is to compress the backup data. In network equipment they call this technology Wide Dictionary Compression, but it has a half dozen other names. The mechanism of finding a sequence and referring to the original the next time it comes up is pretty much the same as traditional compression, it's just harder to put into silicon because of the size of the referencing window. But it wasn't anticipated by the seminal compression papers many years ago, so there's some debate. In storage, lately, it's starting to get called Deduplication, despite the existing use of that term in databases, and despite another half-dozen vendor terms. Examples of alternatives include capacity optimization, factoring, data coalescence and sequence reduction. It's only starting to settle down.
Full disclosure: I was at VA Linux in the team that acquired Andover, thus Slashdot, back in the day. Hope that worked out OK.
I believe Hamilton 95 had sub-Heisenberg compression a long time ago. Sub-Heisenberg compression can be used iteratively to compress compressed data again, using irrational numbers and advanced quantum mechanics. It could store the whole OS in 1 bit! You can still get this great software by FTP to 127.0.0.1.
These people are just offering a cheap rip-off that is limited to 25x compression. Don't be fooled!
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
"This mighty database holds the number pi up to bazzillion digits. Via a very easy and quick algorithm I can find certain sequences in that number. Now let me find the ISO for Vista for you.
[4.2gb of digits"
Stop there for a minute.
Everything you've said is true, but let's just consider how big "a bazzillion" has to be before it would stand a reasonable chance of containing any given 4.2gb sequence of bytes. After all, if your database holds 4.2 billion digits of pi, then you'll only have one sequence that long to offer to compress for people. You have two possible sequences of length (4.2billion minus one) and three possible sequences of length (4.2billion minus two) and four possible sequences of length (4.2billion minus three) and ... I think you see the pattern.
So how big would your database have to be in order for it to have a (for example) 50-50 chance of being able to represent any given 4.2gb of data?
Umm, the maths is beyond me actually, but it's going to be somewhere of the order of 4.2 billion factorial.
You haven't beaten the rules. Sure, with a database of size 'N', you can get great compression ratios on those subset of the strings of length less than 'N' that occur in your database. But that's only a tiny tiny tiny fraction of all possible strings of length less than 'N'. All those other ones get longer. The incredible, huge, massive, ginormous compression ratio that you get on the (tiny tiny tiny fraction of) strings that you can compress is /exactly/ balanced out by the small increases in size that get applied to the (vast vast vast majority of) strings that you can't compress.
... In fact, you not only don't win, you don't even break even.
And what's more, you can't even get out of the game.
Listen, there really is /no/ way around Shannon and the laws of thermodynamics.
Beyond that, there's also the problem that, as you say, pi contains no repeating sequences. If the data you want to compress is a repeating sequence, pi isn't gonna help any.
my experience is that bzip2 is great for things like source archives but on a lot of other data rar beats it.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
For every compression algorithm, there exists a data stream which, when "compressed", will actually grow in size.
This is pretty easy to prove. Compression maps a string of N bytes to a string of M bytes. If you consider that there are 256^N strings of that length, and (-1+256^N)/255 strings shorter than N, then there will have to be strings which when "compressed", stay the same size. Moreover, an algorithm also needs to map strings smaller than N into the same space, and you can't have collisions if you expect to restore the original string via decompression. This means that some strings will need to get bigger.
If you look at it from the other side, a compressed string of M bytes can only represent 256^M possible uncompressed strings. For an example of M=1, you could design an algorithm that compresses the entire works of William Shakespeare, and 254 other works, into a single byte [1], but it would only be useful for those 255 data sets. Any other data set would need more bytes, and an additional byte at the start to indicate it was not one of the stock data sets.
The point is that a compression algorithm is only useful for the kinds of data it was designed for. Most compression is designed for data with repeating patterns. Data without patterns cannot be compressed by these algorithms. If they claim 25x compression, then they need to tell us which specific kinds of data they expect it to work with, because there exist many files which would get an actual ratio of 0.9999~.
[1] Shakespeare et al. Compression Algorithm: If the first byte is 0, return the entire works of Shakespeare. If {1..254}, then return Work{1..254}. If 255, read all following bytes as the entire contents of the original file.
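The footnote's scheme fits in a few lines of Python. The placeholder strings standing in for the 255 stock works are obviously made up; the point is only the escape-byte mechanics.

```python
# Hypothetical stock library: index 0 stands in for the entire works of Shakespeare.
STOCK = [b"<entire works of Shakespeare>"] + [f"<work {i}>".encode() for i in range(1, 255)]
ESCAPE = 255


def compress(data: bytes) -> bytes:
    if data in STOCK:
        return bytes([STOCK.index(data)])  # a stock work shrinks to one byte
    return bytes([ESCAPE]) + data          # everything else grows by one byte


def decompress(blob: bytes) -> bytes:
    if blob[0] == ESCAPE:
        return blob[1:]                    # literal data follows the escape byte
    return STOCK[blob[0]]


for original in (STOCK[0], STOCK[254], b"hello"):
    packed = compress(original)
    print(len(original), "->", len(packed), "bytes")
```

Round-tripping works for every input, and the price is exactly the one described above: 255 inputs get a spectacular ratio and every other input comes out one byte longer.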
Mark of the Coder fades from you. You perform Opening on World of Warcraft. Warcraft crits GPA for 4. GPA dies.
I already proved that it won't work. Your argument is obviously flawed.
After all, I am strangely colored.
poopdeville wrote:
> I already proved that it won't work. Your argument is obviously flawed.
Yeah, but I proved that your proof is flawed, so nyyyyyyaaah. :P
Seriously though, please read this and respond to anything you disagree with. :-)
Step 1:
Input = N bits. Therefore, we're permitted N bits for output before compression ratio is over 1. That means in loose terms we can scan through 2^N bits of pi looking for N bits, except we need to reserve space for the size of the input and some sort of delimiter, so really we're only allowed to search through 2^(N-lgN-k) bits of pi looking for N bits. (* It might be more practical to assume that k is O(lglgN) or even O(lgN), but that won't change our calculations.)
Step 2:
If we search on 1-bit boundaries, we're going to find at most N-lgN-k unique patterns, so *at best* we can only get 1:1 compression on (N-lgN-k)/N of the inputs if we use only pi as the encoder, or in other words we won't compress (lgN + k)/N of the inputs. Even if we conservatively estimate k as 2*lgN + 8, this still gives us likelihood of compressing 99.91% of the inputs in the range of 65536 bit long (in fact, it makes you really wonder which 56 inputs wouldn't be compressed -- we might even consider hardcoding them somewhere. hehe).
Step 3:
The previous step says that (lgN+k)/N won't be compressed. To combat this, we select some other transcendental number. We pay one extra bit in penalty to double our odds of finding a compression, and now statistically speaking everything has more than 100% chance of being compressed (or rather, that we're "likely" to find 2 matches for "most inputs" and only 1 match for the rest). If you're not satisfied, add one more bit and allow us to consider 4 different transcendental numbers. (* Strictly speaking, we just scaled it as 2*N/(N+1) or 4*N/(N+2), which may have an adverse effect on the compression ratio for small inputs, but the final expression is X * (N-lgN-k) / (N-lgX), so the limit approaches X as N gets large.)
Challenge:
Find one 256-bit input that cannot be encoded in 256 bits or fewer using this method with pi and e as the encoders (let the first bit be 0 if you use pi or 1 if you use e). Using the conservative estimate of k, we'd expect 25% of the inputs to fail on either one, so you've got a decent chance of finding a counterexample that fails both. I'll be waiting.
Having said that I believe this is similar to another backup data compression algorithm I saw a presentation for a couple of days ago. There are two parts:
1) A database of unique chunks of data.
2) A blueprint of index numbers that define how the data fits together.
It takes a look at the data stream in X-bit chunks; if a chunk is unique, it stores it in a database and stores an index pointer to it; if it has been seen before, then it just stores the index pointer.
Obviously as this index gets bigger it takes longer to search through but there is less chance of a non-unique chunk. If this is done in a Disk-2-Disk-2-Tape situation it can take the backup of the server(s) onto a HDD and then run this algorithm at its leisure to get the compressed version for tape. I'm assuming as they are marketing this for TB levels of data that they have this one worked out - at least for this level of data.
Another issue is that you get less and less compression as your index number takes up more bits (i.e. more and more unique chunks). This isn't going to be a practical problem in the near future as the one I was looking at was taking 8KB chunks. This means that to get enough unique chunks to get the index to be the same size as the data it's replacing (8KB) you need at a minimum 2^65536 bits (10^19709 exabytes). This is simplified, but even if you have a couple of orders of magnitude as a fudge factor for overhead in storing the index numbers you aren't going to run into this problem soon.
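The size of that number is easy to sanity-check in Python, since its integers are arbitrary precision. This sketch simply takes the post's own figure of 2^65536 bits at face value and assumes a decimal exabyte (10^18 bytes).

```python
import math

chunk_bits = 8 * 1024 * 8        # an 8KB chunk is 65,536 bits
total_bits = 2 ** chunk_bits     # the post's figure: 2^65536 bits
total_bytes = total_bits // 8

# math.log10 accepts arbitrarily large Python integers.
exponent = math.log10(total_bytes) - 18   # assumption: 1 exabyte = 10^18 bytes
print(f"about 10^{exponent:.1f} exabytes")
```

For scale, the number of atoms in the observable universe is usually put at around 10^80, so running out of index space is not the practical limit here.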
There is also a problem if you don't have too many repeating chunks. In fact there may be an *increase* in file size if you don't have many as you now have the overhead of the database to worry about.
So what's the answer to the scoffers' 'can you feed its output to itself?'
The answer would probably be yes but each time through you have less repeating chunks, therefore more unique ones so the database overhead eventually gets to be a problem i.e. you keep running its output through itself and it eventually comes near the theoretical minimum and oscillates, getting bigger then smaller, then bigger again.
Nobody knows how to search. There is one published app and its interesting.
nuff said
Hey, I'm just your average shit and piss factory.
Seriously though, please read this and respond to anything you disagree with.
Honestly, fuck that. I'm not going to waste my time deciphering the argument supporting your claim when I already know it's false. That sort of thing can be instructive, but only if the flawed argument is essentially insightful. This isn't.
Some flaws I gleaned with a quick scan:
After all, I am strangely colored.
I use a variety of means that first actually increase the file at a variety of stages, but make it into an easily reduced format.
I skew the ratios statistically early, giving me a larger chance of occurrence elsewhere. I use a variety of change outs to make the most likely ratio to occur items into then the best to compress types of data. I also utilize a revolutionary new means to track actual data flow, reducing the size in one region by 25%. In all I can achieve a standardized compression rate of approximately 84% on random binary data, with no loss, and repeatable.
All the information is on this website http://www.security1.free2host.net/Compress/compre ssstart.php and will prove without a doubt in the math section the actual capabilities of this code for those who can follow higher end computer based mathematics.
I even go so far as to say that I can compress the entire world's knowledge to a DVD at worst, and a floppy at best, and I can prove it on this website.
I am ready for the big jump in life, who will jump with me?
I use a variety of means that first actually increase the file at a variety of stages, but make it into an easily reduced format.
I skew the ratio's statistically early, giving me a larger chance of occurrence elsewhere. I use a variety of change outs to make the most likely ratio to occur items into then the best to compress types of data. I also utilize a revolutionary new means to track actual data flow, reducing the size in one region by 25%. In all I can achieve a standardized compression rate of approximately 84% on random binary data, with no loss, and repeatable.
All the information is on this website http://www.security1.free2host.net/Compress/compre ssstart.php and will prove without a doubt in the math section the actual capabilities of this code for those who can follow higher end computer based mathematics.
I even go so far as to say that I can compress the entire worlds knowledge to a DVD at worst, and a floppy at best, and I can prove it on this website.
I am ready for the big jump in life, who will jump with me?
I use a variety of means that first actually increase the file at a variety of stages, but make it into an easily reduced format.
I skew the ratio's statistically early, giving me a larger chance of occurrence elsewhere. I use a variety of change outs to make the most likely ratio to occur items into then the best to compress types of data. I also utilize a revolutionary new means to track actual data flow, reducing the size in one region by 25%. In all I can achieve a standardized compression rate of approximately 84% on random binary data, with no loss, and repeatable.
All the information is on this website http://www.security1.free2host.net/Compress/compre ssstart.php and will prove without a doubt in the math section the actual capabilities of this code for those who can follow higher end computer based mathematics.
I even go so far as to say that I can compress the entire worlds knowledge to a DVD at worst, and a floppy at best, and I can prove it on this website.
I am ready for the big jump in life, who will jump with me?
I don't even need to read your silly paper to call bullshit because it's already been proven that it's impossible to compress truly random data in the general case by even 1 bit. Do you also spend your days trying to draw maps that violate the 4-color theorem? Crank.
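For anyone who wants to see the counting argument behind that rather than take it on faith, here is a minimal Python sketch. The function names are made up for illustration, and zlib is only a stand-in for "a real lossless compressor"; it has nothing to do with the code on the poster's site. The idea: there are 2^n bit strings of length n but only 2^n - 1 strings that are shorter, so no lossless scheme can shrink every input, and only a vanishing fraction of inputs can be shrunk by many bits.

```python
import os
import zlib
from fractions import Fraction


def count_inputs(n_bits: int) -> int:
    """Number of distinct bit strings of exactly n_bits bits."""
    return 2 ** n_bits


def count_shorter_outputs(n_bits: int) -> int:
    """Number of distinct bit strings of length 0 .. n_bits - 1."""
    return 2 ** n_bits - 1


def max_fraction_shrunk(n_bits: int, saved_bits: int) -> Fraction:
    """Upper bound on the fraction of n_bits-long inputs that any lossless
    scheme can map to outputs at least saved_bits shorter."""
    short_outputs = 2 ** (n_bits - saved_bits + 1) - 1  # lengths 0 .. n - k
    return Fraction(short_outputs, count_inputs(n_bits))


def random_data_ratio(size: int = 1 << 20) -> float:
    """Compress `size` random bytes with zlib and return output/input size."""
    data = os.urandom(size)
    packed = zlib.compress(data, 9)
    assert zlib.decompress(packed) == data  # round trip must be lossless
    return len(packed) / len(data)


if __name__ == "__main__":
    n = 64
    print("inputs of", n, "bits:", count_inputs(n))
    print("strictly shorter outputs:", count_shorter_outputs(n))
    print("fraction shrinkable by 8+ bits is at most:",
          float(max_fraction_shrunk(n, 8)))
    print("zlib ratio on random bytes:", random_data_ratio())
```

On random bytes the zlib ratio should come out at or slightly above 1.0, since the container overhead is added and nothing is saved. An 84% reduction on random input would need the fraction bound above to be close to 1, and it is nowhere near that.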