Slashdot Mirror


New 25x Data Compression?

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.

438 comments

  1. What kind of data? by Short+Circuit · · Score: 4, Insightful

    I can create a compression algorithm that compresses my 2GB of data to 1 bit. But it would be crap for any other datastream fed to it.

    1. Re:What kind of data? by ivan256 · · Score: 5, Insightful

      The article says:

      it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.

      In other words, they're full of crap.

    2. Re:What kind of data? by Hao+Wu · · Score: 1, Insightful

      Like saying that a library card is the same thing as a library.

      --
      I suggest you read Slashdot
    3. Re:What kind of data? by slimey_limey · · Score: 4, Insightful

      So it can compress its own output? Sweet....

    4. Re:What kind of data? by swimboy · · Score: 5, Funny

      It can compress anything! At the demo, I saw them compress 25 oz. of snake oil so that it all fit in a 1 oz. jar!

      --
      Ask me how the Heisenberg Principle may or may not have saved my life.
    5. Re:What kind of data? by devjoe · · Score: 5, Insightful

      Well, there's an idea here that might hold some truth. Note that they are marketing it to data centers, people with LOTS and LOTS of files. Because people tend to have multiple copies of the same files, they can achieve great compression by eliminating the duplicate copies in the archive -- or likewise, any files with large sections that are the same among various files.

      20 email accounts subscribed to the same mailing list? Store the bodies of those e-mails only once, and you save a big chunk of disk space. A bunch of people downloaded the same MP3 file? We only need one copy in the archive. As long as there are multiple copies of the same data, it can compress any type of data.

      The difference here is that they are taking advantage of the redundancy of files across an entire filesystem (and a HUGE one), rather than the redundancies within an individual file. (I would assume they also do the latter type of compression with a conventional algorithm.) 25x compression seems extreme, but I am sure they can achieve some extra compression here.
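
      A minimal sketch of the whole-file de-duplication idea described above, assuming a plain SHA-256 content hash is used to spot identical copies. The `dedupe` function and the mailbox names are hypothetical illustrations, not anything taken from the vendor's product.

```python
import hashlib

def dedupe(files):
    """Keep one copy of each distinct content; map every name to a hash."""
    store = {}   # digest -> bytes (the single stored copy)
    index = {}   # file name -> digest (a cheap pointer to the copy)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)   # only the first copy is kept
        index[name] = digest
    return store, index

# Hypothetical example: 20 mailboxes holding the same mailing-list message.
mail = {f"user{i}/inbox/msg1": b"Same mailing-list body\n" * 100
        for i in range(20)}
mail["user0/inbox/private"] = b"Only one person got this\n"

store, index = dedupe(mail)
raw_bytes = sum(len(d) for d in mail.values())
stored_bytes = sum(len(d) for d in store.values())
print(raw_bytes, "bytes of files,", stored_bytes, "bytes actually stored")
```

      The ratio here comes entirely from how many copies exist, not from the data type, which is consistent with the "any type of data" wording as long as duplicates are present.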

    6. Re:What kind of data? by Anonymous Coward · · Score: 0

      I can also create a compression algorithm that compresses any 2GB of data to 1 bit. It just so happens that it tends to be a little lossy.

    7. Re:What kind of data? by tverbeek · · Score: 5, Informative
      I just fed Diligent Technology some bogus personal data and downloaded their brochure, and as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set. So your initial full backup will be compressed at mathematically-possible-in-this-universe ratios, and your subsequent incremental backups - which only store the changes compared to the previous backup - will (with typical data scenarios) be much smaller. It's incremental backups on the byte level, basically.

      So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.
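
      A rough sketch of a byte-level incremental backup, assuming a generic diff (Python's `difflib.SequenceMatcher`) stands in for whatever the product really uses. `make_delta`, `apply_delta`, and the sample records are made up for illustration.

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus the literal changed bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse bytes from old backup
        elif j2 > j1:                          # 'replace' or 'insert'
            ops.append(("data", new[j1:j2]))   # store only the new bytes
        # 'delete' needs no entry: those old bytes are simply never copied
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new version from the old backup and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

# Hypothetical data set: one record edited between two nightly backups.
monday = b"".join(b"record-%04d;" % i for i in range(100))
tuesday = monday.replace(b"record-0042;", b"record-0042-edited;")

delta = make_delta(monday, tuesday)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(len(tuesday), "bytes in the new version,", literal_bytes, "stored as literals")
```

      Measured against the second backup alone the "ratio" looks enormous, which is the redefinition being described: the saving is across versions, not within one data set.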

      --
      http://alternatives.rzero.com/
    8. Re:What kind of data? by Poltras · · Score: 1

      Have you ever tried to randomly access a movie and get a full image using MPEG-4, which uses a similar technique? If you are able to just switch to a randomly chosen frame and have a full image instantly, look at how much processing power it needed.
      I'm betting my money that if it is used in the way you suggest (which is what TFA also suggests), then the benefit won't surpass the processing power it will ask. Again, hdd space is cheap, cpu power is expansive. If it is used for bandwidth or small medias, then the question is how it will compare to other well known techniques (e.g. gzip) and how are they going to push it.

    9. Re:What kind of data? by a_nonamiss · · Score: 1

      Sounds like the same thing that AMANDA has been doing since 1997.

      What is old is new again.

      --
      -Arthur
      Cave ne ante ullas catapultas ambules
    10. Re:What kind of data? by fyndor · · Score: 5, Informative

      You hit the nail right on the head. No compression can ever make a statement that it can compress anything by ANY set value, unless the value you're talking about is zero :) This would imply that you could compress the output of a compression process and compress it 25 times more. Then take that output and compress it 25 times more. Then take that output... See where I'm going? You could say that MOST files of DATATYPE_X will compress UP TO 25x, but there will always be the exception to the rule. There is no such thing as a free lunch. You can't have infinite compression... but it'd sure be a lot cooler if ya did :)
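
      The argument above is the standard counting (pigeonhole) argument. Written out, with $n$ as an arbitrary input length in bits (notation introduced here, not in the post):

```latex
\[
\underbrace{2^{n}}_{\text{inputs of exactly } n \text{ bits}}
\;>\;
\underbrace{\sum_{k=0}^{n-1} 2^{k} = 2^{n} - 1}_{\text{outputs shorter than } n \text{ bits}}
\]
```

      A lossless compressor has to map distinct inputs to distinct outputs, so at least one $n$-bit input cannot shrink at all, and feeding the output back in repeatedly must stop gaining.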

    11. Re:What kind of data? by Methlin · · Score: 1
      So they're not exactly lying ..., they're just redefining the term ...
      Just like any good marketeer, which is why we continually have to come up with new terms to describe what the old term USED to mean; see: opt-in
    12. Re:What kind of data? by mnmn · · Score: 1

      How about this data set across time:

      cat /dev/random

      or

      dd if=/dev/random of=/tmp/file ; sleep 1 ; dd if=/dev/random of=/tmp/file2

      Can it compress it even 2x reliably?

      --
      "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
    13. Re:What kind of data? by TheDreadSlashdotterD · · Score: 1

      I walk away for a few hours and suddenly I find my old post. what was old is new again.

      --
      I have nothing to say.
    14. Re:What kind of data? by networkBoy · · Score: 4, Funny

      1.
      I can compress anything you give me by a factor of at least 1 (inclusive of my own output).

      "-1 pedantic", I know.
      -nB

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    15. Re:What kind of data? by morcheeba · · Score: 2, Funny

      I've heard you have to do the decompression carefully, though -- If you do it too quickly, you just end up making a big mess.

    16. Re:What kind of data? by Short+Circuit · · Score: 1

      Aw, come on. I posted a thank-you reply. :)

    17. Re:What kind of data? by Anonymous Coward · · Score: 0

      you're an idiot

    18. Re:What kind of data? by TheDreadSlashdotterD · · Score: 1

      I know. No worries.

      --
      I have nothing to say.
    19. Re:What kind of data? by TheNetAvenger · · Score: 5, Funny

      In other words, they're full of crap.

      But the Slashdot Post says that it all runs on Linux. And knowing the infinite power of Linux, I believe them.

      In addition to being the best OS in the world, Linux is also the most secure, does everything better than every other OS, and if given the right developers it is the ONLY os that could do something as impressive as compress data past the limits of possibility.

      I'm sure with the right developer, Linux could also be used to harness zero point energy, create wormholes for travel in your basement, and possibly cure most diseases... /wink

    20. Re:What kind of data? by qazwart · · Score: 1

      This product has been around for quite a while. They use it to put six whole tomatoes into a can of Hunts Tomato Paste.

      (Boy, am I dating myself...)

    21. Re:What kind of data? by Intron · · Score: 1

      While you're up, get me a pack of Luckies and some Brilcream.

      --
      Intron: the portion of DNA which expresses nothing useful.
    22. Re:What kind of data? by drDugan · · Score: 1

      "hdd space is cheap, cpu power is expansive"

      did you mean expansive or expensive? It certaining *is* expansive, but I don't think that's what you meant.

      I would say that of the big four: cpu, drives, network bandwidth, and memory the only one that is going to be really interesting moving forward is memory. cpu power and drives are not the limiting factor for most of the interesting things I see people trying to do.

    23. Re:What kind of data? by Xabraxas · · Score: 1

      You don't even need AMANDA to do that. You can shell script it.

      --
      Time makes more converts than reason
    24. Re:What kind of data? by Anonymous Coward · · Score: 0

      Another example. To compress a directory full of pr0n, it takes signatures of each image and combines duplicates. To gain additional compression, it swaps male and female and labels the result "gay", mirrors top to bottom and labels the result "fetish" and modifies hue-saturation values and labels the result "ebony".

    25. Re:What kind of data? by fuzzix · · Score: 1
      In other words, they're full of crap.

      Anybody remember ZeoSync?
    26. Re:What kind of data? by Poltras · · Score: 1
      It certaining *is* expansive, but I don't think that's what you meant.

      uh yes. but you probably meant "certainly" ;)

      as for memory, I _believe_ I'd rather buy a 500 mhz 2G ram with 500 Gb hard drive than a 3.0 Ghz Xeon 2G ram with 50G hdd. Both being considered equivalent in term of performance (on the network) / data contained.

      In the war of memory/cpu consumption (because basically compression is just about trading one for the other), memory still wins (in my heart) because no viable solution for low-end cpu exist.

    27. Re:What kind of data? by weffew... · · Score: 1
      ... So basically they are doing what IBM Tivoli Storage Manger has been doing for a few years with differential backup.
      Big deal.

      Wef

    28. Re:What kind of data? by Poltras · · Score: 1
      I'm sure with the right developer, Linux could also be used to harness zero point energy, create wormholes for travel in your basement, and possibly cure most diseases...

      but what's the use if it cannot cook and bring your breakfast at your bed the exact moment you wake up? Oh that and play music... without using speaker, that is.

    29. Re:What kind of data? by tylernt · · Score: 1

      "Because people tend to have multiple copies of the same files,"

      I see this all the time on our file servers. Different departments need copies of the same stuff in their own shares (not to mention the fact that every user has their own copy of WMP10 in their user folder). I've mitigated the problem to some extent with a "public" folder that is softlinked to a central location, but this has drawbacks if one dept deletes a file, it's gone for everybody else too.

      I would love to have a filesystem that automatically detects dupe blocks and links them, and then unlinks the blocks that get changed later. Since I'm not aware of any Linux or Windows filesystems that do this, the technology that TFA talks about would be a nice to have.

      --
      DRM 'manages access' in the same way that a prison 'manages freedom'
    30. Re:What kind of data? by Anonymous Coward · · Score: 0

      "It certaining *is* expansive,"

      Did you mean certaining or certainly? When trying to be a pedantic dick, it helps if you don't look like a doofus yourself.

    31. Re:What kind of data? by j1bb3rj4bb3r · · Score: 1

      So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.

      That's kind of like the spot on the Colbert Report last night where they've redefined marshlands to include golf course water traps, and as a result our overall marshland has actually increased!

      --
      *yawn*
    32. Re:What kind of data? by smerkel · · Score: 2, Interesting
      About a year ago, we went through an evaluation of different data protection technologies to replace a tape based platform we were using at the time. I wanted to get away from tape if I could - I simply had a problem buying 5x more media than data I was protecting.

      I came across a company called Avamar (http://www.avamar.com/). They do something similar, except they deduplicate the data on the client side, before it ever traverses the network. Needless to say, I was a bit skeptical with their claims. I was able to con them into letting me eval the platform for 3 months. As it turns out - it works as advertised.

      I was able to consolidate 2 large tape libraries (L700's) into 8 x Dell 2850's - all running Redhat Enterprise, all with 6 x 300GB SCSI drives in them. We are currently protecting 7TB of data (Note: The hardware is currently 58% utilized). We process 20,000 backup jobs a month. And as an example, pulled directly from last night's activity log, we performed a full backup of a Windows box with 100GB of data on it in less than 15 minutes. (Note: The science behind this is very practical.

    33. Re:What kind of data? by billcopc · · Score: 1

      Reference counted filesystems are a tricky affair because you have to be damned certain to not wipe the data as long as there's at least one reference to it. Certainly feasible but it doesn't have the appeal of mainstream usage, as it would likely be very expensive in terms of processing and caching. For a file server it would be great, but for everyone else it's overkill.

      --
      -Billco, Fnarg.com
    34. Re:What kind of data? by tfb · · Score: 1

      Although I was as irritated as anyone by the obviously-bogus-looking claims, it's significant that being able to do this kind of incremental thing is pretty interesting (and other people do related things). Given ballooning data volumes, frequently faulty internal accounting (not charging for backup volume/retention/recovery-time) and increasingly fussy regulatory environments, backup volumes are often becoming a serious problem. Even relatively small organisations can end up putting away tens of terabytes a week, and large ones must be frightening. So tricks which can reduce the storage requirement this way may be well worth doing. Although probably not as worth doing as fixing the underlying stupidity which causes the problems in the first place in many cases. But fixing stupidity is hard.

    35. Re:What kind of data? by inKubus · · Score: 1
      "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux."


      Editor: Welp, this is going on the front page.

      --
      Cool! Amazing Toys.
    36. Re:What kind of data? by failure-man · · Score: 1

      The dead giveaway for full of crap, at least in my mind, is claiming that it will compress encrypted data. It's always been my understanding that encrypted data appears as a cryptographically random jumble of bits.
       
      No pattern == no compression. Compressing random data is the CS equivalent of the perpetual motion machine . . . . . .

    37. Re:What kind of data? by WhiteWolf666 · · Score: 3, Funny

      oOo. Sounds like you are going to find the data singularity.

      A single byte that is all other data compressed together, and from which all knowledge flows! The universal black hole of data!

      Don't tell me .... is this a new MS Vista technology?

      --
      WhiteWolf666 an exBush supporter. All you new-school,compassionate,save the children Republicans can rot in hell
    38. Re:What kind of data? by Apparition-X · · Score: 1

      Well, I will reply to the top post in the vain hope that this gets moderated somewhere (positive). Amidst the vast number of posts on this subject, the overwhelming consensus seems to be that this is impossible. Unfortunately the overwhelming consensus is dead wrong.

      Here is how it works. One, we are not talking about compression of individual files; we are talking about "compressing" a data stream composed of *a lot* of data. The more the better. 10s or 100s of TB (that's terabytes, kids, 1000x a GB, 1000000 x a MB) is ideal. The technology then compares blocks that the data is composed of, and retains pointers to the blocks that are the same, rather than the entire block. Therefore, it doesn't really matter if your data is composed of an Oracle database, a bunch of mp3s, and a few million Excel spreadsheets, it is all good.

      Now for the catches: you need a lot of data, as mentioned. It helps if this is backup data, a lot, because you are highly likely to capture redundant files (either across different machines, or over time) and therefore redundant blocks. Finally, it is slow. Very slow. When compared to an enterprise backup that might be capable of generating 1000 MB/s on aggregate to a few dozen tape drives, it is show-stoppingly slow. Think sub 100 MB/s.
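
      A minimal sketch of the block-and-pointer scheme described above, assuming fixed 4 KiB blocks keyed by SHA-256. The block size, the function names, and the two sample "backups" are all assumptions for illustration; real products may chunk and index quite differently.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size

def store_stream(data: bytes, blocks: dict) -> list:
    """Split a stream into blocks; keep unseen blocks, return pointers."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        key = hashlib.sha256(chunk).digest()
        blocks.setdefault(key, chunk)   # a block seen before costs nothing
        recipe.append(key)              # the stream becomes a pointer list
    return recipe

def restore_stream(recipe: list, blocks: dict) -> bytes:
    """Follow the pointers to rebuild the original stream."""
    return b"".join(blocks[key] for key in recipe)

# Hypothetical backups: eight distinct blocks, one of which changes overnight.
blocks = {}
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
tuesday = (monday[:3 * BLOCK_SIZE] + bytes([99]) * BLOCK_SIZE
           + monday[4 * BLOCK_SIZE:])

recipe_mon = store_stream(monday, blocks)
recipe_tue = store_stream(tuesday, blocks)
print(len(recipe_mon) + len(recipe_tue), "blocks written,",
      len(blocks), "blocks stored")
```

      The block contents never matter to the scheme, only whether an identical block has been seen before, which is why the data type is irrelevant.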

      For other competitors to datamojo, try Copan, Sepaton, and DataDomain.

      And I will be the first to admit that I did not RTFA, and I do have the advantage of being familiar with a lot of backup technologies. So I don't know for sure, but if the article mentions any of the points I raised, it is depressing at best that out of the dozens of posts moderated positively that hardly a one mentions any of the facts above. Holy moly.

    39. Re:What kind of data? by kesuki · · Score: 1

      I have a pretty simple solution for you on that 'shared' directory, rather than linking the folder itself, symbolically link all the files in a special folder, when people 'delete' a file they're 'deleting' the symbolic link to it, and have a crontab set up to scour those 'shared' directories for 'added' files, and then put a little 'script' that outputs a ls of the real content of that folder, and then another to allow users to 'add' files that weren't in their magic folder before.

      that would take away the 'accidental' deletion issue, but then takes away some of the ease of people sharing the files with everyone.

    40. Re:What kind of data? by MikeFM · · Score: 1

      I'm sure it can store any kind of data but the compression ratio will vary. If you feed its own output back into it eventually it'll reach a limit where it can't compress anymore. For any normal kind of file though it's a perfectly reasonable concept. I already save a lot of space with my own little filesystem (FUSE based) that avoids saving duplicate files and compresses files that aren't already compressed. Over a large amount of files that saves an amazing amount of space because many files are duplicate. You'll notice that they are targeting these systems for large data warehouses where a lot of files are duplicates and many files have a lot of redundant data.

      So make fun of their marketing, marketing is always full of bullshit, but I wouldn't throw this product out as bullshit without hearing more.

      --
      At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
    41. Re:What kind of data? by rw2 · · Score: 2, Insightful

      1.
      I can compress anything you give me by a factor of at least 1 (inclusive of my own output).

      "-1 pedantic", I know.


      It would be more pedantic if it were accurate...

    42. Re:What kind of data? by jedimark · · Score: 2, Funny

      You mean /dev/random? :)

    43. Re:What kind of data? by TheNetAvenger · · Score: 1

      Maybe for all articles we submit from now on, we should add, "And it all runs on Linux"

      Or maybe be creative with these other suggestions.

      "Only on Linux could such amazing technology be created"

      "Linux is the only way this could have ever been possible"

      "And because this article is about one of the biggest discoveries ever, we should note that the researcher might be using Linux for his research"

      "Oh, did I mention Linux?"

      "No they don't use Linux, that is why the Space Aliens landing are not news"

      Be sure to add Linux in some way to everything you submit if you want it to make it on Slashdot. Even if Microsoft does something wonderful, just add. "but as soon as a Linux version is available it will be amazing, until then pretend it don't exist." /smile

    44. Re:What kind of data? by tylernt · · Score: 1

      Hmm, that's a good idea. I like it for files that never get edited. Because, when someone edits a file, it still gets edited for the other departments.

      Still, that gets us one step closer... thanks for the suggestion. :)

      --
      DRM 'manages access' in the same way that a prison 'manages freedom'
    45. Re:What kind of data? by homer_ca · · Score: 1

      It all depends on how much redundancy is in the data, and it's big YMMV. De-duplication is potentially a very powerful technique. Let's say you're backing up an office full of desktop PCs. There's no need to store multiple copies of the OS files. Just store a token referring to a certain file that belongs to Windows or whatever. Back in the mid 90s there was a service called @backup that backed up your PC over a dialup link. It used this exact technique for uploading files of the OS and common applications.

      Actually they're still around: @backup

    46. Re:What kind of data? by DeafByBeheading · · Score: 1
      1.
      I can compress anything you give me by a factor of at least 1 (inclusive of my own output).

      "-1 pedantic", I know.


      It would be more pedantic if it were accurate...

      I'll bite... Why is that inaccurate?
      --
      Telltale Games: Bone, Sam and Max
    47. Re:What kind of data? by tmasssey · · Score: 1
      Funny. I build exactly the same style of systems, just a lot cheaper:

      BackupPC...

    48. Re:What kind of data? by dbIII · · Score: 1

      That makes sense - just like getting huge compression with lossy formats, like the massive amount of compression in my portrait attached below.

    49. Re:What kind of data? by CastrTroy · · Score: 1

      I just tried this. Using bzip2 and gzip, the file ends up bigger when compressed. Any file with enough entropy ends up bigger than the original file. I doubt that any algorithm could achieve good compression of the already compressed files that we are already storing (open office, mp3, jpeg, video, and tons of others). As a general rule, stuff that is compressible (text) doesn't take up so much space anyway, it's the binary data that takes up all the space, and it's also the stuff that doesn't compress well.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    50. Re:What kind of data? by forgetful_ca · · Score: 2, Interesting

      I believe he's going to say something like % lost to overhead like the file name, filesize, index, etc...
      but that would be wrong. Anything compressed by a factor of 1 wouldn't need those things. You would just spit out the original file again with no changes whatsoever. In fact, I have already encoded such a compression algorithm. (and have patented the process. oh, and um, copyrighted it. and stuff.)

      cat my1file > my2file

      ta da!

    51. Re:What kind of data? by DeafByBeheading · · Score: 1
      Hmm... Thanks, but I hope there's more to it than that...

      cat my1file > my2file

      ta da!

      cp is much more efficient at this =)
      --
      Telltale Games: Bone, Sam and Max
    52. Re:What kind of data? by rw2 · · Score: 1

      Because the arbitrary data compression algorithm must add data to the result file in order to distinguish it from a file that has been compressed. Think of it in terms of a header that says "take the rest of this file as a literal".

    53. Re:What kind of data? by moving_comfort · · Score: 1

      Compression isn't really... what this is.

      My company (huge healthcare conglomerate) had Diligent in for a presentation/demo on a non-disclosure basis. (We have oodles*oodles^oodles of data, you see, and our lifecycle requirements are unforgiving.) The CTO guy gave a pretty convincing "here's how we do it without giving away our IP" presentation. It does involve what most people call RDE (Redundant Data Elimination) - they call it "factoring". It seems to get up to 25% across multiple, similar data sets, but the algorithm seems to actually work. The reduction started at regular LZ levels (~2.5) and worked its way up to 20-25%.

      You had to really stream the data into multiple engines to get numbers that matched an average standard-VTL backup throughput level, though, so I'm not sure that the overhead this "factoring" would require would be acceptable on the average Joe's workstation.... But the core stuff involved here is real.

      (aside - Diligent marketing seemed to exist on the bleeding edge of their GA dates, though, so some of these claims may still be kinda vaporous at the present time.)

    54. Re:What kind of data? by pornking · · Score: 1

      It looks like they are doing a type of dictionary encoding, but using a hashtable for the dictionary and keeping it in memory. That way, each hash entry points to a chunk of data, and the actual files consist of hash entries followed by differences.

      I don't know how well it would work on a small scale. I can, however, see where they could get huge compression ratios out of a multi-terabyte backup system. In a big company, there's going to be a lot of redundancy across multiple users' files, and in multiple revisions of any single user's files. There will probably also be a lot of duplicates of compressed or encrypted files. While compressed or encrypted data itself can't be compressed, multiple copies of the same file can certainly be compressed down to not much more than the size of one.

      It's a special case, in that you get 25x compression only when storing massive amounts of multi-user business related data, and I'm sure it won't work for everybody even then, but it's also a useful special case. If I had 50,000 desktop PCs to back up nightly, I would be very interested.

      I always figured Google was doing something like this in GMail.

      --
      pornking
    55. Re:What kind of data? by Anonymous Coward · · Score: 0

      They are not full of crap, they just haven't yet created the de-compressor.

    56. Re:What kind of data? by poopdeville · · Score: 1

      An encrypted file has Kolmogorov complexity on the order of the size of the encryption algorithm plus the Kolmogorov complexity of the original string. Hence, not random.

      --
      After all, I am strangely colored.
    57. Re:What kind of data? by zopf · · Score: 1

      /dev/null

      --
      Did you see the pool? They flipped the bitch!
    58. Re:What kind of data? by locofungus · · Score: 1

      1.
      I can compress anything you give me by a factor of at least 1 (inclusive of my own output).

      "-1 pedantic", I know.

      It would be more pedantic if it were accurate...

      Actually it could be accurate.

      However, if true, the following statement is also true:

      I [can] compress anything [you give me] by a factor of no more than 1 (inclusive of my own output).

      The proof is left as an exercise for the reader.

      At which point the algorithm becomes trivial.

      --
      God said, "div D = rho, div B = 0, curl E = -@B/@t, curl H = J + @D/@t," and there was light.
    59. Re:What kind of data? by mwvdlee · · Score: 1

      In short; it's not revolutionary compression but rather a convenient method to efficiently use backup media.

      Basically it's a large archive (tar.gz) of all the versions of the files.

      I wonder how this compares to ZIP format, which can also store multiple versions of the same file inside a single archive.

      The only real addition of this is that it's splitting the single archive into multiple volumes based on date instead of volume size.

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    60. Re:What kind of data? by ajs318 · · Score: 1

      The compressibility of encrypted data depends on the encryption algorithm.

      Data compression, in any form, basically works by finding a set of rules that can be used to describe things; if the descriptive rules can be expressed more briefly than the things they describe, then the compression works. Non-lossy compression, as used for text and programs, requires that the original data can be recovered exactly; lossy compression, as used for images and sounds, only requires that a good approximation to the original data can be recovered.

      This works well for data that already tend to follow rules. For instance, in a program there will be several words -- function and variable names, and the reserved words -- that occur again and again. In an unencrypted text file, there will be some words, and some letters and punctuation marks, that occur more frequently than others. That gives you a hook on which you can base a compression rule.

      Encrypted data is had by combining the plaintext with a keystream {a sequence of hopefully as nearly random as possible numbers}. If the output of encryption software repeatably compresses well with different plaintexts, that probably indicates that the keystream is following rules simple enough for the compression software to pick up on them ..... and an attacker might be able to deduce what those rules are, and hence recover the plaintext. To be certain that compressibility is not an artefact due to the plaintext, modify the Source Code* of your encryption software to just generate a keystream, without encrypting any plaintext; recompile, and attempt to compress lengths of keystream.

      * If you don't have the source code, then your encryption software is already insecure. If you need to ask why, you don't understand security.

      --
      Je fume. Tu fumes. Nous fûmes!
    61. Re:What kind of data? by MSZ · · Score: 1
      In other words, they're full of crap.


      Long long ago, when 9600 bps modems were considered lightning-fast there were some guys with similar type of claim. They even released the software! Too bad that the software used to compress data with some typical algorithm, write it into hidden file and pretend small file with details where data is hidden was the actual archive.

      Or like this company that promised to make Windoze memory bigger with on-the-fly realtime compression, only to be later found to simply increase swapfile.

      Well, maybe Duke Nukem Forever will be compressed with this new invention to fit on one floppy ;-)
      --
      The moon is not fully subjugated. I demand a second assault wave preceded by a massive nuclear bombardment.
    62. Re:What kind of data? by Jesus_666 · · Score: 1

      It all depends on whether you want the compression to be lossless...

      --
      USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
    63. Re:What kind of data? by bWareiWare.co.uk · · Score: 1

      A system that does 1:1 compression is not a system that can achieve 'by at least 1' as he states.

      Whilst you can pass through incompressible input, any compression system that can ever achieve better than 1:1 is going to have to add at least one bit to incompressible input to flag it as uncompressed.

      Otherwise your arbitrary input may happen to look exactly like a valid compressed file with different contents and the uncompressor would not know what to return.
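
      A small sketch of the flag idea, assuming a whole flag byte (rather than a single bit) and zlib as the underlying compressor. `pack`, `unpack`, and the flag values are invented for illustration.

```python
import os
import zlib

STORED = b"\x00"     # assumed flag: body is the raw input
DEFLATED = b"\x01"   # assumed flag: body is zlib-compressed

def pack(data: bytes) -> bytes:
    """Compress if that helps, otherwise store raw; always prefix a flag."""
    squeezed = zlib.compress(data, 9)
    if len(squeezed) < len(data):
        return DEFLATED + squeezed
    return STORED + data   # incompressible input grows by exactly one byte

def unpack(blob: bytes) -> bytes:
    """Read the flag so raw bytes are never mistaken for compressed ones."""
    flag, body = blob[:1], blob[1:]
    return zlib.decompress(body) if flag == DEFLATED else body

text = b"spam and eggs " * 500
noise = os.urandom(4096)   # stands in for already-compressed or encrypted data

for sample in (b"", text, noise):
    assert unpack(pack(sample)) == sample
    print(len(sample), "->", len(pack(sample)))
```

      The worst case is bounded at one extra byte, but it is never zero, which is the point being made: once some inputs shrink, others must grow.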

    64. Re:What kind of data? by Alioth · · Score: 1

      So - basically they aren't doing anything more innovative than what can already be achieved with rsync and hard links.

    65. Re:What kind of data? by richlv · · Score: 1

      um, i don't want to think into this too much, but maybe lvm2 can help you in some or other way.
      for example, have a single source repository of all data. mount a writable snapshot for each department. this way only changes will take up diskspace.
      of course, this would be bad in long term if you have no chance to sync central repository now and then, thus would work only for relatively short periods of time.

      --
      Rich
    66. Re:What kind of data? by rw2 · · Score: 1

      lol

      ok, fair enough.

    67. Re:What kind of data? by tverbeek · · Score: 1
      I just tried this. Using bzip2 and gzip, the file [of random bytes] ends up bigger when compressed. Any file with enough entropy ends up bigger than the original file.

      You've just demonstrated one of the principles of lossless data compression: in the set of all possible files, for every one that gets smaller with a given algorithm, there's another that gets bigger. The algorithms we use are those that result in typical files getting smaller, and not bloody likely files getting bigger.
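      The counting argument behind that principle can be sketched in a few lines. This is only an illustration of the pigeonhole count for bit strings; the function names are made up for the example.

```python
def strings_of_length(n: int) -> int:
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n


def strings_shorter_than(n: int) -> int:
    """Number of distinct bit strings of length 0 .. n-1."""
    return sum(2 ** k for k in range(n))  # equals 2**n - 1


for n in (1, 8, 16):
    inputs = strings_of_length(n)
    outputs = strings_shorter_than(n)
    print(f"n={n}: {inputs} inputs, {outputs} shorter outputs, "
          f"shortfall {inputs - outputs}")
```

      There is always one fewer shorter string than there are n-bit inputs, so a lossless scheme cannot map every n-bit input to a distinct shorter output.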

      --
      http://alternatives.rzero.com/
    68. Re:What kind of data? by CastrTroy · · Score: 1

      Here's a nice algorithm.

      If (compressedfile.size > originalfile.size) {
              saveUncompressedFile();
      }
      else {
              saveCompressedFile();
      }

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    69. Re:What kind of data? by Anonymous Coward · · Score: 0

      "(Boy, am I dating myself...)"

      Can't get laid either, huh? Welcome to the club.

    70. Re:What kind of data? by arodland · · Score: 1

      "Size of the encryption algorithm"? In what sense exactly?

    71. Re:What kind of data? by DeafByBeheading · · Score: 1

      Ah. But if you are willing to go with literally 1:1 compression (as opposed to at least 1:1), you can ignore headers and just assume that all "compressed" files are just exact copies of the uncompressed files. That's what I (and probably great-grandparent) was talking about. You bring up a good point, though.

      --
      Telltale Games: Bone, Sam and Max
    72. Re:What kind of data? by rw2 · · Score: 1

      Yes, I was indeed thinking of an algorithm that would actually compress data sometimes. As I said in a cousin posting, I agree with what you say in the parent.

    73. Re:What kind of data? by Torne · · Score: 1

      Ah, but actually that still makes the not-compressible files grow a little - you have to include at least one extra bit of information to indicate whether the rest of the file is compressed or not ;)
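      A minimal sketch of that wrapper with the extra marker made explicit. It assumes zlib as the underlying compressor and spends a whole byte on the flag rather than a single bit; `RAW`, `PACKED`, `pack` and `unpack` are names invented for the example.

```python
import os
import zlib

RAW, PACKED = b"\x00", b"\x01"  # hypothetical one-byte markers


def pack(data: bytes) -> bytes:
    """Store whichever form is smaller, prefixed with a marker byte."""
    squeezed = zlib.compress(data, 9)
    if len(squeezed) < len(data):
        return PACKED + squeezed
    return RAW + data


def unpack(blob: bytes) -> bytes:
    """Invert pack() by checking the marker byte."""
    marker, body = blob[:1], blob[1:]
    return zlib.decompress(body) if marker == PACKED else body


text = b"redundant " * 1000
noise = os.urandom(1000)
print(len(text), "->", len(pack(text)))
print(len(noise), "->", len(pack(noise)))  # incompressible input grows by the marker
```

      Without the marker, `unpack` could not tell a stored-as-is file from a compressed one, which is the ambiguity pointed out earlier in the thread.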

    74. Re:What kind of data? by tverbeek · · Score: 1

      Nice try, but that's not a compression algorithm. It's a wrapper for deciding whether to apply a compression algorithm.

      --
      http://alternatives.rzero.com/
    75. Re:What kind of data? by poopdeville · · Score: 1

      Bit-length of a sane implementation should be good enough. There's a specific definition in terms of Turing machines, but as far as I know, it is only used to establish that the concept is well-defined.

      --
      After all, I am strangely colored.
    76. Re:What kind of data? by forgetful_ca · · Score: 1

      What flag? it's all useable the way it is, ergo there IS no uncompressor.

  2. Breaking news! by ivan256 · · Score: 3, Insightful

    Company breaks Shannon Limit. Debunking at 11!

    Seriously though. Gzip can compress down to 98%... if your data is mostly redundant. The chance that they're doing this on the random data they claim in the article is nil.

    1. Re:Breaking news! by Anonymous Coward · · Score: 0

      According to the article, their claim is compressing typical data 25x, while normal compression algorithms get 2x. I don't know what algorithms are "normal", but what
      I commonly use gets 3x-5x on most datasets (3x was low end, for stripped executables,
      5x was for man pages). Back when I was still using floppies, I sometimes got nearly
      6mb out of them.

    2. Re:Breaking news! by zalas · · Score: 1

      This reminds me of the various funny antics people try to claim in comp.compression (for instance, being able to compress everything to something smaller and totally ignoring the pigeon-hole principle) and also of the recent Euclid Discoveries's claim regarding superior quality parametric encoding of video.

    3. Re:Breaking news! by nizo · · Score: 4, Funny

      Maybe it is lossy compression, which would be really nice when compressing executables and old spreadsheets.

    4. Re:Breaking news! by alexhs · · Score: 1

      Exactly. And as it is targeted to large data centers, I wonder if they didn't implement some sort of sparse files. Compressing large chunks of 0's sure gives you impressive compression ratios...

      --
      I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
    5. Re:Breaking news! by Austerity+Empowers · · Score: 5, Interesting

      His point is that the Shannon limit provides a mathematical upper bound for how good a lossless compression algorithm can be for arbitrary data sets. gzip gets 98% of that maximum bound, so any algorithm that claims to be 12x that is either not lossless, or not generic. Gzip etc. are all based on several related algorithms known generally as "entropy coders" (http://en.wikipedia.org/wiki/Entropy_coding).

      Lossy compression and compression of particular data sets do not have to obey this. With lossy compression you can compress down as far as you can tolerate.

      Coding particular sets gets some extra compression by coding some of the data in the compress/decompress utility. For example if all your files have a 1MB standard header and 1KB of data, you can omit the 1MB of header because it's always there, and just send the 1KB of data! Truly amazing compression! Of course it only works under those conditions.
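      The standard-header trick above can be sketched like this. The `HEADER` constant and the `encode`/`decode` names are hypothetical; the point is only that the shared bytes live in the tool instead of in each stored file.

```python
HEADER = b"STANDARD-HEADER" * 1000   # hypothetical boilerplate every file shares


def encode(file_bytes: bytes) -> bytes:
    """Drop the shared header; only valid for files that start with it."""
    if not file_bytes.startswith(HEADER):
        raise ValueError("input does not carry the expected header")
    return file_bytes[len(HEADER):]


def decode(payload: bytes) -> bytes:
    """Put the shared header back in front of the payload."""
    return HEADER + payload


original = HEADER + b"the only part that differs between files"
stored = encode(original)
print(f"{len(original)} bytes stored as {len(stored)} bytes")
assert decode(stored) == original
```

      Feed it a file without the header and it refuses, which is the "only works under those conditions" part.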

    6. Re:Breaking news! by jthill · · Score: 3, Insightful

      If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    7. Re:Breaking news! by vleo · · Score: 1

      You can not compress random data at all. Just try this (under Linux) at home:

      [vleo@hydra vleo]$ dd if=/dev/random of=x bs=1 count=1000
      1000+0 records in
      1000+0 records out
      [vleo@hydra vleo]$ ls -l x
      -rw-rw-r-- 1 vleo vleo 1000 Apr 6 01:39 x
      [vleo@hydra vleo]$ gzip x
      [vleo@hydra vleo]$ ls -l x.gz
      -rw-rw-r-- 1 vleo vleo 1025 Apr 6 01:39 x.gz

      i.e. you CAN NOT compress random data.

      --
      Vassili Leonov ...it is the actions that affect us, not the motive...RMS
    8. Re:Breaking news! by mdielmann · · Score: 1

      It's also worth noting that certain file formats have very redundant information. It's not unusual to compress a 24bpp bitmap file to 10%, similarly for text files. I've also seen many databases compress to 20% but, again, very redundant data. Some formats don't compress much at all, especially those that have some kind of compression built in.

      --
      Sure I'm paranoid, but am I paranoid enough?
    9. Re:Breaking news! by Savantissimo · · Score: 2, Insightful

      Your example can be compressed to the minimal algorithm for the pseudorandom number generator you used plus the seed it used to produce your data.

      --
      "Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
    10. Re:Breaking news! by Anonymous Coward · · Score: 0

      A little more than that. /dev/random seems to take in timing details from my mouse movements and my keyboard presses, to judge from its output while I'm typing this. So throw in the timing variance from one (1) human brain listening to the radio while typing, and being distracted by a cat clambering on to the desktop. Compress that algorithm!

    11. Re:Breaking news! by Raindance · · Score: 1

      I'm not sure the Shannon Limit need be broken here.

      Think about how much redundancy is in even a single standard MP3 file, for instance-- the individual frames are compressed (including huffman compression) but *none of the musical similarities between each MP3 frame are used to compress the file*. That's a lot of unused order.

      Theoretically, compressing the first two minutes of an MP3 should result in a much smaller filesize than compressing the first minute and second minute separately. With GZ it doesn't. GZ is definitely not the final word in compression.

      That said, I don't expect this company to ship any product.

      p.s. If I have my facts wrong, I'd love to be corrected. It's been a little while since I looked into MP3 and GZ.

    12. Re:Breaking news! by Thundersnatch · · Score: 1
      If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?

      Illustrating that the grandparent knows very little about gzip and data compression in general.

    13. Re:Breaking news! by evilviper · · Score: 2, Interesting
      If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?

      Despite the obvious answer (he's simply wrong), 7zip is somewhat "cheating" in this 3-way comparison, as it uses a much, much, much larger block-size (memory). You can set it to use hundreds of MBs of RAM, whereas gzip is limited to a 32KB window and bzip2 to 900KB blocks.
      .

      Off-topic Rant:
      I was actually quite impressed with 7zip and its lzma/ppmd compression methods when I first saw it compressing better than bzip2. However, once the novelty wore off, I began to realize it just takes far too much memory. There is no possible chance of using them on an embedded system, a handheld computer, or even just a fairly old PC with less than around 64MBs of RAM (or much higher, depending on requested block-size). It also takes a serious amount of extra time over gzip/bzip2, while being only a trivial compression improvement in the large majority of cases. The exceptional cases are... neat... but they don't make LZMA/PPMd practical for normal use.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    14. Re:Breaking news! by vleo · · Score: 1

      My point was that truly random data can not be compressed. The details of /dev/random implementation are irrelevant, but my example still confirms my point. And it's not really "my point", it's compression 101.

      --
      Vassili Leonov ...it is the actions that affect us, not the motive...RMS
    15. Re:Breaking news! by smallpaul · · Score: 1

      His point is that the Shannon limit provides a mathematical upper bound for how good a lossless compression algorithm can be for arbitrary data sets. gzip gets 98% of that maximum bound, so any algorithm that claims to be 12x that is either not lossless, or not generic. Gzip etc. are all based on several related algorithms known generally as "entropy coders"

      I think you mean to say that gzip gets 98% of the efficiency of an algorithm based upon entropy coding. You can imagine data sets that are resistant to that technique but very friendly to some other technique. For example, you can imagine a very efficient encoding of "the prime numbers from one to ten thousand." In fact, the phrase IS a very efficient encoding of those numbers (compared to gzip).

      I guess what I'm trying to get at is that even gzip is not really generic. It just depends on particular patterns of redundancy that happen to be frequent in the data sets most people work with most of the time. Wouldn't it be accurate to say that given truly arbitrary (random) input, gzip increases the length of most strings?
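      A quick way to check the prime-number example, as a sketch only. It assumes the primes are written out as space-separated decimal text and uses zlib (the same deflate algorithm gzip uses); the `primes_up_to` helper is written for this example.

```python
import zlib


def primes_up_to(limit: int) -> list[int]:
    """Sieve of Eratosthenes: all primes <= limit."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # Clear every multiple of i starting at i*i.
            sieve[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]


description = b"the prime numbers from one to ten thousand"
listing = " ".join(map(str, primes_up_to(10_000))).encode()
deflated = zlib.compress(listing, 9)

print(len(listing), "bytes as a plain listing")
print(len(deflated), "bytes after zlib")
print(len(description), "bytes as an English description")
```

      The description only counts as an "encoding" if the decoder already knows how to generate primes, which is the same hidden cost as storing part of the data inside the decompressor.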

    16. Re:Breaking news! by x2A · · Score: 2, Funny

      If only they had this a few years ago during the enron mess, they could have claimed "we didn't fiddle the accounts, we just saved it using lossy compression techniques".

      Just like the "our intelligence wasn't wrong about Saddam having WMDs, the satellite images just come to us as lossy JPEGs"

      (the point of this post lost due to compression)

      --
      The revolution will not be televised... but it will have a page on Wikipedia
    17. Re:Breaking news! by MickLinux · · Score: 1

      Let's see... I think lossy compression would be just fine.

      !Wrksht1.xls
      > Microsoft Excel file: First one. Run Huffman compression algorithm with Excel-
      > base modified compression tree. ...78% compression.

      !Johnsales.xls
      > Microsoft Excel file... we already have one of those, mark as redundant and delete.
      >

      ! Lovelettr.doc
      > Microsoft Word file: Part of office. Make a note Excel-> Word, and delete.
      >

      That said, I'm used to using Word98, which was famous for grinding up longer documents, chewing them up, and spitting them out (recursively!) in infinite loops. So I'm already used to this.

      ! Word.EXE
      > Hmmm. Inherently redundant. Delete with prejudice.
      >

      Sorry. I'm displaying my prejudice. But their failure to follow through on purchased customer support, claiming that nothing was happening, literally cost me thousands in direct losses, and more in lost contracts. Total loss, tens of thousands.

      > Burp! All files compressed, for a total loss of: 96%.

      --
      Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
    18. Re:Breaking news! by Anonymous Coward · · Score: 0

      LZMA compresses 30% smaller than bzip2 for tarballs of source code, even when set up to only use a one-megabyte table. LZMA is fine for embedded systems; you just have to make the table a little smaller at the cost of a slightly worse compression ratio.

      LZMA is the reason the Firefox Windows download is half the size of the Firefox Linux download.

      LZMA's only problem is that its compression speed is a good deal slower than bzip2. Its decompression speed, however, is quite a bit faster.

      I'm surprised you didn't like LZMA; what were you trying to compress that didn't compress well with a 20-bit (one-megabyte) table?

    19. Re:Breaking news! by tonyr60 · · Score: 1

      "My point was that trully random data can not be compressed"

      Bollocks. This string "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" is just as truly random as any other random stream of characters. Repeating strings are just as likely to turn up in a random stream as any specific non-repeating stream.

      However you could accurately state that truly random data probably cannot be compressed.

    20. Re:Breaking news! by Achromatic1978 · · Score: 1

      Damnit! I knew I should have moved to 3rd Normal Form!

    21. Re:Breaking news! by Anonymous Coward · · Score: 0

      They WERE using random data. Problem is their random number generator randomly gave all 0's. That's the problem with random data, you can never be sure...

    22. Re:Breaking news! by Magnus+Reftel · · Score: 1
      If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?
      Pretty much the same thing, except that they are more suited to common cases (and thus are worse at trying to compress random garbage).
      --
      print "Yet another p{erl,ython} hacker\n",
    23. Re:Breaking news! by rew · · Score: 1

      The '98%' was his way of saying "comes a long way in the right direction". In practice, if bzip2 gets you an additional 10%, the Shannon limit might be 20% from gzip, or 10% from bzip2.

      Of course, gzip and bzip work on a byte-level. Suppose the data is much more regular on a 9bits-per-word level. In that case, any reasonable compression program will fall a factor of 8 short of the theoretical limit. Recompile gzip for 9bit words, and voila!

    24. Re:Breaking news! by jthill · · Score: 1
      The guy tripped my BS detector. Even if 98% turns out to be accurate considering all *possible* inputs, it's so misleading that "misleading" really doesn't cover it. And I still don't believe the 98%. Maybe the Huffman coding imprecision costs 2%, but that's not all there is to it by a long shot.

      Unless you mean "pretty much the same" in the same way as that guy who pointed out that all human languages would look "pretty much the same" to a space alien, I don't get how you put the bwt in with the count-and-distance coders.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    25. Re:Breaking news! by 1u3hr · · Score: 1
      Seriously though. Gzip can compress down to 98%... if your data is mostly redundant. The chance that they're doing this on the random data they claim in the article is nil.

      "They" don't claim that. The random blogger that Slashdot linked to might have. See their docs. Basically they're talking about making a series of backups, and being clever about finding common factors between sets.

    26. Re:Breaking news! by Anonymous Coward · · Score: 1, Insightful

      Er, no.

      gzip has a 32KB sliding window (so it can see up to 32KB of the file at once to look for redundancies). It uses the deflate algorithm, which is basic LZ77 compression coupled with Huffman encoding. The entropy encoding step could be improved with arithmetic coding, because Huffman requires at least 1 bit per output symbol, but arithmetic encoding can represent several symbols with just one bit. PATENTS on arithmetic coding are what stops that from happening in free software. gzip could swap arithmetic coding in place of Huffman coding TODAY and IMMEDIATELY get better compression. It chooses not to do so, to avoid even the HINT of patent infringement.

      bzip2 has a configurable 100KB to 900KB (in steps of 100KB) window, so at most it can see 900KB to look for redundancies. However, it also matches and encodes redundancies in an entirely different way (BWT+MTF+RLE instead of LZ77). The final step is again Huffman, which can't beat arithmetic coding!

      LZMA is once again back to the LZ77 style of redundancy searching. True, its window can potentially be huge (2^28 = 256MB, although the Wikipedia page says this can go to 2^32 = 4GB now), it is doing basically the gzip algorithm, but with much better designed data structures, so it can search for matches much faster. Also, gzip uses a hash-based matching algorithm (it saves memory - it's state-of-the-art for 1991 when most home machines had 1-16MB RAM total) which can miss several potential matches. LZMA uses a trie, which stores ALL potential matches... certainly using more memory, but gives better compression. The REAL enhancement to compression is use of a Range coder (the "Markov" part of the acronym). This is almost IDENTICAL to the banned/patented arithmetic coding, giving the much greater entropy level compression over Huffman. Certainly, there's a legal risk in case some unexpected clause in the arithmetic coding patents covers range coding, but most people don't think so.

      So the basic answer is that gzip/bzip2 refuse to use the potentially patent infringing methods that lzma dares to use. LZMA also uses a more accurate matching algorithm. That's why it's better. Sure, increasing the window size will definitely find more redundancies (if they're there to find), but the bad matching algorithm in deflate means that most of the gains would be lost.

      Remember: gzip/bzip2 are not used for peak data compression! They're used because they're carefully designed to avoid ALL known compression patents, to stick to PUBLIC DOMAIN PRIOR ART. This is done to avoid all lawsuits! That's the design behind them. Best compression is an afterthought.
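      The window-size part of this is easy to check with Python's standard `zlib` (deflate, 32KB window) and `lzma` modules. A sketch only: the 100KB block size is an arbitrary choice, picked to be larger than the deflate window and far smaller than the default LZMA dictionary.

```python
import lzma
import os
import zlib

block = os.urandom(100 * 1024)   # 100 KB of incompressible bytes
data = block + block             # the second half is a copy 100 KB back

deflate_size = len(zlib.compress(data, 9))   # 32 KB window cannot reach the copy
lzma_size = len(lzma.compress(data))         # default dictionary is far larger

print("input:  ", len(data))
print("deflate:", deflate_size)
print("lzma:   ", lzma_size)
```

      Deflate should come out at roughly the full input size, while LZMA should land near the size of one block, because only LZMA can point back far enough to find the repeat.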

    27. Re:Breaking news! by ivan256 · · Score: 1

      Read more closely. You're saying the same thing I did.

    28. Re:Breaking news! by Savantissimo · · Score: 1

      "trully random data can not be compressed"

      Quite right, but "truly random" is an exceptionally slippery concept. Any given apparently-random sequence of sufficient length could turn out to be highly compressible, but by the pigeonhole theorem the same is certainly not true for all sequences.

      --
      "Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
  3. *sniff* by bryanp · · Score: 4, Insightful

    *sniff* *sniff* *sniff*

    I smell ... vapor.

    --
    "An unarmed man can only flee from evil, and evil is not overcome by fleeing from it." Col. Jeff Cooper
    1. Re:*sniff* by Eradicator2k3 · · Score: 0

      Sorry, that was me. Too many bean burritos to go with the cauliflower and hard-boiled eggs I had.

      --
      Mr. T pitied this fool on 27 July 1992.
    2. Re:*sniff* by nogginthenog · · Score: 1

      Smells more like bullshit to me

    3. Re:*sniff* by darkmeridian · · Score: 2

      My bad.

      --
      A NYC lawyer blogs. http://www.chuangblog.com/
    4. Re:*sniff* by Anonymous Coward · · Score: 0

      Where?

    5. Re:*sniff* by D_Gr8_BoB · · Score: 1

      A company called DataDomain makes a very similar product that they claim averages 20:1 compression for backups. It's real, has been shipping for some time, and generally works as advertised. The trick to getting such good compression is in the kind of data you're storing. If you run three backups in a week, the amount of actual changed data each time will be very small. Of course, if you just try to use a DataDomain box or similar as general-purpose storage for your MP3s, you're going to get very limited benefit out of it.
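      A toy version of that kind of de-duplication, to show why repeated backups look so compressible. This is not DataDomain's or anyone else's actual design; it assumes fixed 4096-byte chunks keyed by SHA-256, and it ignores the space taken by the recipes themselves. All names are invented for the example.

```python
import hashlib
import os

CHUNK = 4096  # hypothetical fixed chunk size


def store_backup(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Save unseen chunks in chunk_store; return the recipe of chunk hashes."""
    recipe = []
    for start in range(0, len(data), CHUNK):
        chunk = data[start:start + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # duplicate chunks cost nothing extra
        recipe.append(digest)
    return recipe


def restore(recipe: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Rebuild a backup from its recipe."""
    return b"".join(chunk_store[d] for d in recipe)


chunk_store: dict[str, bytes] = {}
monday = os.urandom(256 * CHUNK)   # 1 MiB that gzip could not shrink
wednesday = monday[:CHUNK] + os.urandom(CHUNK) + monday[2 * CHUNK:]  # one chunk changed
friday = wednesday                 # nothing changed since Wednesday

recipes = [store_backup(b, chunk_store) for b in (monday, wednesday, friday)]
logical = 3 * len(monday)
physical = sum(len(c) for c in chunk_store.values())
print(f"logical {logical} bytes, stored {physical} bytes, ratio {logical / physical:.1f}x")
```

      The ratio grows with every near-identical backup added, and drops to about 1x for data that never repeats, which matches the point about MP3s.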

    6. Re:*sniff* by BigCheese · · Score: 1

      I don't even know why this sort of crap is even here. The last 419 scam I was spammed with was more interesting than this.

      OTOH the comments have been pretty good so it's not a total loss.

      --
      The obscure we see eventually. The completely obvious, it seems, takes longer. - Edward R. Murrow
    7. Re:*sniff* by Anonymous Coward · · Score: 0

      Yep. Everyone should just ignore the article and go back to using 7Zip. It's the world's best all-purpose compression and even better, it actually exists!

    8. Re:*sniff* by roman_mir · · Score: 1

      Snif snif, I smell millions of investors' gold!

  4. Limited application by Locke2005 · · Score: 4, Funny

    Yes, it can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

    --
    I've abandoned my search for truth; now I'm just looking for some useful delusions.
    1. Re:Limited application by Toba82 · · Score: 1

      You made my day. Burning slashdot AND the current story at once.

      --
      I pretend to know more than I really do by mooching off google and wikipedia.
    2. Re:Limited application by Bull999999 · · Score: 5, Funny

      I, too, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

      --
      1f u c4n r34d th1s u r34lly n33d t0 g37 l41d
    3. Re:Limited application by Anonymous Coward · · Score: 0

      Yes, it can compress data to 1/25th of original size... but it only works on slashdot articles

      That must be a pretty inefficient algorithm...

    4. Re:Limited application by LNO · · Score: 3, Funny

      I, as well, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

    5. Re:Limited application by sprag · · Score: 2, Insightful

      Wouldn't you get 1/50th, since it seems like every other story is a dupe?

    6. Re:Limited application by revlayle · · Score: 1

      I also compress... 1/25th /compressed

    7. Re:Limited application by Jason+Scott · · Score: 1

      compress25

    8. Re:Limited application by WilliamSChips · · Score: 1

      1/25

      --
      Please, for the good of Humanity, vote Obama.
    9. Re:Limited application by stupidfoo · · Score: 1

      c25

    10. Re:Limited application by Alien+Being · · Score: 3, Funny

      Wow, *your* algorithm even compresses the moderation!

    11. Re:Limited application by sprag · · Score: 4, Funny

      I, as well, welcome our 1/25th of original size overlords... but it only works on hot grits articles, which are highly compressible due to the large amount of petrified data.

    12. Re:Limited application by Anonymous Coward · · Score: 0

      I, too, can comp...o(oyt@u5ttHttu|ztthw ttDXLptTLupzAtvx8

    13. Re:Limited application by Feanturi · · Score: 1

      I would like to mention that in addition to the others who have voiced their opinion on this subject that likewise, I.. Oh crap.

    14. Re:Limited application by tshak · · Score: 4, Funny

      I, wanting cheap karma, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

      --

      There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips
    15. Re:Limited application by Yvan256 · · Score: 1, Funny
    16. Re:Limited application by slimey_limey · · Score: 1

      0.5 != 1/50
      1/50 == 0.02

    17. Re:Limited application by i+kan+reed · · Score: 1

      shouldn't there be 25 of these posts?

    18. Re:Limited application by networkBoy · · Score: 1

      .

      top that :-)

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    19. Re:Limited application by networkBoy · · Score: 3, Funny

      dude, karma whoring funny comments is approaching the usefulness of this compression algo.
      hate to break it to you this way :-)
      -nB

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    20. Re:Limited application by mclaincausey · · Score: 1

      ditto

      --
      (%i1) factor(777353);
      (%o1) 777353
    21. Re:Limited application by Anonymous Coward · · Score: 0

      .

      top th

    22. Re:Limited application by Bull999999 · · Score: 0

      Besides, if you want to really karma whore, post the following:

      M$ Windoze sucks! (+1 Insightful)
      Linux rocks! (+1 Underrated)
      Bush sucks donkey balls. (+1 Informative)
      Bill Gates is funding a mind ray to turn everyone into Windoze users (+1 Interesting)

      --
      1f u c4n r34d th1s u r34lly n33d t0 g37 l41d
    23. Re:Limited application by painQuin · · Score: 1

      I suspect that 1/25 number takes dupes into account already

      --
      A guilty conscience means at least you've got one.
    24. Re:Limited application by Elad+Alon · · Score: 1

      to

      --
      News for merdes. Shit that matters.
      Ask me about my sig.
    25. Re:Limited application by dotgain · · Score: 1

      Don't encourage them, please.

    26. Re:Limited application by misleb · · Score: 1

      I'd think it would be hard to compress due to all the monkeys hitting random keys. Random data is difficult to impossible to compress.

      matthew

      --
      "THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
    27. Re:Limited application by Anonymous Coward · · Score: 1

      I, anting heap arma, an ompress ata o 1/25th riginal ize... ut t nly orks n lashdot rticles, hich re ighly ompressable ue o he arge mount f edundant ata.

    28. Re:Limited application by saridder · · Score: 1



      topped :)

      --
      --- RFC 1149 Compliant.
    29. Re:Limited application by x2A · · Score: 1

      Damn slashdot post filter gets me!

      So here's a link

      --
      The revolution will not be televised... but it will have a page on Wikipedia
    30. Re:Limited application by complete+loony · · Score: 4, Funny

      I, forgetting that funny doesn't give karma, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    31. Re:Limited application by hublan · · Score: 1

      No, I am Spartacus!

      --
      My spoon is too big.
    32. Re:Limited application by projecto2501 · · Score: 1

      I have discovered a truly remarkable proof for this compression algorithm, which this reply is too small to contain.

    33. Re:Limited application by miro+f · · Score: 1

      ^

      I get 173x data compression, take that, Diligent!

      --
      being vague is almost as cool as doing that other thing...
    34. Re:Limited application by fellip_nectar · · Score: 1

      Mod parent 'Redundant'!

      --
      Worst. Signature. Ever.
    35. Re:Limited application by sprag · · Score: 1

      Nah, it really is 1/50. If you get 1/25 compression not counting dupes, you'd surely get 1/50 with the dupes.

    36. Re:Limited application by IntergalacticWalrus · · Score: 1

      In Soviet Russia, files compress YOU at 1/25th of original size.
      In Korea, only old people use 1/25th of original size compression.

  5. Heard this before by Jordan+Catalano · · Score: 5, Interesting

    Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.

    1. Re:Heard this before by chrismcdirty · · Score: 1

      I don't remember it from the time, but in a post on an inferior "technology" news site last week, there was another bogus compression story in which someone brought up a fractal compression program. But this one moved the original file to a hidden location, and gave you a trojan at the same time! Talk about efficiency!

      --
      It's like sex, except I'm having it!
    2. Re:Heard this before by bmwm3nut · · Score: 1

      yeah, i got a copy of that back in the windows 3.1 days, so it was pre 95. i remember that it was shareware and it said that if you bought the full version it would decompress the file if you deleted the original. i never believed it, and if it did really work, then it'd be around today.

    3. Re:Heard this before by Ex+Machina · · Score: 1

      I seem to remember it being in the TigerDirect catalog! :) some things never change

    4. Re:Heard this before by Basecamp88 · · Score: 1

      Sounds like they reinvented the shortcut and tried to call it compression

    5. Re:Heard this before by chaboud · · Score: 1

      Yes. Didn't BYTE jump on that one and report it?

      That was legendary.

      Someone needs to have their name put on the interval between such occurrences. These sorts of things are only for people who think that "Information Theory" might have been that group with Kurt Harland and James Cassidy.

    6. Re:Heard this before by Uzik2 · · Score: 1

      saw that in 1985 too...

      --
      -- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
    7. Re:Heard this before by Anonymous Coward · · Score: 0

      I think it was called "Windows 95"...

    8. Re:Heard this before by arekq · · Score: 1

      A few kilobytes to store the file location, that's awfully inefficient.

    9. Re:Heard this before by Orgasmatron · · Score: 4, Interesting

      Yup, that was OWS. You actually could delete the original file, but once it got overwritten, or if it wasn't available, you couldn't deOWS it any more.

      Back in the day, I figured out what was going on when I took a disk to another machine, couldn't restore the file. I then tested the disk in the machine I had made the archive on, and it worked fine. It was a good hoax. We all got a good laugh out of it.

      --
      See that "Preview" button?
    10. Re:Heard this before by slobak · · Score: 1

      I actually authored a freeware program that did this back around '95, for fun and as a project to learn a programming language. I am not sure it is what you are referring to, though. It was a simple DOS utility called 'Black Hole'.

      The "compression" was achieved by moving the file to a hidden directory and replacing it with a pointer to the hidden location, plus some random-looking data which was a hash of the file so the user got the impression the file was being compressed somehow. This was good because the user couldn't accidentally delete the real file. Of course, it was bad because .. well .. it didn't actually save you any space, and the file was non-portable.

      Meh.

    11. Re:Heard this before by Fnkmaster · · Score: 1

      Holy crap, I thought of the same thing when I saw this story. It wasn't 1995 when it originally came out, it was earlier though - I believe 1993 (I distinctly recall living in Florida when that happened, and I left in '93). This program made the BBS circuit at the time, and it was commonly used as a vector for viruses as well.

      Clever hoax, took a good 10 minutes to figure out what the heck it was actually doing.

    12. Re:Heard this before by 4D6963 · · Score: 1
      I never heard of such a thing but I remember when I was about 12 backing up my most important files and programs on a single floppy disk by storing aliases to them on it.

      "Wow, look mom! Super Maze Wars only takes 9 kB!!!"

      --
      You just got troll'd!
    13. Re:Heard this before by Anonymous Coward · · Score: 0

      Yeah, I remember the original article in Byte. Too bad I'm on a trip, so I could go and check the paper magazine. Was Byte duped?

    14. Re:Heard this before by miro+f · · Score: 1

      you mean this doesn't work?! damn!

      *throws out boxes of his old floppies*

      --
      being vague is almost as cool as doing that other thing...
  6. Obligatory Beowulf by Illbay · · Score: 0

    Wow, imagine the Beowulf cluster that WON'T be needed to store this!

    --
    Any technology distinguishable from magic is insufficiently advanced.
  7. The proof... by jforest1 · · Score: 5, Funny

    It's true! It compressed my 10GB collection of ASCII PR0N into 1 meg!

    1. Re:The proof... by Dynedain · · Score: 3, Funny

      The ASCII results:

      *

      --
      I'm out of my mind right now, but feel free to leave a message.....
    2. Re:The proof... by bataras · · Score: 1

      From the patent application:

      a. Subject (S), being of normal stature, stands approximately 5 meters back from data (Q).
      b. intrinsic vo-luminosity of matter decreases proportional to the square of the distance
      c. size of data decreases by 25x
      d. profit

  8. Grain of salt by GillBates0 · · Score: 1
    Obviously nothing concrete or released yet so take with the requisite grain of salt.

    Or at least with 1/25th of a grain of salt.

    --
    An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
    1. Re:Grain of salt by Anonymous Coward · · Score: 0

      Shouldn't that be 25x grain of salt?

    2. Re:Grain of salt by slimey_limey · · Score: 1

      Wouldn't that be 24/25ths of a grain?

    3. Re:Grain of salt by GillBates0 · · Score: 1

      No, because 25x compression would reduce the size of a hypothetical grain of salt 1/25 times.

      --
      An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
  9. right. sure. by Doktor+Memory · · Score: 2, Interesting

    Number of companies claiming a breakthrough in compression technology since the release of bzip2: too many to count.

    Number of them which were anything other than complete bullshit: 0

    I'm not holding my breath.

    --

    News for Nerds. Stuff that Matters? Like hell.

  10. hmmn by Bizzeh · · Score: 1, Interesting

    doesn't colinux have 2kb compressed files that open up to around 10gb? since they are just all null files. also, such a compression where you're doing so much is gonna eat into time and cpu usage, and if 1 thing goes wrong in any of it, you lose all that data.

    1. Re:hmmn by chrismcdirty · · Score: 0

      Yeah. You get a 6GB disk image that is basically zeroes over and over again. So that 2KB is describing that it's essentially 6 billion-ish zeroes one after another.

      --
      It's like sex, except I'm having it!
    2. Re:hmmn by xwipeoutx · · Score: 1

      I came across this one day, which is a highly compressed 'demo' - music & 3d rendered video sort of thing. I was impressed.

      Found it here. There's another cool one from that site.

      Note: There's no virii here, but scan it if you don't believe me

  11. This post is sooo full of BS by Khyber · · Score: 1

    They say it will work on anything? Sorry, I don't think so. I can take 2 gigs of straight 0's and compress it into a file with a table and it would only be maybe a kilobyte in size. But, given technology and greed today, I doubt we're breaking the Shannon Limit anytime soon.

    --
    Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
  12. /dev/zero ? by slimey_limey · · Score: 5, Funny

    dd if=/dev/zero bs=1m count=1m | lzop - | gzip -f -| gzip -f - | gzip -f - | wc

    gives about three kilobytes for a terabyte of data.
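
    For anyone without a spare terabyte of patience, here is a scaled-down sketch of the same trick in Python. It is only an illustration: it uses zlib twice instead of the lzop and gzip chain above, and 16 MiB of zeros instead of a terabyte, so the exact sizes will differ from the pipeline's.

```python
import zlib

# 16 MiB of zero bytes stands in for the terabyte from /dev/zero.
data = bytes(16 * 1024 * 1024)

# First pass: deflate tops out at roughly 1000:1, so kilobytes remain.
once = zlib.compress(data, 9)

# Second pass: the first pass's output is itself very repetitive,
# so compressing it again shrinks it further.
twice = zlib.compress(once, 9)

print(len(data), len(once), len(twice))
```

    The second pass only helps because the input is pathological. On ordinary data, recompressing compressed output gains nothing.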

    1. Re:/dev/zero ? by tigersha · · Score: 1

      What a waste of all the millions of fine engineering man-hours spent at AMD and Intel...

      --
      The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
    2. Re:/dev/zero ? by gkhan1 · · Score: 1

      It's even less than that since you include the algorithm in the mix. Assuming the algorithm is not part of the storage, one could easily construct a compression algorithm capable of storing 2 gb in one bit.

      if bit=0, then the data is the entire World of Warcraft version 1.10
      else if bit=1 then the data is Hitchcock's Vertigo, Xvid format
      else it's simply the data as it is.

    3. Re:/dev/zero ? by Pieroxy · · Score: 1

      I guess that if you have three choices (1, 0, or else) you will need more than one bit to store it!

    4. Re:/dev/zero ? by Anonymous Coward · · Score: 0
      I guess that if you have three choices (1, 0, or else) you will need more than one bit to store it!


      well, three different values...

      a tit ?
    5. Re:/dev/zero ? by Anonymous Coward · · Score: 0

      if bit=0, then the data is the entire World of Warcraft

      Yup, that's pretty much what I found too.

      Try Guild Wars instead. I found it awesome after my experience of WoW.

      No doubt this is related to the fact that a pile of Blizzard guys left and founded ArenaNet (now part of NCsoft), so it's just natural progression.

    6. Re:/dev/zero ? by Peter+Cooper · · Score: 1

      Not really. A file of 1 bit in length with that bit set to 1 = first choice.. File of 1 bit with bit set to 0 = second choice.. File of more than 1 bit = third choice :)

      That said, I'm not sure any file system supports bit level indexing and storage :)

    7. Re:/dev/zero ? by l33td00d42 · · Score: 1
      how about "unix executable" compression? ;)

      me@host:~$ cat << EOF | wc -c
      > #! /bin/sh
      > dd if=/dev/zero bs=1M count=1M
      > EOF
      42

      sweeet.

    8. Re:/dev/zero ? by Pieroxy · · Score: 1

      So how exactly do you store a file that would be 1 bit long? You can't.

      No matter what, you can't store three values inside a unique bit.

    9. Re:/dev/zero ? by toddestan · · Score: 1

      What a waste of all the millions of fine engineering man-hours spent at AMD and Intel...

      But think of all the disk space saved!

    10. Re:/dev/zero ? by Omniscient+Ferret · · Score: 1

      Bah. I piped that through bzip2 -9 twice, & got it down to 248 bytes!

    11. Re:/dev/zero ? by gkhan1 · · Score: 1

      The data is one bit long; when we talk about algorithms we don't relate it to such ugly things as real world filesystems, it all lives in the wonderful world of pseudocode.

  13. Good luck to them! by Anonymous Coward · · Score: 0

    Don't really mind them presenting this, with a little luck they may even get funded. I recall an issue in Holland where we had our "Internet guru" Roel Pieper who invested massively in a compression patent allowing a movie the size of Star Wars Ep. 3 to be compressed onto one 1.44 MB floppy disk. Of course this played out a few years ago, the algorithm was never mind strangely enough. Mr. Pieper still believes in the idea.

    I'm not claiming that this story is bull, I'm only saying that they're absolutely right to present it at this stage.

  14. Currently.. by Douglas+Simmons · · Score: 1

    25 times what? A 25th of the original file? Does it matter if it's already compressed or is it the same on anything? How does bzip stack up on a text file, yo?

  15. Incomplete Article Summary by bigtallmofo · · Score: 5, Funny

    The summary should have read...

    StorageMojo is reporting that a company named Practical Nano Cold Fusion Duke Nukem Forever at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital...

    --
    I'm a big tall mofo.
    1. Re:Incomplete Article Summary by chrismcdirty · · Score: 0

      You forgot that the first part of their name is Infinium

      --
      It's like sex, except I'm having it!
    2. Re:Incomplete Article Summary by pegr · · Score: 1

      The site, StorageMojo, looks pretty bogus to me as well. I just ran through every article they have. They're all from the same person, average six or so per year, and maybe go three pages for all of them. The comments for this /. article will be bigger than the entire site...

    3. Re:Incomplete Article Summary by master_p · · Score: 1

      Offtopic, but having seen that DNF has become the laughing stock of the industry, I wonder how much 3D Realms has realised that. Which is purely a shame, because Duke Nukem 3D was one of the best games ever in terms of content and gameplay in the FPS genre.

  16. Dubious by pilkul · · Score: 4, Insightful

    Stuff like new compression algorithms generally comes out in academic papers, which are then applied in practice by regular programmers. That's what happened with the Burrows-Wheeler algorithm at the core of bzip2. Some company concerned with mostly implementation rather than theory wouldn't come up with a revolutionary advance. The writeup is very vague, but it sounds to me like they're just using a simple LZ type algorithm, and they're only claiming 25x compression if the data is mostly the same already. Well duh.

    1. Re:Dubious by glassware · · Score: 1

      Sounds to me like they're a backup company, and they're achieving 25:1 when backing up Windows servers by skipping all the redundant DLLs. Sounds like the author of this article mistook a real company with ridiculous claims about their backup performance for a magic new algorithm.

    2. Re:Dubious by tcopeland · · Score: 1

      > That's what happened with the Burrows-Wheeler algorithm

      The Burrows Wheeler Transform is very cool indeed. Brian Ewins used it to make the PMD duplicate code detector much much faster.

    3. Re:Dubious by nogginthenog · · Score: 1

      It's amazing how many people forget that Huffman encoding is the basis for pretty much all compression techniques. e.g. LZ encoding (and variants) generally use Huffman encoding to compress the dictionary. Huffman dreamt up his technique in the 50's.

    4. Re:Dubious by Helios1182 · · Score: 1

      BZip2 uses Huffman Encoding, Move to Front Encoding, and the Burrows-Wheeler Transform. So Huffman does get a lot of credit.
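
      Since Huffman coding keeps coming up, here is a minimal sketch of the idea in Python: frequent symbols get short codes, rare ones get long codes. The function name and the sample string are made up for illustration, and it only computes code lengths, not the bit patterns that a real codec like the one inside bzip2 would emit.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {byte value: code length in bits} for a Huffman code over data."""
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tiebreaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them
        # ends up one level deeper in the tree.
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

text = b"abracadabra"
lengths = huffman_code_lengths(text)
encoded_bits = sum(lengths[b] for b in text)
print(encoded_bits, "bits instead of", 8 * len(text))
```

      The common letter "a" gets a 1-bit code here while the rarer letters get longer ones, which is the whole trick.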

    5. Re:Dubious by pornking · · Score: 1

      If they are taking into account redundancy in all files on all the computers being backed up, and doing so transparently, then that's a huge win. I could easily see 25:1 even without any new algorithms. Imagine skipping more than just redundant DLLs. If Bob writes a report and sends it to 30 coworkers who each make revisions and send them back, followed by Bob integrating those revisions back into the document and forwarding the final result to an additional 300 people, it doesn't need to take up much more space than the size of the original document plus the difference between it and the final version. Sure you can use file servers to manage some of that redundancy, but then you have a manual process that can never be anywhere near as efficient.

      However, backing up all the redundant DLLs would also be done, and it suggests one additional benefit. It now becomes possible to back up the entire contents of an arbitrarily large number of machines for very little more space than just backing up all the user files. You then have the ability to restore any one of a very large number of machines to the state it was in during any incremental backup you still have around.
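
      The Bob scenario can be sketched as a toy content-addressed store in Python. Everything here (the class name, the 4 KB fixed chunk size, the fake report) is an assumption for illustration, not what the vendor does. Real products tend to use content-defined chunk boundaries so that an insertion in the middle of a file does not shift every later chunk.

```python
import hashlib

class DedupStore:
    """Toy store: identical chunks are kept once, files are lists of hashes."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # SHA-256 digest -> chunk bytes
        self.files = {}    # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(digest, chunk)   # store only if new
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
report = b"".join(b"paragraph %06d of Bob's report\n" % i for i in range(10000))
for user in range(30):
    # Each coworker's copy differs only by a short note at the end.
    store.put("copy_%d.doc" % user, report + b"note from user %d" % user)

logical = sum(len(store.get(name)) for name in store.files)
ratio = logical / store.stored_bytes()
print(round(ratio, 1))
```

      With 30 near-identical copies the ratio lands above 20:1 without any clever algorithm at all, which is the point being made above.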

      --
      pornking
  17. sounds like a O(n^n^n) problem. by Ancient_Hacker · · Score: 4, Interesting
    Couple "issues":
    • The cost of disk space versus the cost in computer time in finding all the matching substrings. Disk space gets bigger a whole lot faster and easier than CPUs speed up, so even if this idea is economically feasible today, it can only get worse from here.
    • This scheme may work just swell with some data streams, but probably pathologically awful with others. A good example: a billion empty records in a database might be compressed to a very few bytes. The system operator relaxes, and lets a log file fill up the rest of the disk. Then a bunch of database records need to be added, or the existing records need some sequential numbering added and guess what? There's no space for the new records, or to expand the existing ones. Argh.
  18. dd if=/dev/urandom of=file bs=10MB count=1 by vlad_petric · · Score: 1

    compress that :)
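
    The same challenge in Python, as a quick sketch: zlib stands in for whatever compressor you like, and 1 MiB from os.urandom stands in for the 10 MB file.

```python
import os
import zlib

data = os.urandom(1024 * 1024)     # 1 MiB of random bytes
packed = zlib.compress(data, 9)    # best effort at maximum level

# Random input has no redundancy to exploit, so the "compressed"
# result comes out no smaller than the input in practice.
print(len(data), len(packed))
```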

    --

    The Raven

    1. Re:dd if=/dev/urandom of=file bs=10MB count=1 by dotgain · · Score: 1

      Maybe if the compressor thinks it's random data, it just records that fact, and when it's time to decompress, it just cats random data back at you again.

    2. Re:dd if=/dev/urandom of=file bs=10MB count=1 by tsm_sf · · Score: 1

      Just make sure you use the --smartass flag and you should be good.

      --
      Literalism isn't a form of humor, it's you being irritating.
    3. Re:dd if=/dev/urandom of=file bs=10MB count=1 by Anonymous Coward · · Score: 0
      Ok, here's the compressed file:
      dd if=/dev/urandom of=file bs=10MB count=1
      But keep in mind that I'm using quantum compression here, so you can only decompress it once. After that the data will change because you observed it.
  19. Pfft, Yeah I know about this by Anonymous Coward · · Score: 0

    Yeah, I've worked with this before. It's just lossy data compression. It eliminates the data you won't care about a couple of weeks from now. If you used it on a collection of your boss's memos, for example, the compression ratio is 100%. Kind of like how lossy audio compression eliminates the parts your ear doesn't care about.

  20. Shame on you, ScuttleMonkey! by RobertB-DC · · Score: 3, Funny

    Posted by ScuttleMonkey on Wed Apr 05, '06 03:23 PM
    from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.


    You would think that an editor called Scuttle Monkey would know that the correct plural of "Monkey" is "Monkeys", not "Monkies".

    "Monkies" would be the plural of "Monkie", which I guess is what you'd call a baby Monk Seal, or if you knew him really well, a resident of a Monastery. "Hey, Monkie, nice robe!"

    Of course, if you were talking to Michael Nesmith, the singular form would be "Monkee". But that's neither here nor there.

    --
    Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
    1. Re:Shame on you, ScuttleMonkey! by Anonymous Coward · · Score: 0

      Quite being a FAG!!! and go home.

    2. Re:Shame on you, ScuttleMonkey! by Carthag · · Score: 1

      It could also be the plural of Monky, perhaps?

    3. Re:Shame on you, ScuttleMonkey! by Valdrax · · Score: 1

      You would think that an editor called Scuttle Monkey would know that the correct plural of "Monkey" is "Monkeys", not "Monkies".

      Preposterous! Clearly, you are in the wrong, and the plural is Monkees.

      --
      If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
  21. No, really, it's true! by dreamchaser · · Score: 1

    Seriously. I hear that they are going to use it with Duke Nukem Forever to fit all the map and texture data onto only 22 DVD's.

  22. A grain of salt? by thewiz · · Score: 1

    Obviously nothing concrete or released yet so take with the requisite grain of salt.

    Actually, I'd say take the news of this "breakthrough" with a Salt Lick.

    I hope it's true, but I'm not holding my breath.

    --
    If "disco" means "I learn" in Latin, does "discothèque" mean "I learn technology"?
    1. Re:A grain of salt? by nizo · · Score: 1

      If you can't afford your own salt lick, you can probably find one just lying around on the ground at a cattle farm. As an added bonus it is probably chock full of growth hormones and random cow medicines, so enjoy eating it!

  23. Calgary / Canterbury corpus? by Spy+der+Mann · · Score: 3, Interesting

    If they can't compress the canterbury corpus or calgary corpus beyond 3X, then it's a SCAM.

    1. Re:Calgary / Canterbury corpus? by Anonymous Coward · · Score: 0

      I bought the "ULTRA9" compression program, that gets 99.3% compression on both the canterbury and calgary corpus. I tried it and it works!

      If you remove the rather large licence key file though, it's about the same as bzip. I guess they revert to a different algorithm if it's an unlicenced version.

    2. Re:Calgary / Canterbury corpus? by MichaelHH · · Score: 1
      Compression CAN do huge volumes

      Calgary/Canterbury corpus will be blown away by what they see here. I can easily do 100x on them without even breaking a sweat!

      I use a variety of means that first actually increase the file at a variety of stages, but make it into an easily reduced format.

      I skew the ratios statistically early, giving me a larger chance of occurrence elsewhere. I use a variety of change outs to make the most likely ratio to occur items into then the best to compress types of data. I also utilize a revolutionary new means to track actual data flow, reducing the size in one region by 25%. In all I can achieve a standardized compression rate of approximately 84% on random binary data, with no loss, and repeatable.

      All the information is on this website http://www.security1.free2host.net/Compress/compressstart.php and will prove without a doubt in the math section the actual capabilities of this code for those who can follow higher end computer based mathematics.

      I even go so far as to say that I can compress the entire world's knowledge to a DVD at worst, and a floppy at best, and I can prove it on this website.

      --
      I am ready for the big jump in life, who will jump with me?
  24. Also, speed. by Anonymous Coward · · Score: 0

    Sounds like an application where you want some speed, which kind of rules out PPM*/PPMd + arithmetic coding -- which is among the best general compressors we have today. As if it needed another good debunking.

  25. Sad truths about data compression. by k.a.f. · · Score: 5, Informative

    1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.

    2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.

    3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.

    Mmmmmmh, salt.
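
    Point 1 is just counting, and it can be checked by brute force for small sizes. A sketch in Python, with n = 10 bits picked arbitrarily:

```python
from itertools import product

def count_strings(length):
    """Number of distinct bit strings of exactly this length."""
    return sum(1 for _ in product("01", repeat=length))

n = 10
inputs = count_strings(n)                           # strings of length n
shorter = sum(count_strings(k) for k in range(n))   # lengths 0 .. n-1

# A lossless compressor must map different inputs to different outputs.
# There are fewer shorter strings than inputs, so at least one input
# of length n cannot get shorter.
print(inputs, shorter)
```

    The same shortfall holds for every n, which is why "compresses anything" is always a red flag.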

    1. Re:Sad truths about data compression. by AgNO3 · · Score: 1

      Did you miss the bit about this being temporal compression? As in each version of a backup has redundant data removed from the file and points back to a previous version that already contains that data. Sounds good to me when in my industry we have TBs of data backed up every day that are often redundant to the previous day's files at many levels. So I have a backup and a redundant backup every day for a given project. Now Tuesday's backup will take up WAY less space because it can point to Monday's backup for all the files that have not really changed. I would have to look into this more but that sounds pretty fucking good to me for the types of data backup my industry uses. (lots of backups during the job then at the end only the final work needs to be archived.)
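
      A rough sketch of "Tuesday points back to Monday" in Python, storing only the blocks that changed. The 4 KB block size and the function names are assumptions for illustration; the article does not say how the product actually finds differences.

```python
BLOCK = 4096

def delta(previous, current):
    """Record the blocks of current that differ from previous."""
    changed = {}
    for i in range(0, len(current), BLOCK):
        block = current[i:i + BLOCK]
        if previous[i:i + BLOCK] != block:
            changed[i // BLOCK] = block
    return {"length": len(current), "blocks": changed}

def restore(previous, d):
    """Rebuild the newer version from the older one plus the delta."""
    out = bytearray(previous[:d["length"]].ljust(d["length"], b"\0"))
    for index, block in d["blocks"].items():
        out[index * BLOCK:index * BLOCK + len(block)] = block
    return bytes(out)

monday = bytes(range(256)) * 4096      # 1 MiB stand-in for Monday's backup
tuesday = bytearray(monday)
tuesday[5000:5010] = b"X" * 10         # one small edit on Tuesday
tuesday = bytes(tuesday)

d = delta(monday, tuesday)
stored = sum(len(b) for b in d["blocks"].values())
print(stored, "bytes stored instead of", len(tuesday))
```

      Note that restore() needs Monday's data intact, so every version in the chain depends on the ones before it.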

      --
      OMG Ponies!!! with Glitter!!!! I miss Pink :-(
    2. Re:Sad truths about data compression. by Anonymous Coward · · Score: 0

      Dear Sir,

      I submit to you my compression algorithm that compresses any input by a constant factor (1).

      dd if=file of=file.compressed

    3. Re:Sad truths about data compression. by Anonymous Coward · · Score: 0

      This is very much not true.

      I have made an algorithm that will compress any stream by a constant factor. I call it NSLC - New Standard Lossless Compression, and it compresses all streams by a factor of 1.

    4. Re:Sad truths about data compression. by zippthorne · · Score: 1

      Incremental backup is not the same as compression.

      --
      Can you be Even More Awesome?!
    5. Re:Sad truths about data compression. by alexmipego · · Score: 1

      1. They don't talk about a constant factor.
      2. While, nowadays, most content is compressed, the compression is based on that content only, not knowing about the other content you have in your system.
      In fact, I've thought about this myself before (and tested with some success, but I'm no compression expert): if you pick a file, compress it, then "resort" it using a random order, there's a probability that you'll find a new order where compressing it again will result in a better overall compression rate. Applied to the article's solution, just think if they take advantage of file fragmentation? Perfect for windows :P but would work on linux too.
      3. There are algorithms that are simultaneously faster and better.

    6. Re:Sad truths about data compression. by Achromatic1978 · · Score: 1
      It's "pretty fucking good" until you discover during a restore that the 192nd tape is corrupted, and as such, you lose the value of all your incremental backups to the point of the previous full. If you did a full back up every Monday, and incremented each day, a corrupt tape in the full backup could see you almost two weeks back.

      Then it's not so fucking good.

    7. Re:Sad truths about data compression. by AgNO3 · · Score: 1

      and that is why you have redundant backups. All of our daily backups right now are raid 50. So uh the chances of losing Monday are like SLIM. Maybe if we backed up to tape but we don't and we back up a couple hundred TB every night.

      --
      OMG Ponies!!! with Glitter!!!! I miss Pink :-(
    8. Re:Sad truths about data compression. by Achromatic1978 · · Score: 1

      Okay... so you'd then keep what, a week's worth of backups. So, we say 8 x say 400TB, 3200TB. Using RAID 50, RAID 50: ( (Number of Drives In Each RAID 5 Set - 1) / Number of Drives In Each RAID 5 Set) ... and, say 147GB SCSI drives. So we're looking at an efficiency of about 60%. So we're looking at a system with 37,000 hard drives, you're saying? That'd be ... umm... pretty impressive.
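
      Spelling out that back-of-envelope estimate in Python, using only the numbers quoted above (8 retained backups of 400TB, 147GB drives, 60% usable capacity) and assuming decimal units:

```python
import math

backups_kept = 8
nightly_tb = 400
drive_gb = 147
efficiency = 0.60      # usable fraction assumed in the estimate above

total_gb = backups_kept * nightly_tb * 1000
drives = math.ceil(total_gb / (drive_gb * efficiency))
print(drives)
```

      That comes out a little over 36,000 drives, the same ballpark as the figure above.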

    9. Re:Sad truths about data compression. by AgNO3 · · Score: 1

      It would be, wouldn't it. I meant GB when I said a few hundred TB, sorry. Each film is a minimum of 6 hours of stored footage; average is probably closer to 8 hours, so 691200 frames on fast storage. If we are only doing 2k that is 12 MB per frame per 10bit cineon. It might be 4k. So for 2k just the raw footage to store on raid 3 is 8TB. How many drives does it take to store 8 TB on raid 3? Ok. Now we take say Sin City where every single shot is a vfx shot and we have 5 versions of every shot (but we only have 90min now because we only work on the edit revisions, so we have maybe 3 versions per scene per edit). OK so for 2k we are storing each edit at 1.5TB so 4.5TB x 5 is another 22.5TB. So for ONE movie, and there are more projects going on than that every day, we have 30.5 TB of data conservatively. When I was working on Sin City at Cafe FX we were also doing Blade Trinity and Flight of the Phoenix. So we have all the 3d frames that get comped into the scanned frames and revisions on top of revisions. 30.5 is probably way way low. Anyway so for all three of those movies we are at minimum talking about 100TB. That is backed up EVERY DAY but I think only 3 days or 2 are kept on drives, then it is put on tape. So uh how many hard drives is that? How many HDs do you think ILM has? or Digital Domain or Rhythm and Hues? Places that are always doing a few movies plus TV spots plus episodic? You can not offline the stuff while you are still editing the show cause you have to have access to it. I don't know, still sounds like this would really really rock. Oh and double those numbers if the film is 4k.
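
      For anyone checking the footage math, a quick sketch in Python. It assumes 24 frames per second (which is what 691200 frames for 8 hours works out to), 12 MB per 2k frame as stated, and decimal terabytes.

```python
FPS = 24             # assumption: implied by 691200 frames in 8 hours
MB_PER_FRAME = 12    # 2k 10-bit Cineon frame, as quoted above

def footage_tb(hours):
    """Return (frame count, size in TB) for this many hours of footage."""
    frames = int(hours * 3600 * FPS)
    return frames, frames * MB_PER_FRAME / 1_000_000

raw_frames, raw_tb = footage_tb(8)        # raw scanned footage
edit_frames, edit_tb = footage_tb(1.5)    # one 90 minute edit
print(raw_frames, raw_tb, edit_frames, edit_tb)
```

      That gives about 8.3 TB for the raw footage and about 1.56 TB per edit, matching the 8TB and 1.5TB figures used above.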

      --
      OMG Ponies!!! with Glitter!!!! I miss Pink :-(
    10. Re:Sad truths about data compression. by Achromatic1978 · · Score: 1
      Aha. :)

      ILM has, apparently (according to a story last week, 170TB at 85% occupancy). As for edit suites, don't worry, we have a couple of Avid setups in our offices. ;-)

      ObPedant: Going from 2K to 4K film quadruples storage requirements, not doubles. ;-)

    11. Re:Sad truths about data compression. by AgNO3 · · Score: 1

      That has to be 170TB of online. Avid is so Toast at NAB unless they have something that no one has ever heard of. Oh can you say online editing of cineons and DPX in a certain other editor that AVID probably really hates for like 1/10th the cost of a nitris. ah just read the story. That is ONLINE storage and it does not say in what configuration that storage is in. http://cgw.pennnet.com/Articles/Article_Display.cfm?Section=ARTCL&ARTICLE_ID=250488&VERSION_NUM=2&p=18 and that story makes me want to try this temporal solution even more. I did not say implement. I said test, try, give it a whirl if it works.

      --
      OMG Ponies!!! with Glitter!!!! I miss Pink :-(
  26. Is this ZeoSync 2.0? by 123abc · · Score: 1

    Did they even get 1.0 working yet?

  27. seems specialized by j1mmy · · Score: 1

    It sounds like the backup volume in this system is essentially a .zip file that you keep stuffing data into. If some copy of the data you're stuffing in is already there, you don't need to store it again. 25x is believable if you're backing up the same data over and over again, I guess.

    1. Re:seems specialized by Uzik2 · · Score: 1

      If it does deduplication at the file level it might achieve some good
      reductions in some offices. Only storing one copy of the spreadsheet
      that's copied 800 times across the entire network by 800 users
      would save a lot of space.

      --
      -- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
  28. You can do better than that. by bigtallmofo · · Score: 1

    I can take 2 gigs of straight 0's and compress it into a file with a table and it would only be maybe a kilobyte in size.

    Without putting much thought into it, I can even do that. 2 gigs of straight 0's with a real-world algorithm pretty easily compresses down to 12 bytes, far fewer than the kilobyte you quote. You could store it in just: 2000000000x0

    Use an abbreviation for 2 billion or other byte-saving tricks and you could compress it down even more.

    I suspect such smoke and mirrors is something similar to what this company has done to achieve their reported compression results.

    --
    I'm a big tall mofo.
    1. Re:You can do better than that. by Directrix1 · · Score: 1

      You must be referring to Run Length Encoding (RLE).
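
      For reference, a minimal run-length encoder in Python. This is a sketch of the general technique, not of anything the company has described, and the function names are made up. The input is scaled down from 2 GB to 2 MB so it runs quickly.

```python
from itertools import groupby

def rle_encode(data):
    """Collapse runs of equal bytes into (count, byte value) pairs."""
    return [(sum(1 for _ in run), value) for value, run in groupby(data)]

def rle_decode(pairs):
    """Expand (count, byte value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)

zeros = bytes(2_000_000)       # two million zero bytes
pairs = rle_encode(zeros)      # a single (count, value) pair
print(pairs)
```

      On data without long runs the pair list ends up bigger than the input, which is why RLE alone is never sold as general-purpose compression.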

      --
      Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
    2. Re:You can do better than that. by Danga · · Score: 1

      You could store it in just: 2000000000x0

      Use an abbreviation for 2 billion or other byte-saving tricks and you could compress it down even more.


      Why the hell would you store it as actual characters and use 12 bytes? If you were going to store it that way at least use hex. I would just store it as 773594000 so that would get it down to 5 bytes (nine hex nibbles) if you eliminate the "x" since that is redundant and you would only be multiplying by 0 or 1 so the first nibble (right to left) could be the multiplier and the remaining nibbles would be the multiplicand. Of course the best compression possible would only use 1 bit being 0.

      --
      Hey, there is only one Return and it's not of the King, it's of the Jedi.
  29. OSHI! by TheRealMindChild · · Score: 1

    de-duplication and calculating and storing only the changes between similar byte streams is apparently the key

    Maybe you want to take a gander at RLE

    --

    "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
  30. 25x compression for something repeated 25 times by demon411 · · Score: 2, Insightful

    Yup, let me just add to others saying that 25x compression is impossible for arbitrary data. It's just an indexing problem: if you have 2 kbyte files (2^16000 possible files, taking 2 kbyte as 2000 bytes) it is impossible to map them all to the (2000/25=) 80 byte files (only 2^640 possible files). Good thing the article talks about what data this applies to...(sarcasm)
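
    The indexing argument can be checked with exact integers in Python. This sketch takes 2 kbyte as 2000 bytes of 8 bits each, which is an assumption about the units intended.

```python
input_bytes = 2000
output_bytes = input_bytes // 25        # size after 25x compression

inputs = 2 ** (8 * input_bytes)         # distinct 2000-byte files
outputs = 2 ** (8 * output_bytes)       # distinct files of the smaller size

# Each compressed file can decode to only one original, so at most
# `outputs` of the `inputs` possible files can survive a 25x squeeze.
shortfall_bits = (inputs // outputs).bit_length() - 1
print(output_bytes, shortfall_bits)
```

    So there are 2^15360 times more possible inputs than outputs. Only a vanishing fraction of arbitrary files can shrink that much, and the ones that do are the redundant ones.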

  31. Where have we heard this one before? by overshoot · · Score: 3, Insightful
    Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.

    A cow-orker asked if it could be used on its own output.

    --
    Lacking <sarcasm> tags, /. substitutes moderation as "Troll."
    1. Re:Where have we heard this one before? by Prospero's+Grue · · Score: 1
      Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.

      Yeah, it's called rm, isn't it? You can even use the flags '-r' for recursive (compress the compression for even more savings) and '-f' for flatten (makes the result occupy even less space than before). Run rm -rf from the root directory and just watch how much disk space frees up. Amazing!

      --
      The opinion above is fiction. Any similarity to real opinions, including facts and logic, is purely coincidental.
    2. Re:Where have we heard this one before? by rev_sanchez · · Score: 1

      The VP then replied, "Oh my God, a talking cow!"

      --
      If you didn't come to party don't bother knocking on my door. Prince '1999'
    3. Re:Where have we heard this one before? by ADRA · · Score: 1

      Nah, I'd rather see this recursion:

      rm -f /bin/rm

      --
      Bye!
    4. Re:Where have we heard this one before? by moochfish · · Score: 1

      Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.

      A cow-orker asked if it could be used on its own output.


      Answer: Sure! But decompressing the data is still under development.

    5. Re:Where have we heard this one before? by Anonymous Coward · · Score: 1, Funny

      Please do not let this person ork any more cows. It's bad for him, and bad for the cow. Just say no to Cow Orking!

    6. Re:Where have we heard this one before? by Ezel · · Score: 1

      Man, I don't ever get modpoints any longer.

      But that was +1 funny in my book :-)

      --
      Prosp long and liver.
  32. I've always imagined this conversation by jfengel · · Score: 5, Funny

    Developers: We've got some really good ideas for reducing backup space by using compression and incremental backups.

    Marketing: How much in the best conceivable case?

    Developers: Oh, I dunno, maybe 25x.

    Marketing: 25x? Is that good?

    Developers: Yeah, I suppose, but the cool stuff is...

    Marketing: Wow! 25x! That's a really big number!

    Developers: Actually, please don't quote me on that. They'll make fun of me on Slashdot if you do. Promise me.

    Marketing: We promise.

    Developers: Thanks. Now, let me show you where the good stuff is...

    Marketing (on phone): Larry? It's me. How big can you print me up a poster that says "25x"?

    1. Re:I've always imagined this conversation by spun · · Score: 2

      Someone please mod this "insightful" as opposed to funny (which it also is.) Does anyone doubt that this is pretty much how it happened?

      Comedian Bill Hicks had the most insightful proposal for marketing types:

      "By the way if anyone here is in advertising or marketing... kill yourself. No, no, no it's just a little thought. I'm just trying to plant seeds. Maybe one day, they'll take root - I don't know. You try, you do what you can. Kill yourself. Seriously though, if you are, do. Aaah, no really, there's no rationalisation for what you do and you are Satan's little helpers, Okay - kill yourself - seriously. You are the ruiner of all things good, seriously.

      No this is not a joke, you're going, "there's going to be a joke coming," there's no fucking joke coming. You are Satan's spawn filling the world with bile and garbage. You are fucked and you are fucking us. Kill yourself. It's the only way to save your fucking soul, kill yourself. Planting seeds. I know all the marketing people are going, "he's doing a joke"... there's no joke here whatsoever. Suck a tail-pipe, fucking hang yourself, borrow a gun from a friend - I don't care how you do it. Rid the world of your evil fucking machinations. I know what all the marketing people are thinking right now too, "Oh, you know what Bill's doing, he's going for that anti-marketing dollar. That's a good market, he's very smart." Oh man, I am not doing that. You fucking evil scumbags! "Ooh, you know what Bill's doing now, he's going for the righteous indignation dollar. That's a big dollar. A lot of people are feeling that indignation. We've done research - huge market. He's doing a good thing." Godammit, I'm not doing that, you scum-bags! Quit putting a godamm dollar sign on every fucking thing on this planet!

      "Ooh, the anger dollar. Huge. Huge in times of recession. Giant market, Bill's very bright to do that." God, I'm just caught in a fucking web! "Ooh the trapped dollar, big dollar, huge dollar. Good market - look at our research. We see that many people feel trapped. If we play to that and then separate them into the trapped dollar..." How do you live like that? And I bet you sleep like fucking babies at night, don't you?"

      --
      - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
    2. Re:I've always imagined this conversation by jthill · · Score: 1

      First time in my life I ever wanted to bookmark a /. post.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
  33. damn people! by rhaig · · Score: 1

    RTFA.

    of course you can do this. Look at datadomain.com.

    they expect 20-80x compression because they're marketing themselves as backup to disk (doing repetitive full backups). you get the same patterns over and over again.

    and whoever posted the RLE wikipedia article, thank you for understanding the solution.

    and no, everything isn't going to compress 25x, but everything will compress some. There are repeated bitstreams in everything. a 64bit string has a finite number of patterns. I don't know how small they chunk it up, but it's believable.

    --
    "We are not tolerant people. We prefer drastically effective solutions"
    1. Re:damn people! by C_Kode · · Score: 1

      they expect 20-80x compression because they're marketing themselves as backup to disk (doing repetitive full backups). you get the same patterns over and over again.

      Hmm, I don't like the thought of all my backups utilizing a single copy of a pattern that happens a million times. Imagine: you have 30 days of backups, and a single pattern occurs 25,000 times between all 30 backups. You get block errors where that single pattern exists on the disk, thereby destroying all 30 backups. Now, I can understand keeping a copy for each backup; that way the loss of a single copy of said pattern only mangles that single backup. Sharing a copy with your entire archive of backups is crazy. (IMHO anyhow)

    2. Re:damn people! by ergo98 · · Score: 1

      and no, everything isn't going to compress 25x, but everything will compress some.

      Completely ridiculous statement. If this is true, then you can infinitely pipe its output into its input, until you're left with a single bit. It's hardly a complex exercise to realize why that's foolish.

      There are repeated bitstreams in everything. a 64bit string has a finite number of patterns. I don't know how small they chunk it up, but it's believable.

      This is the naive foolishness that leads people into believing ridiculous compression claims, buying into them again and again. This sort of "just magically see the repeats" nonsense has been debunked a trillion times, so I'm not going to point it out.

      If, indeed, they use a diff/rtpatch type vector files, then not only is it difficult to believe that they do it efficiently (you ever use one of the binary patching generation tools? They're TERRIBLY resource intensive. I can't imagine trying to do it for an entire system image), but it's not really "compression", per se. Maybe they use a transaction log approach, however many SANs already have that functionality, eliminating the innovative element of it.

    3. Re:damn people! by Maffy · · Score: 1

      and no, everything isn't going to compress 25x, but everything will compress some. There are repeated bitstreams in everything. a 64bit string has a finite number of patterns.

      Yes, a 64-bit string has 2^64 different "patterns". Instead of storing the pattern itself, you could just store an identifier for the pattern, and that would only require log_2(2^64) bits.

      Oh wait...

    4. Re:damn people! by SpecBear · · Score: 1

      But is this really compression?

      From the article: "So if you have 100 GB to back up, their product, Protectier (see name comment above) can turn it into 4GB, something you could burn onto a DVD in a few minutes...The way Diligent achieves it exceptional compression ratio is by comparing all incoming data to the data already arrived. When it finds an incoming stream of bytes similar to an existing series of bytes it compares the two and stores the differences."

      The problem is, I can't restore my data using that DVD if that 4GB relies on data that's been previously stored elsewhere. Am I missing something, or is this just a hyped-up way of marketing incremental backups?

      By my reading, the interesting bit is that the system can do incremental backups more quickly and more reliably than current solutions, but that's not sexy enough. I think this guy nailed it pretty well.

    5. Re:damn people! by Allasard · · Score: 1
      >There are repeated bitstreams in everything. a 64bit string has a finite number of patterns. I don't know how small they chunk it up, but it's believable.

      This is the naive foolishness that leads people into believing ridiculous compression claims, buying into them again and again. This sort of "just magically see the repeats" nonsense has been debunked a trillion times, so I'm not going to point it out.

      I've actually seen either Diligent's or someone-else's similar presentation before, while looking for Virtual Tape Libraries. (maybe Avamar? http://www.avamar.com/products4.asp )
      I was actually quite impressed with the product. I think they use huge hash index tables to do the block comparisons quickly. Avamar's site mentions they use 12k chunks. It was just damn expensive.

      In any case, the 25x number I'm sure is an average they've found. Keep in mind this is for backups. So, they are saying if you backup 1000 PCs with Windows, 98% of the OS never changes from machine-to-machine, and you only have to backup those blocks once.
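
      A rough Python sketch of the hash-index idea, to show where a number like 25x can come from. Everything here is an assumption for illustration: fixed 12 KiB chunks (echoing the 12k figure above), SHA-256 as the chunk key, and a made-up `ChunkStore` class. It is not a description of how Diligent or Avamar actually work.

```python
import hashlib

CHUNK_SIZE = 12 * 1024  # assumed fixed chunk size

class ChunkStore:
    """Toy deduplicating store: each distinct chunk is kept once, keyed by hash."""

    def __init__(self):
        self.chunks = {}    # SHA-256 hex digest -> chunk bytes
        self.backups = {}   # backup name -> ordered list of digests

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if unseen
            recipe.append(digest)
        self.backups[name] = recipe

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.backups[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

# A fake "OS image" made of 40 distinct chunks, shared by every machine.
os_image = b"".join(
    hashlib.sha256(str(i).encode()).digest() * (CHUNK_SIZE // 32)
    for i in range(40)
)

def user_data(n):
    return f"user-{n}".encode() * 100  # small per-machine difference

store = ChunkStore()
logical_bytes = 0
for n in range(25):
    image = os_image + user_data(n)
    store.put(f"pc-{n}", image)
    logical_bytes += len(image)

ratio = logical_bytes / store.stored_bytes()
print(f"logical {logical_bytes} bytes, stored {store.stored_bytes()} bytes, ratio {ratio:.1f}x")
```

      The ratio tracks the number of near-identical machines, not any property of the data itself. The sketch also ignores the size of the index and the per-backup digest lists, which a real system has to pay for.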

  34. What's that smell in the air? Oh yeah, Bullshit. by Senjutsu · · Score: 1

    Further, since the software operates on byte-streams, it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.

    It can compress anything!1111 Even already compressed mp3s and encrypted data, both of which have a high degree of data entropy, and are essentially uncompressible!

    Magical compression for everyone!!

  35. This definitely works by All+Names+Have+Been · · Score: 5, Funny

    I can tell you, this technology definitely works. I've seen them compress random data streams to 1/25th (even 1/30th!!) their size. This works *TODAY*. Coming out real soon now is the software that allows you to decompress your data. This is still in development.

    1. Re:This definitely works by Anonymous Coward · · Score: 0

      You cashed the check before posting, I hope?

    2. Re:This definitely works by sploxx · · Score: 1

      I've seen them compress random data streams to 1/25th (even 1/30th!!) their size.
      I've seen it too, it really works. And they can even put that into a self-extracting executable!!

      #!/bin/sh
      cat /dev/urandom >$0

  36. Re:right. sure. by Anonymous Coward · · Score: 0

    I'm not sure if you imply that bzip2 is actually good at compression, but just in case you were: it is bad, slow and bad compression ratios. Some of the good common programs are 7zip and (win)rar. A benchmark can be found for example in http://www.maximumcompression.com/data/summary_mf.php.

  37. Re:Heard this before - OWS by insitus · · Score: 1

    I can't remember what the full name was, but its initials were OWS.

  38. Pretty lame coverage too... by sarkeizen · · Score: 1

    A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key

    Yes, for that and every other compression system.

  39. Great job Slashdot... by X · · Score: 1

    Sigh, this is nothing more than a non-redundant store. Very similar to stuff already offered by a number of vendors, even Microsoft. The "fast way to know what's already on disk" is just to store hashes of the data in an index. Move along, nothing to see here......

    --
    sigs are a waste of space
  40. Visit the Diligent WebSite and learn.... by sherpajohn · · Score: 4, Informative

    ....I mean jeez. They are not in the file compression business, they are in the "data protection" business. Specifically disk based backup. They make NO claim regarding "data compression" - the 25X claim is explicitly in regards to the disk space required to backup data. What they say is that using their solution can lead to a 25x less disk space requirement for backups. It may involve some new compression algorithms, but appears to be more based on never backing up the same data more than once.

    --

    Going on means going far
    Going far means returning
    1. Re:Visit the Diligent WebSite and learn.... by nogginthenog · · Score: 1

      but appears to be more based on never backing up the same data more than once.

      What a cool idea! I think I'll patent it!

    2. Re:Visit the Diligent WebSite and learn.... by TeknoHog · · Score: 1

      Yeah, this reminds me of rsync which I use for backups and many other things. It's not the same thing, and I guess this technology is better in the way that you can recover any backup, not just the latest. Rsync "simply" keeps two filesystems in sync by transferring only the changes, so if you delete something before backing up, you cannot recover it.

      --
      Escher was the first MC and Giger invented the HR department.
    3. Re:Visit the Diligent WebSite and learn.... by noidentity · · Score: 2, Insightful

      Now that sounds more reasonable. Instead of putting the incremental backup smarts on the client side, put it on the server side. This way the client can use whatever old scheme is handy, perhaps a plain file copy, and let the server sort out the redundancy with data already copied previously. Only the server has to contain the complex algorithms, so there's less of an opportunity for screw-ups.

      That blog entry smells artificial, though. Very calculated. Right about here, I become wary:

      "The way Diligent achieves it exceptional compression ratio is by comparing all incoming data to the data already arrived. When it finds an incoming stream of bytes similar to an existing series of bytes it compares the two and stores the differences. The magic comes in a couple of areas, as near as I can make out given Neville's natural reticence on the "how" of the technology.

      First, one has to be smart about how big the series of bytes before worrying about trying to compess it, since if it's too short there won't be much or any compression. Secondly, the system needs a very fast and efficient method of knowing what is has already received so it can know when it is receiving something similar. And it all has to be optimized to run in-line at data rate speeds on a standard server box -- which runs the cool and reliable Linux OS."

  41. Re:sounds like a O(n^n^n) problem. by ignorant_newbie · · Score: 1

    >The system operator relaxes, and lets a log file fill up the rest of the disk.

    If your logs are on the same partition (let alone _disk_) as your database files, you deserve this kind of fate.

  42. Re:100X - 1000X by irritating+environme · · Score: 3, Informative

    This is completely false. There are fundamental mathematical limits to the amount you can compress data in a lossless format. In fact, each compression format usually has overhead on the file to store the mapping data to decode/decompress it. That overhead+the compressed file is usually less than the original file, until you run the compressor once or twice. Then the file doesn't compress at all, and the compression record overhead actually increases the overall file size.
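
    This is easy to try with Python's standard `zlib` module. A minimal sketch: the sample text, the seeded pseudo-random "noise" and the helper name are all made up for illustration. The repetitive text should shrink a lot on the first pass, while the noise should not shrink at all and just picks up overhead.

```python
import random
import zlib

text = b"All work and no play makes Jack a dull boy.\n" * 500
rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(4096))

def sizes_after_passes(data, passes=4):
    """Compress repeatedly with zlib and record the size after each pass."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes

text_sizes = sizes_after_passes(text)
noise_sizes = sizes_after_passes(noise)
print("text :", text_sizes)
print("noise:", noise_sizes)
```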

    --


    Hey, I'm just your average shit and piss factory.
  43. Results of Search in 1976-present db for: by Rogerborg · · Score: 1
    --
    If you were blocking sigs, you wouldn't have to read this.
    1. Re:Results of Search in 1976-present db for: by Anonymous Coward · · Score: 0

      Stupid git! You also have to search published applications, of which they have 1 (and is quite relevant).

      See USPTO Patent App: 20060059207

  44. TFA by pcosta · · Score: 4, Insightful

    If everybody stopped laughing and actually RTFA, they aren't claiming 25x compression on anything. The algorithm is targeted at data backup, i.e. very large files and works by comparing incoming data patterns to patterns already stored. Looks like a modification of LZH that uses the compressed file as the pattern table. I'm not saying that it works or that is a breakthrough, but they are not claiming impossible lossless compression on anything. It might actually be interesting for the application it was designed for.

    1. Re:TFA by Anonymous Coward · · Score: 0

      "Looks like a modification of LZH that uses the compressed file as the pattern table."
      but that's exactly how LZH works now -- the data stream is a dictionary, and the compressed data is an offset and length to repeat. the data dictionary slides along with the current data stream

      so, highly redundant data with extra bits (like ascii text files) will compress very much, but random pre-compressed data like a gif, jpeg, or mp3 can't compress much more.

      nobody would really seriously claim 25x compression -- that's just magic

  45. To those who're wondering... by TrumpetPower! · · Score: 2, Insightful

    If you're wondering why this is pure bullshit, this might help.

    Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.

    Take two different messages. Compress each. When you decompress them, you have to get two different messages back, right? So you need two different messages in compressed form. If your compressed message uses the same symbolic representation as the uncompressed message--and, since we're talking ones and zeros here with computers, that's exactly the case--then it should quickly be apparent that, for any given length message, there're so many possible permutations of symbols to create a message...and you need exactly that same number of permutations in compressed form to be able to re-create any possible message.

    Compression is handy because we tend to restrict ourselves to a tiny subset of the possible number of messages. If you have a huge library but only ever touch a small handful of books, you only need to carry around the first drawer of the first card cabinet. You can even pretend that the other umpteen hundred drawers don't even exist.

    It's the same with text. You only need six bits to store most of the frequently-used characters in text, but we sometimes use a lot more than just the standard characters so they get written on disk using eight bits each. English doesn't even use every permutation of two-letter words, let alone twenty-letter ones, so there's a lot of wasted space there. You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text will make that kind of a look-up table optimized for written English at the expense of other kinds of data. ``The'' would be in the first drawer of the cabinet, but ``uyazxavzfnnzranghrrt'' wouldn't be listed at all. If you actually wrote ``uyazxavzfnnzranghrrt'' in your document, the compression algorithm would fall back to storing it in its uncompressed form.

    Also, don't overlook the overhead of the data of the algorithm itself. If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good. It's sometimes helpful to think of it in terms of the smallest self-contained program that could create the desired output. An infinite number of threes is easy; just divide 1 by three. Pi is a bit more complex, but only just. The complete works of Shakespeare is going to have a lot more overhead for a pretty short message. And ``uyazxavzfnnzranghrrt'' might even have so much overhead for such a short message that ``compression'' just makes it bigger.

    Cheers,

    b&

    --
    All but God can prove this sentence true.
    1. Re:To those who're wondering... by harrkev · · Score: 1
      ``uyazxavzfnnzranghrrt''
      And now that you have guessed my real name, I shall have to kill you.
      --
      "-1 Troll" is the apparently the same as "-1 I disagree with you."
  46. Reminds me of "fractal compression." by erroneus · · Score: 1

    I remember years ago there was this horrible "joke" program. It claimed to compress files down to some amazingly small sizes. You could "compress" the file, then erase it, and "expand" the compressed file and it seemed to work just fine! It was done by recording the sectors on disk that a file occupied. So yeah, you can delete it and "restore" it... but try emailing that compressed file? Or expanding it a week later!

    The description of the process sounds pretty good, but then again, so too does the medicinal properties of snake oil.

  47. 4000:1 compression by AYeomans · · Score: 1

    There's a very simple way to get much better compression - simply store the SHA-256 hash of every file instead. My average file size is about 126 Kbyte, so that's a 4000:1 compression.

    OK, OK, you still have to store a full version of each file (or a traditionally compressed version). So for a single PC it doesn't make sense. But for an enterprise there are thousands of copies of those Windows OS files, tens or hundreds of those Powerpoint presentations, scatter-gun emails, etc - so why not just store them just once, and replace with the SHA-256 hash for every other version?
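
    A quick check of that arithmetic in Python, plus what the ratio looks like once you count the one full copy that still has to be kept. The `effective_ratio` helper is made up for illustration and assumes every copy is byte-for-byte identical.

```python
import hashlib

avg_file_bytes = 126 * 1024                  # average file size quoted above
digest_bytes = hashlib.sha256().digest_size  # size of one SHA-256 hash

# Ratio if a file could be replaced by its hash alone.
print(avg_file_bytes / digest_bytes)

def effective_ratio(copies):
    """Ratio when one full copy is kept plus one hash reference per copy."""
    logical = copies * avg_file_bytes
    stored = avg_file_bytes + copies * digest_bytes
    return logical / stored

for copies in (1, 10, 1000):
    print(copies, round(effective_ratio(copies), 1))
```

    With a single copy the scheme costs slightly more than it saves; the payoff only shows up as the number of duplicates grows.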

    --
    Andrew Yeomans
  48. I can compress data to 100% its size by mOOzilla · · Score: 1

    I call this the "del" compression algorithm

    1. Re:I can compress data to 100% its size by yoyhed · · Score: 1

      I, too, can compress data to 100% its size. I call it the "I'm going to take a fucking nap and not do a goddamn thing" algorithm ;-)

      --
      WHO NEEDS SHIFT WHEN YOU HAVE CAPSLOCK/ DAMN1
  49. For Christ's sake, Slashdot editors by osgeek · · Score: 1
    Please add "startling data compression" to a list of filters for obviously bullshit articles that have no business even getting attention from Slashdot. The people who submit the articles are either complete suckers or Google AdSense whores. The list should also contain:
    • flying automobile/car
    • holographic data storage
    • Duke Nukem Forever
    • perpetual motion engine
    1. Re:For Christ's sake, Slashdot editors by Anonymous Coward · · Score: 0

      Yes, and editors, please also add "RTFA" and "Look into it before crying foul" to every reply like this one. It's not traditional compression jerky, it's using pointers to data you've already stored once.

  50. Lossless and Reliable? by sbaker · · Score: 1

    It's certainly possible (for some types of data) to perform LOSSY compression down to 25:1 - but this system is a backup system...you don't want lossy compression in a backup system!! So let's assume these guys are talking lossless compression.

    The best current compression algorithms for English text come close to 10:1 lossless compression - so there is hope that their system could do that good.

    Even simple run-length encoding will manage spectacular compression ratios well over 100:1 on images that are diagrams...but they typically manage zero compression at all for most photographs.

    Most notably, if your files have ALREADY been compressed it is unlikely in the extreme that any lossless scheme will encode them further.

    There is mathematical PROOF that you can't losslessly compress a generalized stream of random numbers at all.

    So examining this claim, we have to deduce that - yes, a well implemented scheme using basic known technology would be able to get into the 25:1 range for SOME files. However, we know for sure that it won't get close to 25:1 for files full of essentially random numbers - notably, files already compressed by some other scheme.

    We know then that this cannot be a bold, sweeping claim like "No matter what - you'll get 25:1 compression" - that's simply not possible - and you can prove it using math.. So if that *IS* what they are saying - then we must yell "BULLSHIT".

    However, if instead they are claiming "We can compress a typical PC user's file system by 25:1" - then maybe so. In a community of PC users, there will be lots of copies of the same files on lots of PC's - there will be lots of easy-to-compress text files and images of simple diagrams and such. If every PC has a copy of WORD installed on it - then large compression ratios are possible by merely noting this fact. Perhaps that's enough to overcome the likely 1:1 non-compression of that guy's copy of the first billion digits of PI, all of those ZIP and JPG files that are already well compressed. MAYBE we believe their claim for "typical" situations. However, there are no programs out there that can RELIABLY get better lossless compression than 10:1 for text or better than 2:1 for photos. There has to be an awful lot of easy to compress stuff to counteract the effect of a bunch of large photos and ZIP files. One 1MB JPG file has to be accompanied by 50MB of stuff that can be 50:1 compressed in order to average out to 25:1 overall. So their 'magic' compression scheme would have to be able to compress easy-to-compress files by a factor of maybe 100:1 or more in order to allow room for all of those JPG's and ZIP's.
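
    The averaging is easy to get wrong, so here is the arithmetic as a small Python sketch. The `overall_ratio` helper is made up for illustration; sizes are in MB.

```python
def overall_ratio(parts):
    """parts: list of (original_size_mb, compression_ratio) pairs."""
    original = sum(size for size, _ in parts)
    stored = sum(size / ratio for size, ratio in parts)
    return original / stored

# 1 MB of incompressible JPG plus 50 MB of data that compresses 50:1
print(overall_ratio([(1, 1), (50, 50)]))

# Half the drive incompressible, the other half compressing 100:1
print(overall_ratio([(50, 1), (50, 100)]))
```

    The incompressible part dominates the stored size, which is why a drive that is half JPGs and ZIPs cannot get much past 2:1 overall no matter how well the rest compresses.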

    That's a tall order indeed. I think that even for a typical PC's hard drive, this claim is BS.

    What's for sure is that they are being a little dishonest by not qualifying their claim in some way.

    --
    www.sjbaker.org
    1. Re:Lossless and Reliable? by Anonymous Coward · · Score: 0

      I agree with what you've said, but to add...

      The first million digits of pi can be easily compressed... by noting that a particular file "contains the first million digits of pi".

      That's an extreme example, but the important point is that compression ratios can be greatly expanded by making the compression program know things about what is being compressed. I can nearly recreate my whole hard drive by knowing what version of operating system and applications I've installed. Only a very small fraction of the data is really unique to my installation.

    2. Re:Lossless and Reliable? by sbaker · · Score: 1

      Right - but *my* work PC has almost a terabyte of satellite imagery that's already compressed with a proprietary algorithm...plus maybe 2 or 3 gigs of SuSE Linux and a few hundred megs of other stuff. Even if the compression tool was smart enough to compress the SuSE files down to a handful of bytes that say "Insert SuSE Linux v9.7 here", that's only a third of a percent compression of what's on my PC. Compressing JPEG imagery more than a fraction more than it is already (in a lossless manner) is probably close to impossible.

      So the best algorithm imaginable is unlikely to get more than maybe a 1.5:1 compression out of my PC. It only takes a few users to have similar issues to render claims of a 25:1 compression rate totally impossible to achieve...even with specialised knowledge of the file contents.

      --
      www.sjbaker.org
  51. Re:Heard this before - OWS by CAR912 · · Score: 2, Informative

    This seems good, otherwise Google for "ows compression OR compress OR compressor", and according to this, OWS stands for the author's initials.

    --
    - Move "Sig". For great justice!
  52. And if you run it twice... by Anonymous Coward · · Score: 0

    ... you get 625x compression. Woohoo!!

  53. Re:profound implications by Anonymous Coward · · Score: 0

    You forgot:
    6. Watch me pimp my shitty porn sites.

    Way to go asswipe, heap mockery upon your potential customers, that'll get those hit counters moving!

  54. Shannon Limit? by NormanICE · · Score: 0

    Pardon my ignorance, but what does the Shannon Limit have to do with compression? From WikiPedia, the Shannon Limit describes the maximum bandwidth on a channel, comparing signal to noise, but nothing about compression.

    1. Re:Shannon Limit? by Anonymous Coward · · Score: 0

      Signal is non-redundant data (ie, maximally compressed), the Noise is inefficiency, perhaps redundancy or just random stuff. It's a very simple, brilliant (and short) doctoral thesis. It has everything to do with compression.

  55. How it works? by Timothy+Brownawell · · Score: 1
    From the article and the so-called "datasheet" (It is nothing of the sort.) on their website, this system apparently does 2 things:
    1. Content-addressed storage
    2. Stores diffs instead of full files where possible
    (1) removes duplicate entries, which is good for repeated backups to the same place. (2) is good for storing similar files, *if* you can find them. It sounds almost like their storage is addressed with some form of non-crypto hash that only changes slightly between similar files ("no disk I/O", so they must be able to match things without actually looking at them).

    Overall, it sounds like what they do is very similar to git packs, except they claim to be able to do it without lots of I/O, which claim sounds like the specialized hashes. If this is the case, it'd be good for never-deleted nightly backups to the same disk and for systems with lots of similar files. It would get very good "compression" in those cases compared to dated .tbz files, but it wouldn't be (as) significantly better than other tools designed for that kind of usage.

    Of course, if used to just compress your 100GB movies folder it still wouldn't be able to do much of anything. Implying that it would, as TFA does, sounds totally bogus. I doubt it would be more than 2-3x better than any other compressor designed for what it's being used for. (Supposedly git packs are really good like this, because of the "similar files" thing.)

  56. What's the biggie? by NoMercy · · Score: 1

    So, basically it's compressed incremental backups, since almost every tape drive compresses its data stream before writing it to tape, and almost every backup software offers incremental backups where only what's changed since the last backup gets backed up...

    What's the biggie?

  57. oooo! by temojen · · Score: 1

    Tar/gzipped rcs repository... so original.

  58. Snake Oil Convention by dskoll · · Score: 1

    That's nothing. I can compress a 1-terabyte truly-random one-time pad to one bit. So I can sell you two amazing products: Unbreakable encryption and unbeatable compression.

    (I'd tell you whether the bit is "1" or "0", but then I'd have to kill you.)

  59. Nothing to see here... by Null+Nihils · · Score: 1

    "storing only the changes between similar byte streams"

    "as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set."

    Right, so, this claim is no big deal. This is called delta compression and it has been around for a long time. Online games use this method to compress updates sent to clients based on the previous updates received. So instead of sending kilobytes of info each update, the server sends, oh, about 25x less data. I believe it was Quake III that first used general delta compression for online games.
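
    For anyone who hasn't seen it, a minimal sketch of generic delta encoding in Python, using `difflib.SequenceMatcher` from the standard library. This is only an illustration of the idea, not how Quake III or the product in TFA does it; the function names and the sample "state" strings are made up.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse old[i1:i2]
        elif j2 > j1:
            ops.append(("data", new[j1:j2]))  # bytes not taken from old
    return ops

def apply_delta(old, ops):
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"player x=10 y=20 hp=100 ammo=50"
new = b"player x=11 y=20 hp=100 ammo=49"

delta = make_delta(old, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(delta)
print("literal bytes sent:", literal_bytes, "of", len(new))
```

    The catch is the same one others have pointed out: the delta is useless without the old version it refers to.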

    This is not a novel technique... which means they will get awarded a US patent and start suing willy-nilly.

  60. Re:100X - 1000X by Anonymous Coward · · Score: 0

    Your humor toggle is broken. How the hell did you get modded up?

  61. Shouldn't that be... by august+sun · · Score: 1

    Shouldn't that be 25/ compression?

  62. The actual claim on the website by mypalmike · · Score: 1

    "1. Reduces backup storage capacity 25X or more!"

    So, after you buy their product, you'll have less backup storage capacity than you had before.

    --
    There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
  63. Re:100X - 1000X by Kiaser+Wilhelm+II · · Score: 1

    How is that "humor"? What is the punch line?

    --
    Lord High Crapflooder The Right Honourable Vlad Craig Esther McDavenpherson III
    Destroyer of Mercatur.Net
  64. As Confucius said by caluml · · Score: 1

    As Confucius said: It's easy to achieve high compression ratios by piping to /dev/null. Recovering the data back is the tricky thing.
    Or it might have been someone else.

  65. Here's the nice and easy answer - compressed delta by mbourgon · · Score: 1

    From what I can see, all they're doing is running differential backups. As long as you have the preceding files, you can get it all back. I've done this with database backups, both by using differentials and by manually creating a diff-style file.

    On to the next article.

    --
    "Sometimes a woman is a kind of religion, she can save your soul & set you free from all your sins" - Bad Examples
  66. It probably works; but it is also obvious. by ClarkEvans · · Score: 1

    This technique has been used by e-mail servers for a great many years. You scan an incoming email to a given individual to see if it is already stored (ie, sent to another customer); if so, you don't save the whole email, only the different 'To' and 'Date' lines. It's simple, trivial, and very effective. It also isn't new.
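
    A minimal sketch of that single-instance idea, assuming the message body is what gets shared and the per-recipient headers are what differ. The `MailStore` class and its fields are hypothetical, not any real mail server's storage layout:

```python
import hashlib


class MailStore:
    """Toy single-instance store: each distinct body is kept once."""

    def __init__(self):
        self.bodies = {}    # body digest -> body bytes, stored once
        self.messages = []  # one small record per delivery: (to, date, digest)

    def deliver(self, to, date, body):
        digest = hashlib.sha256(body).hexdigest()
        self.bodies.setdefault(digest, body)  # only stored if not seen before
        self.messages.append((to, date, digest))

    def stored_bytes(self):
        return sum(len(b) for b in self.bodies.values())


mail = MailStore()
body = b"Quarterly results attached. " * 1000
for n in range(200):
    mail.deliver(f"user{n}@example.com", "2006-04-06", body)
```

    After 200 deliveries the store holds one copy of the body plus 200 small header records.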

  67. MOD PARENT DOWN by gEvil+(beta) · · Score: 4, Funny

    Mod parent down! Nobody needs to see goatse again...

    --
    This guy's the limit!
    1. Re:MOD PARENT DOWN by Just+Some+Guy · · Score: 0, Troll
      That would have been:

      O

      --
      Dewey, what part of this looks like authorities should be involved?
    2. Re:MOD PARENT DOWN by ryanvm · · Score: 1

      I thought it was a nipple. Eh - I guess you see what you want to.

  68. It's not BS, read the summary again. by Phishcast · · Score: 1
    A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key.

    All I see in the replies is mathematical Shannon limits and how this is snake oil. It's not about compressing my 650MB ISO image down to 5MB. Say it with me people, de-duplication. This works especially well in the backup to disk space. Think about it, I'm doing incremental backups every day and full backups on the weekends. The vast majority of my second, third, and nth full backups are comprised of data I've already stored. Why store it again? Perhaps compression is the wrong word for it, but essentially you're storing many times what you could store without a de-duplication appliance.

    This applies best to backup-to-disk scenarios but it's not limited to them. Another example, say an email with a 10MB Word/OpenOffice document attachment goes out to the whole company. 200 people save the attachment to their H: drive (sorry, /home/user around here). That's 2GB of space. With the method employed here I store the document once and then only store pointers to it. Your effective compression ratio is 200:1.

    A step further, this can be applied at the block level rather than the file level. One of the 200 people above could change 1MB of the document. I only need to store that 1MB of changed data.

    This stuff works, and it uses methods that have been around a long time. Don't be so quick to yell bullshit without understanding what's going on underneath the covers.
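
    Here is a toy version of the block-level case, scaled down from 10MB documents to 10KB so it runs instantly. It assumes fixed-size blocks keyed by a SHA-256 digest; the block size and the function names are my own choices for illustration, and real appliances likely chunk and index differently:

```python
import hashlib

BLOCK = 1024  # assumed block size for this toy example


def store_file(blocks, data):
    """Split data into blocks, keep each distinct block once, return the file's recipe."""
    recipe = []
    for offset in range(0, len(data), BLOCK):
        chunk = data[offset:offset + BLOCK]
        key = hashlib.sha256(chunk).digest()
        blocks.setdefault(key, chunk)  # a repeated block costs only a pointer
        recipe.append(key)
    return recipe


def restore_file(blocks, recipe):
    """Reassemble a file from its list of block keys."""
    return b"".join(blocks[key] for key in recipe)


# A 10-block document saved by 200 people; one of them edits a single block.
doc = b"".join(bytes([i]) * BLOCK for i in range(10))
edited = doc[:BLOCK] + b"\xff" * BLOCK + doc[2 * BLOCK:]

blocks = {}
recipes = [store_file(blocks, doc) for _ in range(199)]
recipes.append(store_file(blocks, edited))

logical_blocks = sum(len(r) for r in recipes)  # what users think they stored
ratio = logical_blocks / len(blocks)           # versus blocks actually kept
```

    Only the ten original blocks plus the one changed block are kept, and every copy can still be restored from its recipe.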

  69. I use a competitor's product at work.. by cluon · · Score: 1

    I use a pair of similar devices at work, and they do get the job done for what they're designed to do--backups. If you're not saving the same data over and over again, then you'll never see any better compression than what you'd get with gzip.

    The drives are typically set up over NFS or CIFS as a disk-based storage addition to your backup software (think NetBackup or Networker here). Our environment is currently seeing about 15x compression over our retention period. Increasing that retention period would increase our ratio.

    If you're thinking about getting one of these, keep in mind that an initial full backup of your environment will have to fit into the native storage on the device. The savings are seen when you do your next full backup, and the next, and the next. But if you're trying to fit 5TB native onto a 3TB storage device, you'll never even get off the ground.

  70. This Technology IS in wide use already by civik · · Score: 1

    This is old news. Capacity Optimized Storage has been out and in use for several years now.

    I just got through evaluating de-duplication products for the company I work for, and contrary to what the article states, this technology is indeed being used today. Three main companies are vendors of de-duplication compression: Diligent, Data Domain, and Avamar. After looking at this stuff for 3 months, I can tell you it ISN'T smoke and mirrors and the technology DOES work. As some other posters have correctly stated, however, the compression factor increases based on how often the devices see the same data. For example, the first time you back up a SQL server to the device, you might get a more reasonable 4:1 compression ratio, but on subsequent backups you get 98% compression on that same data.

    Diligent is a fibre-attached gateway that will use whatever LUN is presented to it in a SAN. Its weakness is that the powerhouse features such as replication/open file system seem to be vaporware, limiting its usefulness for any serious applications.

    Avamar is a rip-n-replace soup-to-nuts replacement for your existing backup infrastructure. That is its strength and its weakness. No company with a Veritas Netbackup infrastructure that it has built over the course of a decade is going to tear it out overnight to replace it. However it is a VERY cool product as it does all the compression on the client end so backing up a 20gb server might take 6 hours the first time but subsequent times the same server will back up in like 6 MINUTES.

    Data Domain basically makes hardware appliances that do the same type of compression on the hardware end; they make devices that work with existing infrastructure. Basically you target the Data Domain box instead of your tape library. They have the advantage of basically being a 'snap-in' replacement for tape. The disadvantage is that unlike Avamar you are still piping ALL the data off of the server to the device where the compression happens. They have a good replication system, and seem to be a neck ahead of the other vendors in the COS horse race.

    COS is great technology and it solves a lot of backup headache.

    --
    Make it a malt liquor. I want to be as clever and handsome as possible.
  71. I do this all the time! by Ancient_Hacker · · Score: 1

    I do this all the time. I have a 23 megabyte tar file on my computer. Boss calls and says he needs it right away. I drag it to my USB flash memory drive. Copies over in a FLASH! I rush to work, stumble in to the boss's office, plug in my memory stick, and voila! Windows had dragged a 1.2kb shortcut .lnk file instead of the 23 megabyte original. Much grumbling. Now THAT's compression!

  72. Did I see a redundant thread? by NRAdude · · Score: 0

    Network Redundancy Administration here.

    When I raised a pig for agricultural purposes, it could compress 2.5 pounds of feed into 1 pound of weight gain.

    It's too bad the pig was sold; raised for FFA, but assimilated to a 4H member.

    Only problem with all that compression is the mice were always in the feed to spread virus, and the pig had to be de-wormed every 3 months or so (IIRC).

    No different than what we at NRA deal with on house-calls; just a bunch of lazy pigs that want their OS cleaned and their Computer assembly smelling like lemon.

    your friend,
    Network Redundancy Administration, dude!
    Gregory-Thomas

    --
    without prejudice
  73. Forgot to mention... by Phishcast · · Score: 1

    Check out Data Domain for a similar product. There are other people doing this stuff.

  74. There's a lot of BS being spoken here, but... by Expert+Determination · · Score: 1
    ...not the BS you think.

    Look, there is a nearly trivial theorem that says you can't put more than N pigeons into N pigeon holes with no more than one pigeon per hole. And from this it can be deduced that there is no algorithm that is guaranteed to compress any N-bit stream into one with fewer than N bits. But a useful compression algorithm doesn't need to compress every single bitstream. It just needs to be able to compress the kinds of streams that come up in real life. This is a tiny fraction of the total number of streams that could possibly appear. So the standard no-go theorem does nothing whatsoever to prove that there isn't a useful 25x compression algorithm.

    Having said that, this article is pure BS simply because it implies the existence of an algorithm that does an amazing job at characterizing the kinds of strings that might come up in real life. I don't believe that anyone can do that job as well as this story implies. And that's why I don't believe it, not because of some oh-so-smart-but-ultimately-useless theorem that people are bandying around to show how clever they are.

    --
    "The White House is not an intelligence-gathering agency," -- Scott McClellan, Whitehouse spokesman.
  75. Okay I realized what they did... by JollyFinn · · Score: 1

    They are doing CVS-style storage for backups. For instance, instead of storing the system state 100 times, you store 1 system state and 100 diffs against it. Of course, some compression is applied to the base state and the diffs. It looks like they also compress across multiple machines. So they are just applying compression at a scale and in a place where it isn't normally done: you normally don't compress across multiple backup generations, nor across multiple workstations. With 30 backups of 25 developer workstations, the dataset has so much redundancy that I'd be surprised if the compression ratio were only 25x. Here's a good question: how much do multiple backups help once they've gone through that compressor? Perhaps they help if you need to get back to a specific state to undo something that happened after a certain backup. There is also the problem that if ONE base set goes bad, *ALL* backups of all workstations go bad. The good news is that they probably have some redundant RAID 1 style system below this compression layer. And taking a tape backup of the compressed dataset every now and then would make it reasonable to keep a tape backup of ALL data on 100+ workstations at the end of every day, depending on how much data differs between workstations and how much changes on each workstation.

    --
    Emacs is good operating system, but it has one flaw: Its text editor could be better.
  76. Probably a commercial version of LZIP by Gothmolly · · Score: 1

    Currently the sourceforge site is down, but LZIP allows you to specify an arbitrary compression level that you want, and the algorithm picks through the data set until it reaches it. Further discussion is here.

    --
    I want to delete my account but Slashdot doesn't allow it.
  77. 25x is NOTHING! by jthill · · Score: 1
    Bzip2 uses the BWT, which, ON RANDOM DATA, crams a terabyte of sort key into a megabyte buffer EVEN BEFORE IT STARTS!

    You think I'm making this up, don't you.

    Jim

    --
    As always, all IMO. Insert "I think" everywhere grammatically possible.
  78. shannon limit anyone helllo by Anonymous Coward · · Score: 0

    clearly fud hype. its not possible. its been proven.

  79. Re: proposed compression method by Anonymous Coward · · Score: 0, Interesting
    I propose the following algorithm:

    Compression:
    1. Search pi until you find your data. (*)
    2. Record the length and offset of the match. (**)
    3. Prophet.

    Decompression:
    1. Use the base-16 digit extraction method to recover the stored data.
    2. Profit.

    Obvious variants:
    • Split the data into chunks and encode a list of offsets (i.e. trade off encoding time for compression ratio).
    • Search the compression stream (PI) and the data for compression ratios of a preset threshold (e.g. >=1000x); optionally attempting to span the gaps created by previously compressed data (e.g. if you compressed "cdef" in "abcdefghi", then you might try to match subsets of "abghi" next) -- it just requires a slightly more complex encoding format.

    * = You'll need step 3 to find step 1 in a reasonable amount of time.
    ** = It would be fscking hilarious if someone were to prove that you can't always find a given chunk of data within {original-size} bytes of pi, so the offset might be bigger than the original data, and this algorithm wouldn't even be guaranteed to actually compress your data.

    p.s. If I catch anyone trying to patent this, I'll refer the patent office to this post as prior art and also reveal the value that yields this hash: cb775b9b061b03e8666819ede2181d2e. Anyone that cracks this will get a chuckle. ;-)
  80. They can do fine without Canterbury Corpus Test by RedLaggedTeut · · Score: 1

    The Canterbury Corpus Compression Test primarily measures the compression ratio, not the speed.

    They are still in business if they have an algorithm that, when receiving repeat data over long periods of time, is faster than the two obvious approaches, e.g.
    a) uncompressing all related backups.tar.bz2 received so far, appending the new backup.tar, then compressing to new backups.tar.bz2
    b) uncompressing a vcs like subversion.tar.bz2, committing the update, then compressing it to new subversion.tar.bz2 again.

    Also, the methods above only work well on data that is somewhat sorted between users submitting it to backup.

    --
    I'm still trying to figure out what people mean by 'social skills' here.
  81. compress /home/user/same-old-crap-again-v3.ppt by billstewart · · Score: 1
    "/home/user/same-old-crap-again-v3.ppt is the same proposal as /home/user/same-old-crap-again-v2.ppt except the shipping date slipped by a year!"

    Sure, I don't see 25:1 happening for arbitrary data types, but in the corporate market there is a lot of redundancy if you're clever enough to be able to identify it, especially for corporations that are large Microsoft Office + Microsoft Outlook users (which is to say "most large corporations".) A lot of the documentation is the same file attachments getting sent around to multiple people, often kept in Exchange mail servers as opposed to individual desktops, or documents that substantially re-use previous documents. Depending on how granular you want to be and how entrenched in the more bloated Microsoft formats you want your code to be, you may be able to find most of your document already in storage, as long as you've got indexing capabilities to look for it. Maybe you just look for hashes of whole documents, or maybe you look for documents with similar names and internal tags and start comparing pages.

    Video compression is well-known to use this kind of approach - you've got an initial frame with reasonably-high resolution picture, then you track the changes, usually by some model that breaks the picture down into objects that move a bit. ... And then there's music compression "It's the same old country song with the same three chords in G, she's left him and she ain't coming back, except there's this little 6-note riff at the end of the chorus when he says she took his dawg with her too."

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  82. Might work for typical back-up by porttikivi · · Score: 2, Informative

    The article talks about backup. The idea could be, that instead of managing incremental backups you just optimize compression of data that is similar to old data. In that way you can do "full" backups, but actually save only incremental backup worth of data.

    See http://en.wikipedia.org/wiki/Venti for similar ideas in a system that easily achieves 25x compression for typical archival storage. When a file has been changed only those 512 kbyte blocks that are really new are saved, other blocks are just mapped by their SHA1 hashes to existing blocks. So files with small changes, very similar files and files sharing common parts will all compress very nicely. In a multi-user system the files of different users tend to also have lots of similar parts: same emails, same office documents with perhaps minor changes, same reference material / tools / libraries as personal copies etc.

    My guess is TFA refers to a re-invention of this wheel, most likely in an inferior way.
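
    A small simulation of the "full backups that cost only an incremental" effect, assuming a write-once store addressed by SHA-1 in the spirit of Venti. This is not Venti's real interface or on-disk format; the dataset (100 blocks, one changing per night, 30 nights) is invented to show the arithmetic:

```python
import hashlib

store = {}  # SHA-1 hex digest -> block contents, written at most once


def write_block(block):
    """Store a block under its hash and return that hash as its address."""
    score = hashlib.sha1(block).hexdigest()
    store.setdefault(score, block)
    return score


def full_backup(blocks):
    """A 'full' backup is just the list of addresses of every block."""
    return [write_block(b) for b in blocks]


data = [f"block {i} version 0".encode() for i in range(100)]
backups = []
for night in range(30):
    if night > 0:
        # One block changes between consecutive nights.
        data[night] = f"block {night} version {night}".encode()
    backups.append(full_backup(data))

logical = sum(len(b) for b in backups)  # blocks written across all 30 fulls
ratio = logical / len(store)            # versus blocks actually stored
```

    Thirty fulls reference 3000 blocks but only 129 distinct ones ever get stored, so the ratio lands a little above 23x without any clever compressor at all.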

    --
    Anssi Porttikivi / app@iki.fi
  83. Quantum Compression by DJScrib · · Score: 1

    One thing I've always wondered about with compression is if Quantum Computers ever become a reality. Now people always say a quantum computer can try every permutation for a bitset simultaneously cracking encryption ciphers instantly. So my question is if you had a large piece of data (such as an image taken by a voyager probe), could you generate a bunch of different checksums/hashes using algorithms and transmit those to a computer. Then on the computer try every possibility until you find the possibilities which all generate the transmitted checksums. Finally then run some AI on possible photos and rule out the obviously garbage ones until you find the correct one. (Figuring random data won't show a meaningful picture). So could that work if the all-powerful quantum computer becomes a reality?

    1. Re:Quantum Compression by Anonymous Coward · · Score: 0

      Well, it would kind-of work. Matter of fact, it would work exactly as you think it would. You haven't followed through on the exact implications though:

          I have 100kB jpeg file == 819,200 bits.

          I hash it down to 160 bits using SHA-1 and transmit that to my quantum computer.

          The quantum computer works out all the possible input files that could have generated that hash. If we don't send the length of the original file along with the hash data, that is of course an infinite number - even quantum crypto can't help you there! So let's say it's only all possible files of the same length that the quantum computer tests.

          Because the hash is 160 bits, and assuming it's an effectively random mapping of inputs to outputs, we know that one in every 2^160 of those files generates the same hash.

          So we still have 2^(819200 - 160) different files to choose between. That's going to take a while, even with a bit of AI support. You might as well just not bother to send the hash at all, and run your selection algorithm over all possible image files.
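
      To get a feel for how big that leftover set is, here is the same arithmetic in Python. It assumes, as above, a 100kB file and a 160-bit hash that behaves like a random mapping; the variable names are just for illustration:

```python
import math

file_bits = 100 * 1024 * 8  # the 100kB file, in bits
hash_bits = 160             # size of the transmitted hash

# On average one file in 2**hash_bits matches, so this many bits of
# uncertainty remain after the hash has been checked.
remaining_bits = file_bits - hash_bits

# Length in decimal digits of 2**remaining_bits, computed with logarithms
# because printing a number this large is impractical.
decimal_digits = math.floor(remaining_bits * math.log10(2)) + 1
```

      The count of matching candidate files is a number with well over two hundred thousand decimal digits, so the hash has barely narrowed anything down.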

  84. Re:right. sure. by Fulcrum2000 · · Score: 1

    Correct, the two best compression programs in the world are PAQ8H (command-line) and WinRK 3.0.3 (GUI). But both are very, very slow. Compressing 300 MB takes over 6 hours on an AMD 2800+

  85. The article is wrong by Anonymous Coward · · Score: 0

    I'm not here to defend Diligent Technologies. Their claim of 25x is well worded marketing crap. That being said, they make no claims towards 25x compression. That was done by the author Robin at StorageMojo. Diligent claims to enable an effective capacity increase of disk systems by 25 times or more. A very weak claim when you look into the specifics, but not at all the claim of 25x compression being spread by StorageMojo. This is more of an example of a lie being spread by someone who did not check their facts.

    If you look at the comments at that site, someone has already pointed this out. Robin's weak reply was:

    "Well, I'm looking at a document from them that says "Reduce Required Backup Storage Capacity by 25X With 100% Data Integrity." Whether that is better compression, or better backup, I'll leave to others to decide. But if they can really do it, even if it is only 10x in practice, it is still huge compared to existing technology."

    This is just another example of the bad side of the blogosphere. This is starting to piss me off.

  86. Entirely possible by Coward+Anonymous · · Score: 2, Informative

    This is entirely possible and they are not the only ones doing it, for example http://www.datadomain.com/ has been doing it for a while. The big storage vendors do it to some extent as well.
    The idea is based on "de-duplication" of data and is only really practical for backups (where most data from backup to backup is identical) or central repositories of data for a large organization that has multiple similar data sets, for example, many installations of Windows that are often similar.
    From my experience x25 is a bold claim for general data. I've seen small scale tests that showed x30 compression over backup sets but those implementations had performance issues.
    From the description in their white-paper, despite their claims, it appears they are performing some kind of hash by definition (i.e. mapping a space to a smaller space).

  87. Performance, reliability, DATA transfer rates by PhYrE2k2 · · Score: 1

    The tradeoff is always performance when creating either compression or redundancy, as well as reliability.

    Usually with more advanced compression comes more information dependency, lessening the chances of recovering a partial archive, a partial file, or dealing with any damage. This could make it bad for tape storage or anything that can have small portions damaged (CDs, etc)

    Additionally there's performance. Of course we can all compress and compress based on dictionary sizes and algorithms specific to any application, but as those dictionary sizes get bigger, the huge amount of memory and processor power is needed. Think about an algorithm that depends on 50% of a file... Huge calculations, and having to load the whole file into memory in order to work on only a small part of it.

    We keep pushing the data transfer envelope, so why are we caring so much about packing compression? Think of the past 10 years: 10Mbit, 100Mbit, 1Gbit, 10Gbit have all had their day, and many are gone. Circuit speeds are increasing as well as wireless transmission speeds. Disc subsystems have enough trouble keeping up with Ethernet! You can't compress data as fast as you need to send it out, nor even read it from the disk!

    Compression was super in the day of floppy disks and 9600bps modems, but it hasn't evolved much. ZIP and RAR are still what they were. Other formats from the early 90's have mostly disappeared (LHA, ARC, etc) as, despite better compression, they just aren't needed.

    Want proof? You can download your latest movie in XVid (700MB) or DVD-R (4.3GB) with similar quality- why are so many people downloading the DVD? Bandwidth at the consumer level is cheap and abundant. Broadband is everywhere. [note: I know the argument is that this compression is lossy and file compression isn't-- but the point is that bandwidth makes pulling extra data something of non-concern, compared to the user's processing time and interest].

    -M

    --

    when you see the word 'Linux', drink!
  88. Similar experience... by jwiegley · · Score: 1

    So back in 1998 I started work for a company that had an interest in video streaming. There was some company that claimed to have a system that could broadcast streaming video and audio in realtime over a 28Kbps modem link with no visible degradation of quality and no lag. All built around hype such as "we've acquired the brightest signal processing engineer in the business who has made an astounding breakthrough."

    So a colleague and I flew up to San Jose to research the product and attend a by-invitation-only demonstration. There were two black box endpoints set up across a serial modem. One end point was just an NTSC stream digitized by their system and provided to the serial link. The other end point supposedly uncompressed the stream and built an NTSC stream.

    To test lag I asked them to yank the video cable from the digitizer and sure enough the receiving end instantly tracked the change.

    Everybody at the demonstration was under full draconian non-disclosure agreements. I asked for them to take the cover off the boxes so that we could verify that there did indeed exist some sort of reasonable computation processing going on and not some sort of standard RF video transceiver link hidden in the boxes. They said absolutely not on the basis of trade secrets.

    Of course we went home and never thought about their company again. Funny, I never saw any news about any adoption of their "fantastic, revolutionary" product.

    Another story from the "If it's too good to be true..." department.

    --
    I will never live for sake of another man, nor ask another man to live for mine.
    1. Re:Similar experience... by Xeger · · Score: 1

      You may not have heard from them again, but others did. They were eventually proven to be a huge fraud -- all of their demonstrations were hoaxed (including one famous case where they had to run coax across a river bed to rig a supposed "long distance" demo).

      I wish I could remember names or give links, but the details escape me. I read an article on the web about 2 years ago.

  89. Re:right. sure. by thc69 · · Score: 1

    Didn't you know? "A combination of de-duplication and calculating and storing only the changes between similar byte streams" IS a breakthrough. No previous compression algorithm ever did that sort of thing before...

    --
    Procrastination -- because good things come to those who wait.
  90. MOD PARENT INSIGHTFUL by Spaceman40 · · Score: 1

    That is all.

    --
    I [may] disapprove of what you say, but I will defend to the death your right to say it.
  91. Actually ... by Anonymous Coward · · Score: 0

    >> I'm sure with the right developer, Linux could also be used to harness zero point energy, create wormholes for travel in your basement, and possibly cure most diseases... /wink

    Well, you're just looking for karma there, but actually ...

    If you think that those things you mention are impossible (or even just unlikely), then you haven't been paying attention to the way in which science continually reinvents the meaning of "impossible".

    Here's a rather more likely worldview: nothing is impossible (and I do mean *nothing*, even logical impossibilities), given enough time. Even logical impossibilities just need a re-examining from a different angle. (Don't forget Godel -- the rules change when you examine a logical system from outside its domain.)

    The vanquishing of "impossible" is actually a consequence of the fact that we cannot observe the structure of reality or nature directly, but only her behaviour, ie. the way she responds to our stimuli. As a result, the "impossibilities" that our primitive theories conjure up tend to evaporate over time, as new expanded theories replace them. And this will never end it seems. There is no reason to believe that any particular observation is a fundamental one.

    So, pretty much nothing is impossible. :-)

    1. Re:Actually ... by TheNetAvenger · · Score: 1

      If you think that those things you mention are impossible (or even just unlikely), then you haven't been paying attention to the way in which science continually reinvents the meaning of "impossible".


      I never said they were impossible, the humor was that Linux was the only answer to achieving them. Get it?

      Zero Point Energy has massive potential, and is there and will be eventually harnessed. Wormholes are still possible in theory although the newer version of M theory tends to make them less likely to be a method of travel.

      Everything is possible. I have even written papers on pattern key based encryption concepts that could advance compression beyond current theoretical limits. I have even written some preliminary key compression systems, but NOT on Linux.

      The whole story became even MORE funny when the person that wrote the Slashdot blurb thought that it running on Linux was something that was at all relevant.

      Advanced concepts of compression are not language dependent, let alone OS dependent. Nor is the OS even relevant, except to the (cough) poster of the article, who apparently wanted everyone to see how 'wonderful' Linux is and thinks that there is a correlation to 'great advances in technology'.(cough)

      Take Care...

    2. Re:Actually ... by ldj · · Score: 1
      Thank you, Mr. Obvious! :)

      The deal is, your over-the-top response makes you sound no less foolish and insecure than the author of the article summary. I could be wrong (been known to happen), but I'm guessing there are practically zero knowledgeable Slashdot readers that would buy into the idea that a data compression algorithm can only be coded to run on a single OS. That you felt compelled to point out the relatively obvious in such a heavy-handed manner gives the impression that you are just as much anti-Linux as the summary author is pro-Linux.

      The above may or may not be true, but that's certainly the impression I was given. Just some food for thought.

      Take Care...

      You too! :)

      --
      Open Source: I'll show you mine if you show me yours.
    3. Re:Actually ... by TheNetAvenger · · Score: 1

      Wow...

      Ok, come on.. Do we really have to put.. "AND IT ALL RUNS ON LINUX" on every freaking article?

      Humor aside, there is a point there, take it or leave...

      I won't say "take care", as you apparently like to mock kindness and sincerity... Maybe it is something that exists outside your realm of reality.

    4. Re:Actually ... by ldj · · Score: 1
      Ok, come on.. Do we really have to put.. "AND IT ALL RUNS ON LINUX" on every freaking article?
      No, we don't. And I don't see it on every article. My response was just to let you know how you were coming across. Nothing more.

      I won't say "take care", as you apparently like to mock kindness and sincerity... Maybe it is something that exists outside your realm of reality.
      I think your shields are set a little too high. I quoted your "take care" because that is a common closure that I also use. And as with you, I do intend it to mean "hope things go well in your life." Sorry if that was misunderstood. Please don't stop using the phrase if you mean it. There's way too much ill feeling in the world as it is. We need as many "positive vibes" as we can get! :)

      --
      Open Source: I'll show you mine if you show me yours.
  92. Well.. by Anonymous Coward · · Score: 0

    That 25x compression is based on disk storage on archived media. Which makes sense: you back up a database, how much of it actually changes? I didn't see what their method is, but it is possible, at least for backup purposes.

    Good example: I have 20 webservers, the OS is the same on each server, but the configuration is different. (See where I'm going?) The software is smart enough to see that all the servers are the same but the configurations are different, so it doesn't have to back up the same directories in full each time.

    But for a long time, I've thought that keyed compression would lead to a better than 2x compression rate. I remember RLE and Bignum ASCII compression that would get entire ANSI sites down to less than 2k compressed, perfect for modems. All using keys.
    But that's a preset key, not a CPU-crunching processed key like bzip.

  93. This might work in the aggregate by Sarusa · · Score: 1

    I think what they're saying here is that if you're backing up an entire hosting site, or an entire company set of documents, information, etc, that you will find a lot of redundant content. Then you add normal streaming compression on top of that.

    So I can believe the 25x (as a generous/marketing figure) in this specific use case. It wouldn't work at all for compressing single files for distribution elsewhere because it requires that you have all the other documents as context.

    This would be very annoying to do on the fly as well (what if your 'base' document that 12000 other documents are similar to changes?), but again is well suited for backup or read-only media.

  94. Totally lame... by ConceptJunkie · · Score: 1

    Obviously nothing concrete or released yet so take with the requisite grain of salt.

    Come on, editors. There are people who believe the world is flat and stars are little candles in the air who are shaking their heads in disbelief over this article.

    No Digg!!!

    --
    You are in a maze of twisty little passages, all alike.
  95. rsync by boldi · · Score: 1

    You are absolutely right. If I pack my disk into a simple .tar and transfer it daily by rsync, I get better than 25:1 compression. Not too much changes every day; most of it is static data.

  96. Entropy by Anonymous Coward · · Score: 0

    You cannot (losslessly) compress data beyond its entropy, H = -sum_i p(i) log2 p(i), where p(i) is the probability of the ith symbol and H is the rate in bits per input symbol. From this, we know that we cannot compress equiprobable random bits *at all*, while a highly 'deterministic' data stream can approach 0 outbits/inbit (in the limit as input stream size goes to infinity).

    The amount you can theoretically compress depends on the input data.
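
    A minimal sketch of that formula in Python, assuming we treat each byte as one symbol and estimate p(i) from the data itself. The function name and the three sample inputs are made up for illustration. Note this is only the per-symbol (zeroth-order) estimate; it ignores any structure between symbols.

```python
import math
import os
from collections import Counter

def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order entropy estimate: sum of p(i) * log2(1 / p(i)) over byte values."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # log2(total / c) == -log2(p), so this is -sum p * log2 p
    return sum((c / total) * math.log2(total / c) for c in counts.values())

random_blob = os.urandom(1 << 16)        # should land very close to 8 bits/byte
constant_blob = b"\x00" * (1 << 16)      # one symbol only: 0 bits/byte
text_blob = b"the quick brown fox jumps over the lazy dog " * 1000

for name, blob in [("random", random_blob), ("constant", constant_blob), ("text", text_blob)]:
    print(name, round(shannon_entropy_bits_per_byte(blob), 3), "bits/byte")
```

    The repeated-sentence input shows the limitation: its per-byte entropy is a few bits, yet a real compressor does far better on it because the repetition is between symbols, not in the byte frequencies.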

  97. No patents? by cperciva · · Score: 1

    Looking at the company's website, I can't see any mention of patents -- either issued or pending. If they really don't have any patents, I don't think they're going to get very far: Compression is one of the most over-patented fields around.

    There aren't many details about how their product operates, but unless they've been extremely careful they probably infringe either the rsync patents (Pyne) or the blocklet patent (Williams).

  98. Crap... by x2A · · Score: 1

    I was gonna make some kinda 4th dimension joke about it using time to achieve it's compression ratio, rather than just compressing the amount of space used... but it sounds like that's actually pretty much true!

    Oh well, saves my brain power trying to word the joke :-p

    --
    The revolution will not be televised... but it will have a page on Wikipedia
    1. Re:Crap... by Anonymous Coward · · Score: 0

      to achieve it's compression ratio

      "its".

  99. Compression hoax number 3 by Futurepower(R) · · Score: 1

    This is the third time that I can recall that a Slashdot editor has accepted this same hoax.

    --
    Before, Saddam got Iraq oil profits & paid part to kill Iraqis. Now a few Americans share Iraq oil profits, & U.S. citizens pay to kill Iraqis. Improvement?

    1. Re:Compression hoax number 3 by bluephone · · Score: 2, Insightful

      News media around the world carried the "news" of the Raelians cloning a little girl. The vast majority of intelligent people knew it was crap, most average people assumed it was crap, and the news media all said to take it with a grain of salt and that they could secure no proof. News is news, whether it's news of a real advance, or news that a potentially reliable source is making astounding claims. Only through the analysis of these claims can knowledge grow.

      --
      jX [ Make everything as simple as possible, but no simpler. - Einstein ]
    2. Re:Compression hoax number 3 by arodland · · Score: 1

      But even still there's a difference in scale. With the cloning hoax, the odds that some secretive scientist managed to make everything work out for a human cloning were one in a billion at best, but they were there. With this compression claim, the odds are one-in-an-"everything we know about math, starting from arithmetic, is wrong".

    3. Re:Compression hoax number 3 by bluephone · · Score: 1

      But now everyone who reads the comments here knows this is bunk, including people who aren't quite as mathematically/technologically adept, and they can talk about it with others, etc. Hoaxers and fraudsters grow in the shadows, so exposing them to the light is the best way to keep them at bay.

      --
      jX [ Make everything as simple as possible, but no simpler. - Einstein ]
  100. Re: proposed compression method by Haeleth · · Score: 1
    It would be fscking hilarious if someone were to prove that you can't always find a given chunk of data within {original-size} bytes of pi, so the offset might be bigger than the original data, and this algorithm wouldn't even be guaranteed to actually compress your data.

    Here is a chunk of data that cannot be compressed with your algorithm:
    1234
    Hilarious, eh?
  101. April Fools is OVER guys...! by sinewalker · · Score: 1

    I mean, come on! Can we please stop with the stupid /. articles and get on with nerd news? The past week's been ridiculous.

    --
    “Our opponent is an alien starship packed with nuclear bombs. We have a protractor.” — Neal Stepnenso
  102. Vaporware video applications would be possible! by Temsi · · Score: 1

    Now, if this were true in any way, shape or form, I personally know several people who would have access to BILLIONS of dollars in development for video applications and television broadcasting applications, not to mention feature film distribution and/or production.

    Currently, uncompressed High Definition video requires enormous storage, as well as massive bandwidth to play in realtime.
    With a 25:1 data compression scheme (no image degradation), any laptop would be able to store hours of High Definition video.
    If this can truly compress anything, then encoded video shouldn't be a problem. Which means a 25 Mbit/s video (DV video format) could be downloaded as a 1 Mbit/s stream...

    25x compression would allow lossless compression of 4k video to be stored on regular miniDV tapes...

    Now, having said that, I think I can say with a lot of certainty that the entire story is BULLSHIT, therefore none of what I just wrote will happen anyway, so...

    --
    -- This sig for rent.
  103. Re:100X - 1000X by andywww · · Score: 1

    There have been many posts criticizing this as vaporware, and only 2 posts explaining why it doesn't have to be.
    The problem is more in the summary article (both the slashdot summary and the linked article) than in the feasibility of the technology. Rather than compressing a dataset 25:1, the company reduces the amount of space needed to back up a dataset by eliminating some redundancy.
    Repeat: not data compression, but a backup technique. That's why it's not for home users.
    It bothers me how many modpoints the trolls have gotten.

  104. Well that's not surprising. by Ayanami+Rei · · Score: 5, Informative

    That's called the law of large numbers.
    Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
    Given a large enough set of backups and enough time, the potential size savings is enormous.

    Veritas should really be implementing this themselves, though.

    And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.
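
    A rough sketch of the rolling-checksum trick mentioned above, in the spirit of the weak checksum rsync uses. This is not rsync's actual code: the function names are made up, the modulus is an assumption, and a direct byte comparison stands in for the strong hash a real tool would use to confirm a match. The point is that sliding the window by one byte costs O(1), so blocks of the old file can be found at any byte offset in the new one.

```python
M = 1 << 16  # modulus for both running sums (an assumption for this sketch)

def weak_checksum(block: bytes):
    """a = sum of bytes, b = position-weighted sum (first byte weighted most)."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one position: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def find_old_blocks(old: bytes, new: bytes, n: int):
    """Map offsets in `new` to offsets of identical n-byte blocks of `old`."""
    table = {}
    for off in range(0, len(old) - n + 1, n):
        table.setdefault(weak_checksum(old[off:off + n]), []).append(off)
    matches = {}
    if len(new) < n:
        return matches
    a, b = weak_checksum(new[:n])
    pos = 0
    while True:
        for off in table.get((a, b), ()):
            # Byte comparison stands in for a strong hash check.
            if old[off:off + n] == new[pos:pos + n]:
                matches[pos] = off
                break
        if pos + n >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + n], n)
        pos += 1
    return matches

old = b"AAAABBBBCCCCDDDD"
new = b"xAAAABBBByyCCCCDDDD"   # same blocks, shifted by inserted bytes
print(find_old_blocks(old, new, 4))
```

    Everything in `new` that matches can be sent as a short reference to the old block; only the inserted bytes need to travel or be stored.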

    --
    THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
    1. Re:Well that's not surprising. by GigsVT · · Score: 1

      the potential size savings is enormous.


      Yep. We have a 1TB+ archive that we back up every hour with rsync.

      Through the magic of rsync-incremental, we have snapshots of what the archive looked like from present day to 60 days ago, pick any day within the last two months and I can get you a copy of what the archive was then. All this takes up less than 2TB of space. rdiff-backup would be even more efficient on certain datasets than rsync-incremental is.

      Doing conventional backups (full+forward delta incrementals) on that much data over the network would take days to do a full backup, with rsync we never need to do a full backup, the amount of data transfer saved over conventional backups is more like 1000x.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
  105. It works, but... by Anonymous Coward · · Score: 0

    OK, doing the "check the incoming stream against what we already have and only store the difference" thing might just work. But it could be dangerous. Imagine losing just *one* of those differences along the way that you'll need to reconstruct the "uncompressed" file. Think RAID striping without parity or redundancy -- lose one drive and...

  106. Diligent at Storage Networking World by robnsara · · Score: 1

    Their spiel here (I've visited their booth at SNW as well) is that it is primarily used as part of a virtual tape solution. Their software sits on a Linux box (they recommend a quad-Opteron with lots of RAM), emulates a tape library, then passes data to your backend SAN storage.

    The technique they use for compression/data de-duplication seems to be in a similar vein to stuff used by Data Domain and other WAFS type solutions, just on a higher-bandwidth model.

    If I recall correctly, Diligent is made up of some spinoff guys from EMC. (correct me if I'm wrong)

  107. Actually, I once tried that. by MickLinux · · Score: 5, Interesting

    I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.

    So then I tried it with LZW compression, and it still eventually grew in size.

    The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.

    What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.

    Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.

    So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.

    Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.

    And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.
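
    The experiment above is easy to repeat. A small sketch, with the caveat that it uses Python's zlib (DEFLATE, which is LZ77 plus Huffman coding) rather than a plain Huffman coder, and that the input is synthetic text generated from a made-up word list with a fixed seed:

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so runs are repeatable
words = [b"backup", b"restore", b"tape", b"delta",
         b"block", b"hash", b"index", b"server"]
data = b" ".join(rng.choices(words, k=50_000))

# Feed the compressor its own output a few times and record the sizes.
sizes = [len(data)]
blob = data
for _ in range(5):
    blob = zlib.compress(blob, 9)
    sizes.append(len(blob))

for n, size in enumerate(sizes):
    print(f"after {n} passes: {size} bytes")

# Undo the five passes to confirm nothing was lost along the way.
restored = blob
for _ in range(5):
    restored = zlib.decompress(restored)
```

    Typically the first pass does nearly all the work and later passes hover around the same size or grow slightly, since each one adds its own header to data that already looks random.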

    --
    Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
    1. Re:Actually, I once tried that. by albieomoss · · Score: 1

      finally someone makes the connection between entropy and data compression.

      --
      DankLogic - There is a system to everything.
    2. Re:Actually, I once tried that. by Anonymous Coward · · Score: 0

      What were you measuring with that histogram? Surely it wasn't simply a histogram with 256 slots showing how many times 0..255 showed up.

      Otherwise, you could have made a table, and shuffled the byte values, and attempted to compress again because you'd no longer have a bell curve.

    3. Re:Actually, I once tried that. by alexhs · · Score: 1

      I guess it was an adaptive Huffman algorithm, or else it sounds weird.

      AFAIK, the classical Huffman algorithm essentially can't compress its output at all; it has been proven to be the optimal algorithm when converting each symbol (usually 8 bits) into an integer number of bits.

      --
      I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
    4. Re:Actually, I once tried that. by RovingSlug · · Score: 1

      Rather than an argument for entropy, I find it easier to think of the set of all possible files all at once. Seriously.

      A lossless compression algorithm must have a unique mapping (1:1) between input and output. Think of it as just a shuffling all possible output strings against all possible input strings.

      There are always more, different input strings of length N than there are total strings from length 1 to length N-1. It is mathematically impossible to correspond all possible inputs of a certain length with output of strictly shorter length.

      It's also easy to show (by induction) that if you guarantee the output string is never longer than the input string, then you also guarantee that the output string can also never be shorter than the input string. That is, to guarantee your compression factor is strictly >=1 actually guarantees compression factor is =1.

      So designing a compression algorithm comes down to mapping the common inputs to shorter outputs, and displacing uncommon inputs with longer outputs. Necessarily.

      1) choose what you want to compress, 2) choose what you don't want to compress, 3) design your algorithm accordingly.

    5. Re:Actually, I once tried that. by citizenr · · Score: 0

      OK, and what if you inject "predictable entropy" (is that an oxymoron?) between compressor cycles? I am talking about XORing with known string(s), like a Fibonacci string or primes? Or something more advanced than XORing, but still predictable. What would happen then?

      --
      Who logs in to gdm? Not I, said the duck.
    6. Re:Actually, I once tried that. by Anonymous Coward · · Score: 0

      " What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics. "

          Actually, the best Huffman compression of all would come when the histogram looks like a delta function - i.e. all the bins contain zero entries except for one. That is to say, long runs of just a single symbol.

    7. Re:Actually, I once tried that. by MickLinux · · Score: 1

      If it's truly entropic, then the xor will still come out entropic. In other words, you won't get anything useful.

      --
      Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
    8. Re:Actually, I once tried that. by MickLinux · · Score: 1

      And yes, predictable entropy is an oxymoron (except for God).

      Try this: Flip a coin 50 times, and record it. You should have a fairly even distribution. Now, XOR each one with the answer before. You should still get an even distribution. That's XORing random with random.

      Now, try XORing the first 50 flips with 1, and the second with zero. The result should still come out random. It would, wouldn't it?

      Then try XORing it with alternating 1s and zeros (1 0 1 0 1 0 1 0 1). That case is no different than the one in the paragraph before. So it should still come out random.

      Now, no matter how more complicated you make your predictable pattern, it's not going to be essentially any different than XORing with 111111 000000.
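
      A quick way to try the coin-flip argument on a computer. This is only a sketch: `os.urandom` stands in for the coin, zlib stands in for "a compressor", and the three patterns (including the Fibonacci bytes, taken mod 256) are made up for illustration.

```python
import itertools
import os
import zlib

def xor_with_pattern(data: bytes, pattern: bytes) -> bytes:
    """XOR data with a repeating, fully predictable pattern (self-inverse)."""
    return bytes(d ^ p for d, p in zip(data, itertools.cycle(pattern)))

def fibonacci_bytes(n: int) -> bytes:
    """Low bytes of the first n Fibonacci numbers."""
    out = bytearray()
    a, b = 0, 1
    for _ in range(n):
        out.append(a)
        a, b = b, (a + b) & 0xFF
    return bytes(out)

random_data = os.urandom(65536)
patterns = {
    "all ones": b"\xff",
    "alternating": b"\xaa\x55",
    "fibonacci": fibonacci_bytes(1024),
}

print("raw:", len(zlib.compress(random_data, 9)))
for name, pattern in patterns.items():
    mixed = xor_with_pattern(random_data, pattern)
    print(name, len(zlib.compress(mixed, 9)))
```

      Each XORed version is still uniformly random, so the compressor has nothing to grab; the output stays a few bytes larger than the 65536-byte input in every case.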

      --
      Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
    9. Re:Actually, I once tried that. by citizenr · · Score: 0

      XORing with one and the same byte yes, and what about XORing with fibonacci string? :)

      Anyway you are paying too much attention to the "XOR" part of my post, and too little to the idea of manipulating the data in a predictable, yet messy way. Take for example a Fibonacci string - you could shuffle bytes around according to it, ROT different parts with different Fibonacci offsets, then shuffle some more, then XOR some parts with other ROTed parts, then shuffle some more - all this in a small reversible procedure. After all this the input data would hardly resemble what it was before. .. or take that data, and simply encrypt it with a big key; doesn't look the same, compresses again.

      This idea has been haunting me since C64 fastloader coding days. Can you direct me to some place with sample code to make those histograms you spoke about earlier? Also some theory behind those would be nice.

      --
      Who logs in to gdm? Not I, said the duck.
  108. Re:100X - 1000X by Anonymous Coward · · Score: 0

    Keep compressing till you have a couple of bytes?

  109. Re: proposed compression method by Anonymous Coward · · Score: 0

    Yes, you provided a base 10 counterexample: the string 1234 occurs at position 13,807 counting from the first digit after the decimal point. But then again, we all already knew that such examples exist, because we've all been to the base-10 pi-searching webpages; I could have saved you the trouble and listed some in my original post, but I was thinking base 2, not base 10 (Hint: the base 10 search space is less dense than the base 2 search space).

    Anyway, (1) the counterexamples in base 10 do not prove that the numbers can't be compressed in say base 2 , and (2) even if you were to disprove it in base 2, all we have to do is throw in a few more common transcendental numbers and encode which number we're indexing. To fully disprove the method, you'd have to show that there are counterexamples in an arbitrary number of transcendental numbers.

    For example, can you also find an example that doesn't appear in the first 1234 digits of e, e^pi, and ln2? That would only add 2 bits. If we're willing to accept more overhead or include a delimiter, I could create a gigantic table of transcendental numbers to be used. Then just encode the smallest value of log2(index-of-transcendental-number) + delimiter + log2(offset-in-the-chosen-number).

    p.s. Besides, nobody really cares about compressing small files. They're only important when there are a lot of them, and then you just tar 'em together.

    $ echo 1234 | wc -c
                5
    $ echo 1234 | gzip --best | wc -c
              25
    $ echo 1234 | bzip2 --best | wc -c
              42

    Using the 1234 counterexample we might be tempted to throw out the baby with the bathwater and say that gzip and bzip2 are horrible compression algorithms. hehe.
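
    For anyone who wants to check offsets like these without a pi-search webpage, here is a rough sketch that computes decimal digits of pi with Machin's formula in plain integer arithmetic and searches them. The function names, the ten guard digits and the 20,000-digit search range are all choices made for this sketch, and it only covers base 10, not the base 2 case argued above.

```python
import sys

# Newer Python versions cap int -> str conversion at a few thousand digits.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

def arctan_inverse(x: int, unity: int) -> int:
    """arctan(1/x) scaled by `unity`, via the alternating Taylor series."""
    term = unity // x
    total = term
    x_squared = x * x
    divisor = 3
    sign = -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        sign = -sign
        divisor += 2
    return total

def pi_decimals(n: int) -> str:
    """First n digits of pi after the decimal point (Machin's formula)."""
    guard = 10  # extra digits to absorb truncation error
    unity = 10 ** (n + guard)
    pi_scaled = 4 * (4 * arctan_inverse(5, unity) - arctan_inverse(239, unity))
    return str(pi_scaled // 10 ** guard)[1:]  # drop the leading "3"

digits = pi_decimals(20_000)
target = "1234"
index = digits.find(target)
if index >= 0:
    position = index + 1  # counting from the first digit after the point
    print(f"{target!r} first occurs at position {position}; "
          f"the offset takes {len(str(position))} digits to write, "
          f"the string takes {len(target)}")
else:
    print(f"{target!r} not found in the first {len(digits)} digits")
```

    Trying a handful of other targets gives a feel for how often the offset needs at least as many digits as the string it is supposed to replace.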

  110. NO COMPRESSOR IS GENERIC!!! by moultano · · Score: 2, Insightful

    Guess what? It is IMPOSSIBLE to create a generic compression algorithm. Gzip operates by doing exactly what you mention: operating on a particular set of data: that being data with some exploitable redundancy. There are plenty of files that will get bigger when you give them to Gzip.

    Entropy coders work by making assumptions about the probability distribution of the data they receive. They assume they are working on a set of data in which certain types of data are more likely than others, so they store those more compactly, but as a result they HAVE to store others less compactly. No matter how you slice it, you can not store more than 2^n unique strings in n bits. The only gains you can make are by assuming that you aren't going to be dealing with all possible strings, and compacting the ones that you care about.

    That may have actually been what you meant, but I really didn't want anyone reading that to get the impression that there was something magical about entropy that made it a different approach than narrowing the set of data you are storing. The two are fundamentally the same thing.

    1. Re:NO COMPRESSOR IS GENERIC!!! by Austerity+Empowers · · Score: 1

      You can't compress random data. That's about as pedantic as is worth going in to without getting deep in to math.

  111. If 25x is what they supposedly accomplished... by RickBauls · · Score: 1

    What is the current max?

  112. Why does it have to be Linux? by jbplou · · Score: 1

    Imagine storing a terabyte of data on a single disk, and it all runs on Linux

    Why can't the same concept be used to compress on Mac, BSD, Windows, and Solaris?

  113. They might not be talking about single files by Junks+Jerzey · · Score: 1

    Sure, everyone knows there's no way to mash an arbitrary file down 25x. There's a trivial proof for that. But in this case it sounds like they're talking about 25x compression across multiple files. That is, if you store two identical files, then the second is a pointer to the first. If you have a bunch of jpegs, then you cat them all together into a new file (while keeping the originals around), the new file is super small. At least that's how I read the article.

  114. Re: proposed compression method by Anonymous Coward · · Score: 0

    To rephrase what I was getting into at the end: Small numbers are obvious exceptions. You could have said 2 and made the same point: "the string 2 occurs at position 6 counting from the first digit after the decimal point." ;-)

    The effective size of the "useful" search space is related to n/lgn. So I'm basically saying it's possible to conceive that there is some point after which there are no counterexamples, or where the counterexamples are so sparse that the probability of finding a matching counterexample in more than one transcendental number is unlikely: n/lgn eventually starts growing fast as n gets large, but when n is small, lgn is large compared to n, so the search space is unnecessarily restricted.

    If I set my level of significance to 64 digits, then you're going to have a heck of a time finding a counterexample >=64 digits long unless you can mathematically prove that such counterexamples exist. But why stop at 64 digits? That's not even worth compressing. Let's talk 2^20 or more digits! :)

    To further compound this, you'll also have to prove that the intersection of significant counterexamples in k different search spaces is non-empty; however, it's not even clear that you could hope to prove such with a 10-digit level of significance (before you run out and quote 0123456789 which occurs first at 17387594880, let me remind you that I mean base 2 and I've raised the ante and required that you show it for at least 4 transcendentals).

    In short, this is not a simple problem to prove/disprove.

  115. This Virtual Tape Library stuff by hort59 · · Score: 1

    This is a virtual tape library vendor. If you backup your database (email server, home directories, etc) 25 times to their system and the data doesn't change much their software will find identical blocks and eliminate the redundancy creating a 25x compression. There is another vendor or two that do the same thing.
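
    A toy version of that block-level de-duplication, as a sketch only: the vendor's actual method isn't described, so this assumes fixed 4 KiB blocks indexed by SHA-256, a 1 MiB random "database", and one changed block per nightly backup. The class and variable names are made up.

```python
import hashlib
import os

BLOCK = 4096  # assumed fixed block size

class BlockStore:
    """Stores each distinct block once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes):
        """Store data; return the list of digests needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), BLOCK):
            chunk = data[i:i + BLOCK]
            digest = hashlib.sha256(chunk).digest()
            self.blocks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe) -> bytes:
        return b"".join(self.blocks[d] for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = BlockStore()
database = bytearray(os.urandom(256 * BLOCK))   # 1 MiB of incompressible data
backups = []
logical_bytes = 0
for night in range(25):
    # One block changes between backups; everything else is identical.
    database[night * BLOCK:(night + 1) * BLOCK] = os.urandom(BLOCK)
    snapshot = bytes(database)
    backups.append((snapshot, store.put(snapshot)))
    logical_bytes += len(snapshot)

ratio = logical_bytes / store.stored_bytes()
print(f"{logical_bytes} bytes backed up, {store.stored_bytes()} stored, {ratio:.1f}x")
```

    With these numbers, 25 backups of 256 blocks end up as 280 unique blocks, so the ratio is 6400/280, a bit under 23x, even though the data itself is random and no conventional compressor could shrink a single copy. The per-backup digest lists are not counted here and would eat into that a little.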

  116. Linux has something related.. RZIP by Convergence · · Score: 2, Interesting
    Most compression programs use a very limited context. gzip cannot identify and exploit redundancy if it occurs more than 32kb or 64kb apart. bzip2 uses a blocksize of 900kb, and it too cannot identify redundancy more than 900kb apart. rzip however uses a context of 900MB, so it can exploit redundancy within a file, even if it occurs hundreds of megabytes apart.

    Although it's not for every file, sometimes this can be a huge win. In my case, backing up 60 versions of a 700kb XML file, I get 500:1 compression, 30 times better than what bzip2 gives me. Anytime you have a file where you know that it will have redundancy across more than 900kb, but less than 900MB, rzip can win big.

    It sounds like this company's program is a variation of this idea, designed with backups in mind to identify redundancy across tens or hundreds of gigabytes.
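
    The window effect is easy to see with Python's standard library. rzip itself isn't available there, so in this sketch lzma (with its default multi-megabyte dictionary) stands in for "a compressor with a long memory", and zlib, with its 32 KiB window, plays the part of gzip. The block sizes are arbitrary.

```python
import lzma
import os
import zlib

# Two copies of the same random block, close together and far apart.
near_block = os.urandom(10_000)     # repeat distance well inside 32 KiB
far_block = os.urandom(100_000)     # repeat distance well beyond 32 KiB

near = near_block + near_block
far = far_block + far_block

near_zlib = len(zlib.compress(near, 9))
far_zlib = len(zlib.compress(far, 9))
far_lzma = len(lzma.compress(far))

print(f"near copy, zlib: {len(near)} -> {near_zlib}")
print(f"far copy,  zlib: {len(far)} -> {far_zlib}")
print(f"far copy,  lzma: {len(far)} -> {far_lzma}")
```

    zlib halves the input when the copy is nearby and gets nothing when it is 100,000 bytes back, while the long-window compressor halves the far case too. Same data, same redundancy; the only difference is how far back the compressor can look.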

  117. Yeah, great work guys. by rice_burners_suck · · Score: 1
    ...has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt."

    I love how people make "claims" of stuff like this, and then there's never anything done already. It's like when you find a page that says:

    Spizpopd/OS is a real time interactive operating system with support for multicore multiprocessor multicomputer distributed grid computing, featuring complete support for desktop, embedded, and server systems. Its rock solid architecture provides a robust platform for safety-critical industrial, medical, military, and flight control systems, while providing support for all standards, including the ability to execute all Windows, Mac, UNIX, and Java applications with full binary compatibility and no need to recompile. Spizpopd/OS is coded entirely in hand-optimized assembly for extremely fast performance on all major processor families in production, and is distributed under the GNU General Public License with full source code available for all its features.
    And then you scroll down and it shows the most recent news posting was made in November of 1998, the only code is a semi-operational bootloader and nothing else has been written yet.

    Believe it or not, there's a ton of open source vaporware out there with fancy web descriptions like the one simulated above.

    1. Re:Yeah, great work guys. by Anonymous Coward · · Score: 0

      Sounds like the GNU Hurd!

  118. I'll believe it when I see it ... by mooncaine · · Score: 1

    ... running in Windows Whatever on my Intel Mac concurrently with OSX.

  119. What is this, stupid compression claim month? by mindstrm · · Score: 1

    Don't we go through this every year or so?

    - There is no such thing as a universal compression algorithm.

    - Compression algorithms are specific to the data they are compressing.

    Anyone claiming to be able to compress everything by a uniform amount is lying, period.

  120. That is NOT the fractal IFS system btw by Anonymous Coward · · Score: 0

    Just to make sure that people don't get the wrong idea here from your post, the snake oil that you refer to wasn't the fractal compression scheme by Iterated Function Systems.

    That company didn't survive, but their compression system works fine. I did some projects with it back at university.

    In a nutshell, it involved finding recursive algorithms to generate output (and storing just the coefficients of the equations), so the more fractal self-similarity that there was in a scene, the better the compression achieved. An image containing (for example) leaves or trees with their highly self-similar structure would achieve absolutely enormous compression, and in general, most things in nature would do reasonably well.

    As always, the amount of compression was highly dependent on the input, but there's nothing unusual in that.

  121. Not entirely true. by moultano · · Score: 1

    That's not entirely true though. You can compress random data, but only given two assumptions:

    The data has a non-uniform probability distribution.
    You know that distribution.

    The trick behind designing compression algorithms is coming up with intuitions about the probability distributions of useful classes of real data, and then coming up with computationally tractable ways of exploiting them.
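
    A small sketch of that point: random bits that are biased (here an assumed 10% chance of a 1, with a fixed seed) do compress, and the entropy of the known distribution says roughly how far. zlib is used as an off-the-shelf compressor; it is not tuned to this distribution, so it lands above the bound.

```python
import math
import random
import zlib

P_ONE = 0.1        # assumed, known probability of a 1 bit
N_BITS = 400_000

rng = random.Random(0)  # fixed seed so runs are repeatable
bits = [1 if rng.random() < P_ONE else 0 for _ in range(N_BITS)]

# Pack eight bits per byte, least significant bit first.
packed = bytes(
    sum(bit << j for j, bit in enumerate(bits[i:i + 8]))
    for i in range(0, N_BITS, 8)
)

# Entropy of a biased coin, in bits per coin flip.
entropy_per_bit = -(P_ONE * math.log2(P_ONE)
                    + (1 - P_ONE) * math.log2(1 - P_ONE))
bound_bytes = entropy_per_bit * N_BITS / 8
compressed = zlib.compress(packed, 9)

print(f"raw:           {len(packed)} bytes")
print(f"entropy bound: {bound_bytes:.0f} bytes ({entropy_per_bit:.3f} bits/bit)")
print(f"zlib:          {len(compressed)} bytes")
```

    A coder built around the known 90/10 split (an arithmetic coder, say) could get closer to the bound than a general-purpose tool does, which is the whole game.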

  122. Bell Labs solved this problem before... by Anonymous Coward · · Score: 0

    ..or at least built a system in which identical blocks of data would only ever be stored...once.

  123. Something to keep people guessing... by pimpsoftcom · · Score: 1

    Only 25%? Last year I figured out how to compress 1,000,000TB down to a floppy; They are so behind the times.

    Tip: Compression is fast, but decompression is very slow.

    And yes, I have mailed myself my notes as a form of prior art.

    --
    - d
  124. Which? by 4D6963 · · Score: 1

    That's great! But I'm wondering, if it can compress ANYTHING 25x, if I feed it with 1000110100111011101011101 will it give me a 1 or a 0? [/sarcasm]

    --
    You just got troll'd!
  125. Re:100X - 1000X by 4D6963 · · Score: 1
    You can easily achieve 25X compression with simple algorithms, but you need to keep cycling the output back through the input and the speed gets progressively worse

    Yeah, but imagine a beowulf cluster of these..

    Concievably, it you had enough time on your hands to you get almost anysize file down to just a few dozen bytes

    Actually, if you let it run recursively for about 257 years, you'll eventually shrink it to a couple of bytes, meaning that basically a fourth of all the files of the world can be compressed to this : 01

    I truly hope you were trying to be sarcastic tho.

    --
    You just got troll'd!
  126. Could be real by mattr · · Score: 1

    I don't know about these guys but the idea of 25x compression is not in itself a problem. Depends on your definitions, data, and time and computing resources. For example wavelet based "fractal" compression IIRC gave a 400:1 ratio for certain features, plus actually generating data to make zoomed in photos look realistic. Fractals and other functions can also be used to compress data losslessly if you have a hairy enough library and computer, from what I remember. But when they start talking about MP3s etc then it starts sounding like BS. And the post? What does "a terabyte on a disk" mean anyway? ++ to more slashdot meaningless posts.

  127. You geek! by thepotoo · · Score: 2, Funny

    Sheesh...when did you last get laid?

    --
    Obligatory Soundbite Catchphrase
    1. Re:You geek! by Anonymous Coward · · Score: 2, Informative

      Last time he was at your mom's house

  128. Re: proposed compression method by schmink182 · · Score: 1
    In short, this is not a simple problem to prove/disprove.

    There is no bijective mapping from any finite set to a smaller finite set. QED. The only way to create a good compression scheme is to restrict the domain of "likely" strings; exploit relative frequencies. Although it's creative and amusing, your compression scheme cannot work in general, and I'd expect it to actually inflate file sizes significantly in general.

  129. Information theory. by rew · · Score: 1

    There is a field of science called information theory. It studies "information content" and related topics such as data transmission and error-correcting codes.

    If I have a 10Mbyte file, it usually contains way less than 80 million bits of information. So, compression programs like "zip" and "gzip" can make the file smaller.

    The theoretical limit however is the actual information content. Suppose an information theorist analyses your file and concludes that your file contains 40 million bits, then gzip or any other compression program will have a hard time compressing the file beyond that. (unless the compression program "cheats" and compresses the file as: "Rogers 10Mb file #1", and has the original file elsewhere)

    Now, in practice I have a 440Mb spam-archive which compresses to 108Mb. This is only a factor of 4. If you realize that most spams are delivered tens of times, it must be possible to do a lot better. So if someone claims to be able to compress my spam mailbox a lot better, I can believe them.

    Information content in mp3's and images is near 100%. If anybody claims to be able to compress more than 20% out of one of these, they are full of crap on theoretical grounds.
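
    One way to see why a general-purpose compressor leaves duplicate deliveries on the table: zlib (the DEFLATE algorithm behind gzip) only looks back 32 KB for repeats. The sketch below is an illustration under stated assumptions, not a measurement of any real mailbox: `os.urandom` stands in for an already-dense message body, and the 8-byte pointer size is a made-up figure.

```python
import os
import zlib

# Stand-in for one large, already-dense message body (an assumption for
# illustration; real spam is text and would compress somewhat on its own).
message = os.urandom(200_000)
copies = 10
mailbox = message * copies          # the same message delivered 10 times

# zlib's match window is 32 KB, so copies that sit 200 KB apart are invisible to it.
zlib_size = len(zlib.compress(mailbox, 9))
zlib_ratio = len(mailbox) / zlib_size

# De-duplication: keep one copy plus a small pointer per delivery.
POINTER_BYTES = 8                   # hypothetical pointer size
dedup_size = len(message) + copies * POINTER_BYTES
dedup_ratio = len(mailbox) / dedup_size

print(f"zlib : {zlib_ratio:.2f}x")
print(f"dedup: {dedup_ratio:.2f}x")
```

    The information content of the mailbox is roughly one message, so a scheme that can see duplicates at any distance gets close to that limit while the windowed compressor does not.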

    1. Re:Information theory. by TERdON · · Score: 1

      I have MP3s which are merely renditions of 8 kB MOD files (similar to MIDI). That makes for a (de)compression of several 100 times. It's quite possible that even other songs, in the format they were created, could be just as small. Just make sure not to let them contain any lyrics (ie singing - which necessarily has to be sampled) and keep to making pure electronica and similar stuff.

      Of course, this won't apply to music you don't have the "source code" for. :O)

      --
      I have a really elegant proof for Fermat's last theorem. If this sig was only a bit longer...
  130. Re: proposed compression method by poopdeville · · Score: 1
    To fully disprove the method, you'd have to show that there are counterexamples in an arbitrary number of transcendental numbers.

    Or just use the pigeon-hole principle...

    Easy as pie. Suppose that there is an algorithm that can (reversibly) compress any string of n bits to a string of n-1 bits. There are 2^n strings of n bits. There are 2^(n-1) strings of n-1 bits. No function from a set of 2^n elements to a set of 2^(n-1) elements is injective, so at least two different inputs share an output and the compression cannot be reversed. Contradiction. If you really must, proceed by induction to prove cases where the algorithm maps from sets of cardinality 2^n to sets of cardinality 2^(n-j).

    The same reasoning explains why you can't make a constant size cryptographic hash function that never repeats itself.
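
    The counting argument can be checked by brute force for small n. This is a minimal sketch; `drop_last_bit` is a hypothetical stand-in for "a compressor that always saves one bit", and any other candidate function can be passed in its place.

```python
from itertools import product

def find_collision(compress, n):
    """Return two distinct n-bit strings with the same compressed output, or None."""
    seen = {}
    for bits in product("01", repeat=n):
        s = "".join(bits)
        out = compress(s)
        if out in seen:
            return seen[out], s
        seen[out] = s
    return None

def drop_last_bit(s):
    """Hypothetical 'compressor' that maps n bits to n-1 bits."""
    return s[:-1]

n = 8
inputs = 2 ** n                               # strings of exactly n bits
shorter = sum(2 ** k for k in range(n))       # all strings of fewer than n bits
collision = find_collision(drop_last_bit, n)

print(inputs, "inputs but only", shorter, "shorter strings")
print("collision:", collision)
```

    Since there are always fewer shorter strings than inputs, `find_collision` must succeed for any function that shortens every n-bit input.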

    --
    After all, I am strangely colored.
  131. de-duplication and diffing by penguin-collective · · Score: 1

    Both de-duplication and diffing at the file system level are useful. If done intelligently, they could probably save lots of space on a standard Linux or Windows file system. Of course, they are nothing new; the reason they aren't in the file systems of today is mostly that it's hard to implement them sufficiently efficiently; right now, file system authors are still struggling with just keeping their various tables and data structures consistent.

  132. de-duplicating data backup by penguin-collective · · Score: 1

    By the way, if you want a de-duplicating data backup solution, there are a bunch of them around; faubackup is a simple example.

  133. Write any number with N+1 digits with N digits? by Anonymous Coward · · Score: 0

    given a sufficiently large N?

  134. What the... ? by LesPaul75 · · Score: 1
    Slow down there... There's either some unintentional errors in your post, or you've missed something in your study of compression algorithms.
    Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.
    Well, it's not quite "nothing more than" a lookup table... There's definitely more to it than that. It's not a one-to-one substitution like the cryptoquote puzzles. Otherwise there would be no compression. And there can be more than one way to represent the uncompressed data in compressed form, so decoding is actually a many-to-one mapping. Take RLE (run-length encoding), for example. The simplest form is just a single byte that represents the length of the run, followed by the repeated character. Now suppose the data that you want to compress is the ASCII text message: "HI!!!!!!!!" So, the encoded data would be (in hex) 01 48 01 49 08 21. Instead of ten bytes, you have six. They are a representation of the data that says "One letter H, then one letter I, and then eight exclamation points." But you could have just as easily compressed it as "One letter H, then one letter I, then five exclamation points, and then three exclamation points." And obviously, RLE is just about the simplest compression around. Take a look at PNG compression... It is extremely complex, and lossless, and certainly quite a bit more than a lookup table.
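
    A minimal sketch of that byte-oriented RLE (count byte, then value byte), assuming runs are capped at 255. It also shows that a second, longer encoding decodes to the same message.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (run length, byte value) pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    """Expand (run length, byte value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)

message = b"HI" + b"!" * 8
encoded = rle_encode(message)
print(encoded.hex(" "))                       # 01 48 01 49 08 21

# A different valid encoding of the same message: runs of five and three.
alternative = bytes.fromhex("01 48 01 49 05 21 03 21")
print(rle_decode(alternative) == rle_decode(encoded))
```
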
    You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text... a look-up table optimized for written English...
    Huh? No one really compresses text with a "compression algorithm for text." Maybe they used to at some point... I don't know. But modern compression algorithms are smart enough to work on lots of different types of data with high efficiency on each. Text compresses nicely simply because it wastes a lot of bits by its nature, not because the compression algorithm knows what "text" is and knows what words are used frequently, or whatever you're suggesting there. A compression algorithm that was optimized for English would be pretty much useless, for many reasons, e.g. bad spellers, jargon, other languages, etc...
    You only need six bytes to store most of the frequently-used characters in text
    Six bytes? ASCII characters fit in a single byte (ASCII itself defines 128 characters, and a byte can hold 256 values). "Wide" characters are two bytes, which is enough for most of Unicode (a single UTF-16 code unit covers the Basic Multilingual Plane), and that handles just about every written language on Earth. Maybe you meant "six bits" instead of "six bytes," which would be enough for 64 characters, but that wouldn't cover anything more than upper case (A-Z), lower case (a-z), numbers (0-9), and two other characters (maybe space and period?). Not very useful.
    If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good.
    What? Why not? If you had an algorithm that could achieve 100:1 compression on most data, you'd be a very rich person, no matter how much disk space the software occupied. The size of the software itself is a one-time cost. The compression savings that you get as a result is a recurring savings, indefinitely. I mean, imagine an extreme case, where the compression software itself is even too big to fit on a hard disk. Say it's like a terabyte, but it gives you 100:1 compression. No problem, install the software on a massive server somewhere, then everyone can upload their files to the server and download the resulting files which are 100 times smaller. Every end user in the world gets to store 100x more crap on their PC, at the cost of adding a few extra hard drives to a server somewhere, plus network bandwidth.
  135. it's a CVS!! by TheLoneCabbage · · Score: 3, Informative

    This is a backup system, not a single-file compression (although for framed data like video, email, etc. the compression scheme is still clever).

    Basically it's a CVS. If you're backing up multiple computers or user directories, you're going to see tons of repeated files; heck, they'll even have the same name. Saving the diffs is a good idea, and not at all difficult to duplicate.

    For instance, what if you were doing backups for a team of animators? Their files are HUGE, but 90% of the frames will be identical between the individual systems. (Indeed, the frames will likely be very similar to one another.) You could get far more than 25x compression that way. The big downside of this idea is the memory & CPU vs. speed trade-off. You can't use this kind of system to back up to a tape or DVD system; it needs random access media.

    You could probably get nearly the same results by hacking rsync and diffing identical file names in different directories. Possible bonus for diffing files of similar file type.

    It's a clever idea, not a radical new technology.

  136. Version tracking? Noting redundant files? by Omniscient+Ferret · · Score: 1

    If you tracked deltas within files, you could look to xdelta as a filesystem, or possibly CVS.

    If you were just tracking changed files, you could look to Plan 9 filesystem or Dirvish.

    What might be up: Picture backing up a number of fairly similar machines (say, a group of Windows machines built from a common image), & noting duplicated files, only saving each once. You could count the space saved by a link as compression. If you have a homogeneous sample, you save lots of space & claim ridiculous compression.
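
    A rough sketch of that accounting, assuming whole-file de-duplication keyed on a SHA-256 of the contents. The machine names, paths and sizes are invented for the example.

```python
import hashlib

def dedup_ratio(machines):
    """machines maps machine name -> {path: file contents as bytes}.

    Returns logical bytes backed up divided by bytes actually stored
    when each distinct file content is kept only once.
    """
    logical = 0
    unique = {}                                   # digest -> size of that content
    for files in machines.values():
        for content in files.values():
            logical += len(content)
            unique[hashlib.sha256(content).hexdigest()] = len(content)
    return logical / sum(unique.values())

# Ten hypothetical machines built from one image, each with a small private file.
os_image = b"\x00" * 1000
machines = {
    f"pc{i}": {
        "C:/windows/system.dll": os_image,
        "C:/users/me/notes.txt": f"user{i}".encode().ljust(10, b"."),
    }
    for i in range(10)
}

ratio = dedup_ratio(machines)
print(f"{ratio:.1f}x")
```

    The more homogeneous the sample, the closer the ratio gets to the number of machines, which is how a headline figure can grow without any clever coding of the data itself.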

  137. Store on single terabyte disk.... by soccrates · · Score: 1

    Wow... storing a terabyte of data on a single disk !! All you need is a terabyte disk - ground breaking.

  138. Re:100X - 1000X by Hugo+Graffiti · · Score: 1
    There are fundamental mathematical limits to the amount you can compress data in a lossless format

    Try this: write a program to output to a file the integers from 1 to a million using some universal code. Then try compressing the file using (eg) gzip. I bet that comes close to what you refer to as the mathematical limit. But as you can see it's nowhere near. The program itself is the optimum compression.

    So really, it all depends on how much structure is inherent in the data and how easy it is to detect that structure.
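
    The experiment is easy to try. In this sketch the "universal code" is simply decimal text with newlines (an assumption, since the post does not name one), zlib stands in for gzip, and the generating expression plays the role of the compressed file, with `eval` as the decompressor.

```python
import zlib

# The "compressed" form: a short expression that regenerates the whole file.
SOURCE = r'"\n".join(str(i) for i in range(1, 1000001))'

data = eval(SOURCE).encode()          # the file: integers 1..1,000,000 as text
compressed = zlib.compress(data)      # DEFLATE, the same algorithm gzip uses

print("original bytes :", len(data))
print("zlib bytes     :", len(compressed))
print("program bytes  :", len(SOURCE))
```

    zlib only sees repeated substrings within a sliding window; it has no way to discover that the file is a counter, which is exactly the kind of structure the generating program captures.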

  139. good... but not really that useful by nithu · · Score: 1

    Great... but I don't think home-users would need this.. most have a lot of space left on their hard disks even after storing everything they like... however, media companies might find this very useful... -- http://www.kudige.blogspot.com/

  140. Terabyte on a disk: Old news by Lars+Clausen · · Score: 1

    I can easily imagine having a terabyte on a single disk. In fact, LaCie, Hitachi and Seagate already sell such, among others. Disks are cheap and getting cheaper. Flash memory is more expensive, but getting cheaper even faster. I'm waiting for when the savings in mechanical breakdowns, power, heat and space makes flash memory more economical for petabyte storage than tapes and harddisks. Mechanical storage is for wusses:)

    -Lars

  141. Re: parent miscounts the number of pigeonholes by Anonymous Coward · · Score: 0

    As the input gets large, there are exponentially more pigeon holes than pigeons. Read on if that doesn't make sense. I'll explain!

    We're essentially dealing with a pseudorandom sequence where all inputs are equally likely, and we can define an arbitrary origin within the sequence to create a new pseudorandom seed. The chance of seeing some given binary pattern are 1/2^length; this is why you can find small counterexamples: the length of the number is close to the number itself, so the probability of finding the number within the pseudorandom sequence up to 2^length from the offset is rather low (hit or miss). However, this changes when N/lgN gets large.

    Let's suppose I consult /dev/random and get a string of 2^31 bits (256 MegaBytes if encoded RAW). If I get *really* lucky and find that it exactly matches the 0 offset, then I can encode the length in only 32 bits (4 bytes), and I'd have 64 million:1 compression. However, if I don't get so lucky then I'll have to search a bit for that string. If I find its start at 2^31 bits, not to worry: I encode its offset in 32 bits and its length in 32 bits, and my compression ratio is 32million:1. Using the pigeonhole principle, I can search until the 2^(2^31)- 2^31 - 8th bit and still maintain a 1:1 compression ratio. Let me rephrase that: I can search roughly 2^2147483647 potential pigeon holes before I have to put more than one pigeon in a hole.

    Note: the pigeons are only 2^31 bits large in this example, so we could just say 2^2147483616 holes and ignore potential matches that aren't aligned to 2^31 bits for now. Since the random probability of the transcendental pseudorandom generator (pi) matching a string of 2^31 bits is roughly 1:2^2147483648, and there are only 2^2147483616 holes, one could argue that there could be up to 32 pigeons per hole -- or rather that there's only a 3% chance that we'd see a compression ratio better than 1:1. If we consider the unaligned matches, naive counting of the first 32 offsets pushes our statistical odds back to ~100% (note: 100% in this case does not mean it's guaranteed to happen; rather that on average it's likely to happen N times for N numbers).

    In short, I'd argue that it's statistically almost guaranteed that you could get at least a 1:1 compression ratio, and it's statistically "likely" that you could get a compression ratio on the order of 1,000,000x for any random string over 256Megabytes. And if pi:0 doesn't give you a good enough compression ratio for your given input, pick a different transcendental or pick some other origin within pi and report back. :)

    p.s. Remember the first post where I said finding the answer would require a prophet? There is no way to search pi through 2^2147483647 digits without an oracle. [HINT: You might find all the written works created by humans, plus all the works ever created by monkeys throwing feces at a typewriter before you find the value you're looking for.]

  142. REAL compression algorithm by RenHoek · · Score: 1

    There are ways to really compress any type of data though.

    Take this number 141592653589793238462643383279. I can compress it very well:
    [the first 30 digits of pi after the comma]

    Ok, not so much compression there. But let's keep in mind that the definition of pi says that there are no repeating sequences in pi. This also means that ANY sequence can be found in pi. That means that the ISO from Vista is hidden somewhere in that sequence. The problem is knowing where.

    Say that I have a databank here at Compression-U Inc. This mighty database holds the number pi up to bazzillion digits. Via a very easy and quick algorithm I can find certain sequences in that number. Now let me find the ISO for Vista for you.

    [4.2gb of digits, starting from digit 2^383715-1 of pi after the comma]

    There you go. I might even include the formula of how to calculate pi with it, and still retain amazing compression.

    So in short, it _can_ be done, it's just very time intensive.

    1. Re:REAL compression algorithm by PigleT · · Score: 1

      > Take this number 141592653589793238462643383279. I can compresses it very well:
      > [the first 30 digits of pi after the comma]

      zsh% echo '[the first 30 digits of pi after the comma]' | wc -c
                  44

      ;) Otherwise an interesting idea though. Could come in handy for crypto headers signing email too - succinct, easily represented in ascii, requires sender to do a lot of work (shame about the cost to verify, though).

      --
      ~Tim
      --
      .|` Clouds cross the black moonlight,
      Rushing on down to the circle of the turn
    2. Re:REAL compression algorithm by Anonymous Coward · · Score: 0

      I already made the same suggestion! :-)

      I originally posted it as a joke, but some moderator gave me an overrated, so the post sits at "Score: 0, Interesting". :(

  143. compressing encrypted data by gr8dude · · Score: 1

    Well-encrypted data should look absolutely random. In that case, there are no patterns, hence compression algorithms won't be able to compress anything. Try to compress an encrypted file, and you will get something greater in size [the overhead used by the compression tool].

    Note: applies to _well_-encrypted data
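
    This is quick to check. The sketch below does not perform any encryption; `os.urandom` is used as a stand-in for well-encrypted output, on the assumption that good ciphertext is indistinguishable from random bytes.

```python
import os
import zlib

ciphertext_like = os.urandom(1_000_000)     # stand-in for well-encrypted data
compressed = zlib.compress(ciphertext_like, 9)

overhead = len(compressed) - len(ciphertext_like)
print("input bytes :", len(ciphertext_like))
print("output bytes:", len(compressed))
print("difference  :", overhead)
```

    The difference is the container and block overhead the compressor adds when it finds nothing to exploit.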

    1. Re:compressing encrypted data by fbjon · · Score: 1

      A well-designed compression algorithm would break the encryption, compress the redundancies, and re-encrypt.

      --
      True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
  144. Compression - Lossless - Proven by MichaelHH · · Score: 2

    This is my personal White Paper on lossless compression. Note this is no joke thread like that newb who posted he can get his to 1 bit. I affect random binary data. It achieves approximately, per cycle, an 81% remaining size of the original file. I theorize the end limit to the size is in a range of 10 bytes to 10 kb. It will be different for every file type. Note this is an Excel file that has been RARed. It was too big to upload normally. It is MEMORY intensive. I would prefer not to do it in Excel except I lack the proper software to replace that crappy program. Here is the link to my website: http://www.security1.free2host.net/Compress.php

    --
    I am ready for the big jump in life, who will jump with me?
  145. Looks like a port of Venti by rdebath · · Score: 1
    http://cm.bell-labs.com/sys/doc/venti.html

    Of course the amount of 'compression' you get is firmly under YMMV.

  146. ProTracker by Schraegstrichpunkt · · Score: 1

    It looks like somebody discovered the immense data storage capability of ProTracker modules and PostScripts...

  147. It's the Commonality, Stupid by robathome · · Score: 1

    What these guys are doing is not compression. It's commonality factoring. No piece of data is ever stored more than once. Typically, this is done by a hashing algorithm that starts at a high level and indexes everything down to a discrete block size of a few K.

    Each block gets an index checksum, then each file, each subdirectory, each parent directory, and so forth until the entire disk volume has a cumulative hash. Then, it's very easy to determine a) what has changed (and where), and b) what has been seen before.

    When a backup starts, the client compares the volume hash signature to that on the backup system. If it matches - nothing has changed. Backup over. If it doesn't, then you walk the indexes to find out exactly what has changed, and then only prepare to send those dirs, files, or discrete blocks - whatever's the smallest object that expresses the delta. When those objects are queued to send to the repository, the client first generates a hash of the object and asks the repository if it's seen it before. If not, it sends the index and the data. If so, it sends nothing, since the repository's already got that particular chunk stored somewhere. There's some re-hashing and index reverification on the other end to make sure that all is consistent.

    Therefore, each backup appears as a "full" backup, not a file-level diff, since the entire image is comprised of a map of every object in the volume. In reality, each backup is a set of pointers into a hashed data store (commonly called a CAS, or Content Addressed Store) from which it is reconstituted as needed.

    Having tested and deployed one of these types of systems, I can say that a) it's great for desktops, where most of the data between boxes is identical - the OS, the core apps, etc, and only the user data and localization is different, and b) it's awful for pre-compressed data like streaming audio, video, JPEGs, PDFs, etc. Since compressed data is entropic data, there can be no commonality within the file or versions of it, unless the file itself is identical and present from multiple sources. Change one byte of a file and recompress it, and all the blocks are unique.
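
    A minimal sketch of the content-addressed idea described above, assuming fixed 4 KB blocks and SHA-256 digests. The class and method names are invented for illustration; a real product would add the directory and volume hash tree, verification, and persistence.

```python
import hashlib

class ContentAddressedStore:
    """Toy CAS: each distinct block is stored once, keyed by its digest."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}                      # digest -> block bytes

    def put(self, data: bytes) -> list:
        """Store data and return its recipe: the ordered list of block digests."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # only new blocks cost space
            recipe.append(digest)
        return recipe

    def get(self, recipe) -> bytes:
        """Reconstitute a backup from its recipe of pointers."""
        return b"".join(self.blocks[d] for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = ContentAddressedStore()
day1 = b"".join(bytes([i]) * 4096 for i in range(40))       # 40 distinct blocks
day2 = day1[:7 * 4096] + b"\xff" * 4096 + day1[8 * 4096:]   # one block changed

recipe1 = store.put(day1)
recipe2 = store.put(day2)
print("logical bytes:", len(day1) + len(day2))
print("stored bytes :", store.stored_bytes())
```

    Both recipes look like full backups, yet the second one added only a single new block to the store.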

    However, this is not new. Giggle Avamar Technologies and Arsenal Digital. BTW - this tech is pretty good for remote backups over low-bandwidth links, since it vastly reduces the amount of data that needs to traverse the wire.

    --

    At 3 A.M. you can see people's auras; at five you can see their contrails...
    1. Re:It's the Commonality, Stupid by Knights+who+say+'INT · · Score: 1

      Precisely. That's what Sorcerer, the Linux distribution, does.

  148. I have installed this product multiple times . . . by ccGecko · · Score: 1

    . . . so I might be able to clear up some confusion. The word 'compression' is probably not the right choice. 'De-duplication' is probably a better word. Try this: "ProtecTIER can achieve a 25:1 de-duplication ratio." That sounds more accurate to me. Currently it works as a virtual tape engine. Take 10+ TB of disk and attach to a Linux server (x86_64 only). ProtecTIER makes that disk look like a tape library and tape drives filled with tape cartridges for use by an enterprise backup system like Veritas NetBackup, IBM/Tivoli TSM, Legato NetWorker, etc. Most large companies today use a pretty similar backup strategy: Fulls once a week, incrementals the other days; weekly fulls are kept for 2-8 weeks, 'monthly' fulls are kept 2-6 months, daily incrementals are kept for 7-21 days. Depending on the retentions chosen, that's 10-30 or more copies of the same data, plus the maybe 5-10% that actually changed. ProtecTIER gets the 25:1 ratio by eliminating the redundant copies.

    The algorithm is pretty elegant, actually. It holds a meta data index in RAM. As data comes in (at rates up to 200MB/s) it looks for a similar data set already stored. It reads the old data in, does a diff against the new data, stores the unique data untouched and uses pointers to refer to the duplicate data. With this method even if the system is completely wrong about which existing data set to match with, the data will be safely stored (with a low de-duplication ratio in this instance).
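
    The "diff against similar old data, keep unique bytes, point at the rest" step can be sketched with Python's `difflib`. This is only an illustration of the general idea, not the product's actual algorithm; the function names and the sample data are made up.

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes) -> list:
    """Describe new as copies from old plus literal bytes not found there."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # pointer into already-stored data
        elif j2 > j1:
            ops.append(("data", new[j1:j2]))    # unique data, stored untouched
    return ops

def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new data from the old data and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"weekly full backup of the payroll database, revision 41\n" * 20
new = old.replace(b"revision 41", b"revision 42", 1)

delta = make_delta(old, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print("new data size :", len(new))
print("literal bytes :", literal_bytes)
```

    If the wrong old data set is chosen for comparison, the delta just degenerates into mostly literal data: the result is still correct, only the ratio suffers, which matches the safety property described above.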

    Yes, the product works as advertised. If you don't have several terabytes of data to protect in an enterprise environment, it's probably not for you. But, if you do have a large environment and are tired of dealing with tape, this product rocks.

  149. Re:100X - 1000X by Anonymous Coward · · Score: 0

    Try this: write a program to output to a file the integers from 1 to a million using some universal code. Then try compressing the file using (eg) gzip. I bet that comes close to what you refer to as the mathematical limit. But as you can see it's nowhere near. The program itself is the optimum compression.

    Some search strings for you to try: "Claude Shannon" "information theory" "Shannon limit" "Lempel-Ziv compression"

    So really, it all depends on how much structure is inherent in the data and how easy it is to detect that structure.

    Yes, it depends on the characteristics of the data, but no a priori knowledge of possible structures is assumed. You can come up with all kinds of ways to generate randomness, but your compression algorithm would need a lot of overhead to be able to utilize all of them. Real data is also unlikely to perfectly match any given type of specific randomness, so now you'll have to add complexity to the algorithm if you want to make use of these structures; you'll need to figure out how to optimally correlate data segments with known structures or slight offsets from these structures. At some point, you'll use more data for the overhead that describes each segment's structure type than you had in source data.

    The situation you describe works for very specific cases, but it isn't particularly useful in reality.

  150. Deduplication and compression by bbiles · · Score: 1

    OK, my bad.

    Diligent is not using the term compression AFAIK, but neither are they really deploying this approach yet outside of initial testbeds. Data Domain has been selling a product like this for years, has hundreds of happy customers using it and more than a thousand units in the field. And we came up with a brand, Global CompressionTM, in 2003 to mean the combination of finding long sequences and storing them uniquely across many TB's of stored data (see below) + traditional LZ-style compression.

    We sell our system only as a target for backup data, which is extremely redundant. On a first full, we tend to see 2x-4x compression effect. Subsequent file incrementals, 6x-8x. Subsequent fulls, 50x-60x. Aggregate compression effect across a couple months of retention tends toward 20x in a weekly full / daily incremental policy. Exchange or Oracle fulls-daily can be 50x, short retention can be 10x. Mileage varies especially by backup policy, but also (within the 2x factor) by data type. And as mentioned in the postings, the challenge is to get it to go fast; our implementation does this. Early alternatives, such as the Venti filesystem in Plan 9, don't.
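
    How the per-backup figures blend into an aggregate can be sketched with simple arithmetic. The midpoints (3x, 7x, 55x) are taken from the ranges quoted above; the backup sizes, the six incrementals per week, and the function itself are assumptions for illustration, not vendor data.

```python
def aggregate_ratio(weeks, full_size, incr_size,
                    first_full=3.0, later_full=55.0, incremental=7.0):
    """Logical bytes retained divided by bytes stored, for a weekly-full,
    daily-incremental policy with `weeks` of retention."""
    incrementals = weeks * 6                       # six incrementals per week
    logical = weeks * full_size + incrementals * incr_size
    stored = (full_size / first_full               # the very first full
              + (weeks - 1) * full_size / later_full
              + incrementals * incr_size / incremental)
    return logical / stored

# Hypothetical site: 1000 GB fulls, 50 GB daily incrementals.
short_retention = aggregate_ratio(weeks=8, full_size=1000, incr_size=50)
long_retention = aggregate_ratio(weeks=26, full_size=1000, incr_size=50)
print(f"8 weeks : {short_retention:.1f}x")
print(f"26 weeks: {long_retention:.1f}x")
```

    Under these assumptions the aggregate climbs with retention, because every additional full is almost entirely redundant with the ones already stored.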

    Should it be called compression? In lieu of a better term, at least compression is descriptive to a user -- the effect is to compress the backup data. In network equipment they call this technology Wide Dictionary Compression, but it has a half dozen other names. The mechanism of finding a sequence and referring to the original the next time it comes up is pretty much the same as traditional compression, it's just harder to put into silicon because of the size of the referencing window. But it wasn't anticipated by the seminal compression papers many years ago, so there's some debate. In storage, lately, it's starting to get called Deduplication, despite the existing use of that term in databases, and despite another half-dozen vendor terms. Examples of alternatives include capacity optimization, factoring, data coalescence and sequence reduction. It's only starting to settle down.

    Full disclosure: I was at VA Linux in the team that acquired Andover, thus Slashdot, back in the day. Hope that worked out OK.

  151. Already exists by gweihir · · Score: 1

    I believe Hamilton 95 had sub-Heisenberg compression a long time ago. Sub-Heisenberg compression can be used iteratively to compress compressed data again, using irrational numbers and advanced quantum mechanics. It could store the whole OS in 1 bit! You can still get this great software by FTP to 127.0.0.1.

    These people are just offering a cheap rip-off that is limited to 25x compression. Don't be fooled!

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  152. Sorry, but you still don't win. by Anonymous Coward · · Score: 0

    ... In fact, you not only don't win, you don't even break even.

        And what's more, you can't even get out of the game.

        Listen, there really is /no/ way around Shannon and the laws of thermodynamics.

    "This mighty database holds the number pi up to bazzillion digits. Via a very easy and quick algorithm I can find certainly sequences in that number. Now let me find the ISO for Vista for you.

    [4.2gb of digits"

        Stop there for a minute.

        Everything you've said is true, but let's just consider how big "a bazzillion" has to be before it would stand a reasonable chance of containing any given 4.2gb sequence of bytes. After all, if your database holds 4.2 billion digits of pi, then you'll only have one sequence that long to offer to compress for people. You have two possible sequences of length (4.2billion minus one) and three possible sequences of length (4.2billion minus two) and four possible sequences of length (4.2billion minus three) and ... I think you see the pattern.

        So how big would your database have to be in order for it to have a (for example) 50-50 chance of being able to represent any given 4.2gb of data?

        Umm, the maths is beyond me actually, but it's going to be somewhere of the order of 4.2 billion factorial.

        You haven't beaten the rules. Sure, with a database of size 'N', you can get great compression ratios on those subset of the strings of length less than 'N' that occur in your database. But that's only a tiny tiny tiny fraction of all possible strings of length less than 'N'. All those other ones get longer. The incredible, huge, massive, ginormous compression ratio that you get on the (tiny tiny tiny fraction of) strings that you can compress is /exactly/ balanced out by the small increases in size that get applied to the (vast vast vast majority of) strings that you can't compress.

        Beyond that, there's also the problem that, as you say, pi contains no repeating sequences. If the data you want to compress is a repeating sequence, pi isn't gonna help any.
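
        The counting behind this can be made concrete. A database of N digits contains at most N-k+1 windows of length k, while there are 10^k possible k-digit strings, so an upper bound on coverage is easy to compute. This sketch assumes decimal digits throughout; the function names are invented for the example.

```python
def max_fraction_representable(db_digits: int, k: int) -> float:
    """Upper bound on the fraction of all k-digit strings that can occur
    somewhere in a database of db_digits digits."""
    windows = max(db_digits - k + 1, 0)
    return min(1.0, windows / 10 ** k)

def offset_digits(db_digits: int) -> int:
    """Decimal digits needed to write any offset into the database."""
    return len(str(db_digits - 1))

k = 6
for db in (10 ** 3, 10 ** 6, 10 ** 9):
    print(f"db={db:>10}  coverage<={max_fraction_representable(db, k):.6f}"
          f"  offset needs {offset_digits(db)} digits")
```

        To have any hope of covering every k-digit string the database needs on the order of 10^k digits, and then the offset alone takes about k digits to write down, so nothing has been saved on average.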

  153. Re:right. sure. by petermgreen · · Score: 1

    My experience is that bzip2 is great for things like source archives but on a lot of other data rar beats it.

    --
    note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  154. Compression algos can't compress all data streams by silverdirk · · Score: 1
    SilverDirk's Theory of Compression: (not that someone else probably hasn't said the same thing...)

    For every compression algorithm, there exists a data stream which, when "compressed", will actually grow in size.

    This is pretty easy to prove. Compression maps a string of N bytes to a string of M bytes. If you consider that there are 256^N strings of that length, and (-1+256^N)/255 strings shorter than N, then there will have to be strings which when "compressed", stay the same size. Moreover, an algorithm also needs to map strings smaller than N into the same space, and you can't have collisions if you expect to restore the original string via decompression. This means that some strings will need to get bigger.

    If you look at it from the other side, a compressed string of M bytes can only represent 256^M possible uncompressed strings. For an example of M=1, you could design an algorithm that compresses the entire works of William Shakespeare, and 254 other works, into a single byte [1], but it would only be useful for those 255 data sets. Any other data set would need more bytes, and an additional byte at the start to indicate it was not one of the stock data sets.

    The point is that a compression algorithm is only useful for the kinds of data it was designed for. Most compression is designed for data with repeating patterns. Data without patterns cannot be compressed by these algorithms. If they claim 25x compression, then they need to tell us which specific kinds of data they expect it to work with, because there exist many files which would get an actual ratio of 0.9999~.

    [1] Shakespeare et al. Compression Algorithm: If the first byte is 0, return the entire works of Shakespeare. If it is in {1..254}, return Work{1..254}. If it is 255, read all following bytes as the entire contents of the original file.
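
    The footnote's scheme is small enough to write out. In this sketch the stock works are placeholder strings (the real texts are obviously not included), and byte 255 is the escape marker as described.

```python
# Placeholder stock data sets: index 0 is "Shakespeare", 1..254 are other works.
STOCK = [b"<complete works of Shakespeare>"] + [
    f"<work {i}>".encode() for i in range(1, 255)
]
ESCAPE = 255

def compress(data: bytes) -> bytes:
    """One byte for a stock work, otherwise escape byte plus the raw data."""
    if data in STOCK:
        return bytes([STOCK.index(data)])
    return bytes([ESCAPE]) + data

def decompress(blob: bytes) -> bytes:
    if blob[0] == ESCAPE:
        return blob[1:]
    return STOCK[blob[0]]

shakespeare = STOCK[0]
other = b"any file that is not one of the stock works"

print(len(shakespeare), "->", len(compress(shakespeare)))
print(len(other), "->", len(compress(other)))
```

    Every input outside the 255 stock works comes out exactly one byte longer, which is the "some strings must grow" half of the argument.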

    --
    Mark of the Coder fades from you. You perform Opening on World of Warcraft. Warcraft crits GPA for 4. GPA dies.
  155. Re: parent miscounts the number of pigeonholes by poopdeville · · Score: 1

    I already proved that it won't work. Your argument is obviously flawed.

    --
    After all, I am strangely colored.
  156. Re: sticks and stones... by Anonymous Coward · · Score: 0

    poopdeville wrote:
    > I already proved that it won't work. Your argument is obviously flawed.

    Yeah, but I proved that your proof is flawed, so nyyyyyyaaah. :P
    Seriously though, please read this and respond to anything you disagree with.

    Step 1:
    Input = N bits. Therefore, we're permitted N bits for output before compression ratio is over 1. That means in loose terms we can scan through 2^N bits of pi looking for N bits, except we need to reserve space for the size of the input and some sort of delimiter, so really we're only allowed to search through 2^(N-lgN-k) bits of pi looking for N bits. (* It might be more practical to assume that k is O(lglgN) or even O(lgN), but that won't change our calculations.)

    Step 2:
    If we search on 1-bit boundaries, we're going to find at most N-lgN-k unique patterns, so *at best* we can only get 1:1 compression on (N-lgN-k)/N of the inputs if we use only pi as the encoder, or in other words we won't compress (lgN + k)/N of the inputs. Even if we conservatively estimate k as 2*lgN + 8, this still gives us a likelihood of compressing 99.91% of the inputs in the range of 65536 bits long (in fact, it makes you really wonder which 56 inputs wouldn't be compressed -- we might even consider hardcoding them somewhere. hehe).

    Step 3:
    The previous step says that (lgN+k)/N won't be compressed. To combat this, we select some other transcendental number. We pay one extra bit in penalty to double our odds of finding a compression, and now statistically speaking everything has more than 100% chance of being compressed (or rather, that we're "likely" to find 2 matches for "most inputs" and only 1 match for the rest). If you're not satisfied, add one more bit and allow us to consider 4 different transcendental numbers. (* Strictly speaking, we just scaled it as 2*N/(N+1) or 4*N/(N+2), which may have an adverse effect on the compression ratio for small inputs, but the final expression is X * (N-lgN-k) / (N-lgX), so the limit approaches X as N gets large.)

    Challenge:
    Find one 256-bit input that cannot be encoded in 256 bits or fewer using this method with pi and e as the encoders (let the first bit be 0 if you use pi or 1 if you use e). Using the conservative estimate of k, we'd expect 25% of the inputs to fail on either one, so you've got a decent chance of finding a counterexample that fails both. I'll be waiting. :-)
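
    For anyone who wants to try the counting at a smaller size first, here is a rough Python sketch. It assumes the offset budget from Step 1 is 2^(N-lgN-k) starting positions in the binary expansion of pi; the function names are hypothetical, and pi is computed with Machin's formula plus a few guard bits, which is plenty for the sizes used here.

```python
def pi_fraction_bits(n_bits: int) -> str:
    """First n_bits binary digits of pi after the point (Machin's formula)."""
    guard = 32  # extra working bits to absorb truncation error
    one = 1 << (n_bits + guard)

    def arctan_inv(x: int) -> int:
        # arctan(1/x) scaled by `one`, from the alternating Taylor series
        total = term = one // x
        x_squared, divisor, sign = x * x, 3, -1
        while term:
            term //= x_squared
            total += sign * (term // divisor)
            divisor, sign = divisor + 2, -sign
        return total

    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    fraction = (pi_scaled >> guard) - (3 << n_bits)
    return format(fraction, f"0{n_bits}b")


def addressable_patterns(n: int, k: int = 0) -> set[str]:
    """N-bit strings reachable from the first 2^(N-lgN-k) offsets into pi."""
    lg_n = n.bit_length() - 1          # lg N, assuming N is a power of two
    offsets = 1 << (n - lg_n - k)
    bits = pi_fraction_bits(offsets + n - 1)
    return {bits[i:i + n] for i in range(offsets)}


found = addressable_patterns(16)
print(len(found), "of", 2 ** 16, "possible 16-bit inputs are addressable")
```

    Each offset names at most one N-bit string, so the set can never be larger than the offset budget, whatever digits pi happens to contain.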

  157. My understanding is... by Reece_Arnott · · Score: 1
    First off, the 25X compression is either pure marketing hype or an average for some 'real world' scenario that they dreamed up. Either way it's not a hard and fast figure.

    Having said that I believe this is similar to another backup data compression algorithm I saw a presentation for a couple of days ago. There are two parts:

    1) A database of unique chunks of data.

    2) A blueprint of index numbers that define how the data fits together.

    It looks at the data stream in X-bit chunks. If a chunk is unique, it stores the chunk in a database and stores an index pointer to it; if the chunk has been seen before, it just stores the index pointer.
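
    The two parts above can be sketched in a few lines of Python. This is only an illustration of the idea as described, not the vendor's code: the names are made up, the 8KB default chunk size is the figure from the presentation I saw, and a real product would presumably look chunks up by hash rather than by their full contents.

```python
def dedupe(data: bytes, chunk_size: int = 8192):
    chunks = []      # the "database" of unique chunks
    index_of = {}    # chunk contents -> its index number
    blueprint = []   # index numbers in stream order
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        if chunk not in index_of:          # first time we see this chunk
            index_of[chunk] = len(chunks)
            chunks.append(chunk)
        blueprint.append(index_of[chunk])  # always record the pointer
    return chunks, blueprint


def restore(chunks, blueprint) -> bytes:
    # Follow the blueprint to put the original stream back together.
    return b"".join(chunks[i] for i in blueprint)
```

    Three identical 8KB chunks followed by a different one would give a database of two chunks and the blueprint [0, 0, 0, 1].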

    Obviously as this index gets bigger it takes longer to search through, but there is less chance that a new chunk is unique. If this is done in a Disk-2-Disk-2-Tape situation it can take the backup of the server(s) onto a HDD and then run this algorithm at its leisure to get the compressed version for tape. I'm assuming as they are marketing this for TB levels of data that they have this one worked out - at least for this level of data.

    Another issue is that you get less and less compression as your index number takes up more bits (i.e. more and more unique chunks). This isn't going to be a practical problem in the near future as the one I was looking at was taking 8KB chunks. This means that to get enough unique chunks for the index to be the same size as the data it's replacing (8KB) you need at a minimum 2^65536 bits (10^19709 exabytes). This is simplified, but even if you allow a couple of orders of magnitude as a fudge factor for overhead in storing the index numbers, you aren't going to run into this problem soon.
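
    The arithmetic is easy to check in Python. The pebibyte example is my own assumption, picked only to show how small the index stays at realistic sizes; the last line reproduces the 10^19709 exabyte figure.

```python
import math

CHUNK_BITS = 8 * 1024 * 8  # one 8KB chunk = 65536 bits


def index_bits(unique_chunks: int) -> int:
    """Bits needed to give each unique chunk its own index number."""
    return max(1, (unique_chunks - 1).bit_length())


# Assumed example: one pebibyte (2**50 bytes) made of all-unique 8KB chunks.
chunks_in_pib = 2 ** 50 // (8 * 1024)
bits_per_index = index_bits(chunks_in_pib)

# 2**65536 bits expressed as a power of ten in exabytes (8e18 bits each).
exabyte_exponent = CHUNK_BITS * math.log10(2) - math.log10(8e18)

print(f"{bits_per_index}-bit index replaces a {CHUNK_BITS}-bit chunk")
print(f"2^65536 bits is about 10^{int(exabyte_exponent)} exabytes")
```

    Even a pebibyte of entirely unique chunks needs only a few dozen bits per index number, against 65536 bits for the chunk each one stands in for.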

    There is also a problem if you don't have too many repeating chunks. In fact there may be an *increase* in file size if you don't have many as you now have the overhead of the database to worry about.
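
    A rough size accounting shows the effect. The 8-byte chunks and 4-byte index numbers are arbitrary assumptions chosen to keep the example small, and the function name is made up.

```python
def deduped_size(data: bytes, chunk_size: int = 8, index_bytes: int = 4) -> int:
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    database = sum(len(c) for c in set(chunks))  # each unique chunk stored once
    blueprint = len(chunks) * index_bytes        # one index number per chunk
    return database + blueprint


repetitive = bytes(8192)                                           # all zero bytes
all_unique = b"".join(i.to_bytes(8, "big") for i in range(1024))   # no repeats

print(len(repetitive), "->", deduped_size(repetitive))
print(len(all_unique), "->", deduped_size(all_unique))
```

    With these numbers the repetitive input shrinks, while the input with no repeated chunks comes out half again as large as it went in.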

    So what's the answer to the scoffers' 'can you feed its output to itself?'

    The answer would probably be yes, but each time through you have fewer repeating chunks and therefore more unique ones, so the database overhead eventually gets to be a problem, i.e. you keep running its output through itself and it eventually comes near the theoretical minimum and oscillates, getting bigger, then smaller, then bigger again.

  158. LOOK AT USPTO.GOV, ONE PUBLISHED APPLICATION by Anonymous Coward · · Score: 0

    Nobody knows how to search. There is one published app and it's interesting.

  159. The post isn't funny, so it must be stupid by irritating+environme · · Score: 1

    nuff said

    --
    Hey, I'm just your average shit and piss factory.
  160. Re: sticks and stones... by poopdeville · · Score: 1
    > Yeah, but I proved that your proof is flawed, so nyyyyyyaaah. :P
    >
    > Seriously though, please read this and respond to anything you disagree with.

    Honestly, fuck that. I'm not going to waste my time deciphering the argument supporting your claim when I already know it's false. That sort of thing can be instructive, but only if the flawed argument is essentially insightful. This isn't.

    Some flaws I gleaned with a quick scan:

    1. One-to-one compression ratios are trivial. The identity function works just fine.
    2. You seem to be ignoring the fact that using multiple transcendentals is going to incur a hefty overhead.
    --
    After all, I am strangely colored.
  161. I actually have found a means around the math by MichaelHH · · Score: 1
    Compression CAN do huge volumes

    I use a variety of means that first actually increase the file at a variety of stages, but make it into an easily reduced format.

    I skew the ratios statistically early, giving me a larger chance of occurrence elsewhere. I use a variety of change-outs to turn the items whose ratios are most likely to occur into the types of data that compress best. I also utilize a revolutionary new means to track actual data flow, reducing the size in one region by 25%. In all I can achieve a standardized compression rate of approximately 84% on random binary data, with no loss, and repeatable.

    All the information is on this website http://www.security1.free2host.net/Compress/compressstart.php and will prove without a doubt in the math section the actual capabilities of this code for those who can follow higher end computer based mathematics.

    I even go so far as to say that I can compress the entire world's knowledge to a DVD at worst, and a floppy at best, and I can prove it on this website.

    --
    I am ready for the big jump in life, who will jump with me?
  164. This is my White Paper proving lossless repeatable by MichaelHH · · Score: 1
    Compression CAN do huge volumes

    I use a variety of means that first actually increase the file at a variety of stages, but make it into an easily reduced format.

    I skew the ratios statistically early, giving me a larger chance of occurrence elsewhere. I use a variety of change-outs to turn the items whose ratios are most likely to occur into the types of data that compress best. I also utilize a revolutionary new means to track actual data flow, reducing the size in one region by 25%. In all I can achieve a standardized compression rate of approximately 84% on random binary data, with no loss, and repeatable.

    All the information is on this website http://www.security1.free2host.net/Compress/compressstart.php and will prove without a doubt in the math section the actual capabilities of this code for those who can follow higher end computer based mathematics.

    I even go so far as to say that I can compress the entire world's knowledge to a DVD at worst, and a floppy at best, and I can prove it on this website.

    --
    I am ready for the big jump in life, who will jump with me?
  167. Re:This is my White Paper proving lossless repeata by pilkul · · Score: 1

    I don't even need to read your silly paper to call bullshit because it's already been proven that it's impossible to compress truly random data in the general case by even 1 bit. Do you also spend your days trying to draw maps that violate the 4-color theorem? Crank.