Ten Dropbox Engineers Build BSD-licensed, Lossless 'Pied Piper' Compression Algorithm
An anonymous reader writes: In Dropbox's "Hack Week" this year, a team of ten engineers built the fantasy Pied Piper algorithm from HBO's Silicon Valley, achieving 13% lossless compression on mobile-recorded H.264 videos and 22% on arbitrary JPEG files. Their algorithm can return the compressed files to their bit-exact values. According to FastCompany, "Its ability to compress file sizes could actually have tangible, real-world benefits for Dropbox, whose core business is storing files in the cloud." The code is available on GitHub under a BSD license for people interested in advancing the compression or archiving their movie files.
...Horn and his team have managed to achieve a 22% reduction in file size for JPEG images without any notable loss in image quality....
Without any notable loss in image quality.
Hmmm... that does not sound like "bit-exact" to me.
How much CPU time does it take to compress/decompress? Standard compression is hardly the best, just a good compromise between compression and usability.
"22% better compression" without "notable" quality loss on files which are ALREADY compressed in formats in which loss may be apparent is a far cry from their ultimate "goal" of "lossless" compression.
Comparing this to PNG or H.265 is missing the point: this is not a compression algorithm for creating new files. This is a way to take files you already have and make them smaller. Users are going to upload JPG and H.264 files to Dropbox, that is a given, so saying PNG is better is moot.
Would be nice to compare it against PNG, but the context here is storing other people's data, where you have no control over what format they use.
Meh, doesn't matter. Any processing load will be moved to an unoptimized JavaScript implementation that runs in the end user's browser.
And yet you can download the source code yourself and compile it.
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Link to a layman's description of the algorithm here: https://raw.githubusercontent.... It's bit exact and lossless. We haven't done comprehensive studies, but on the included test files it gets 13% compression on H.264 movies. Similarly the not-committed, but similar JPEG algorithm gets 22% on a comprehensive sample set of photos from a variety of devices.
Time for the new wave of Stacker clones, maybe a new DoubleSpace err DriveSpace?
Can it compress 3d videos? That seems to be a real challenge.
I wonder if somebody can develop this into a transparent kernel-module.
13-22% of a video library could mean saving several hundred GB on a multi-terabyte collection. Depending on if it decompresses on-the-fly and how hard it is on a CPU, it may also reduce disk I/O somewhat.
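As a rough back-of-the-envelope check of that estimate, here is a minimal sketch. The 3 TB library size is an assumption picked for illustration, and savings_gb is a made-up helper name; the 13% and 22% figures are the ones quoted in the summary (for H.264 video and JPEG respectively).

```python
def savings_gb(library_tb: float, reduction: float) -> float:
    """Space saved, in GB, when a library of `library_tb` terabytes shrinks by `reduction`."""
    return library_tb * 1000 * reduction  # decimal units: 1 TB = 1000 GB


if __name__ == "__main__":
    # Hypothetical 3 TB collection, at the two reduction rates quoted in the summary.
    for rate in (0.13, 0.22):
        print(f"{rate:.0%} of 3 TB is about {savings_gb(3.0, rate):.0f} GB")
```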
H.264 and JPEG are supposed to output random-looking bytes, by definition.
If you can compress those, something is very wrong.
Where'd you get that idea?
$ bzip2 test.jpg ...
I also tried it on a max-compressed file. Opened that test.jpg up in GIMP, then saved with quality at 0 (lowest), and re-did the compressing on both:
$ gzip -9 test.jpg
$ ls -la
-rw-r--r-- 1 me me 1519279 Feb 7 2012 test.jpg
-rw-r--r-- 1 me me 1430059 Aug 28 16:42 test.jpg.bz2
-rw-r--r-- 1 me me 1427872 Aug 28 16:44 test.jpg.gz
-rw-rw-r-- 1 me me 189230 Aug 28 16:50 test2.jpg
-rw-rw-r-- 1 me me 111623 Aug 28 16:50 test2.jpg.bz2
-rw-rw-r-- 1 me me 117971 Aug 28 16:51 test2.jpg.gz
Feel free to try the same experiment yourself on random jpg's you find online, or your own.
The goal of H.264 and JPEG isn't minimum file size at all costs. It's also not encryption. Your premise is wrong, and even old tech can compress this stuff further than it may already be.
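For anyone who would rather script that experiment, here is a minimal sketch using Python's standard gzip and bz2 modules instead of the command-line tools. The helper names compressed_sizes and report_file are invented for this example, and the numbers you get will depend entirely on the files you feed it.

```python
import bz2
import gzip
import os


def compressed_sizes(data: bytes) -> dict:
    """Size of `data` as-is and after maximum-level gzip and bzip2 compression."""
    return {
        "original": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bz2": len(bz2.compress(data, compresslevel=9)),
    }


def report_file(path: str) -> None:
    """Print each size for one file, with its ratio to the original size."""
    with open(path, "rb") as f:
        sizes = compressed_sizes(f.read())
    if sizes["original"] == 0:
        print(f"{os.path.basename(path)}: empty file")
        return
    for name, size in sizes.items():
        ratio = size / sizes["original"]
        print(f"{os.path.basename(path)} {name:>8}: {size:>10} bytes ({ratio:.1%})")
```

Calling report_file("test.jpg") on a JPEG of your own shows how much a general-purpose compressor can still squeeze out of it.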
> H.264 and JPEG are supposed to output random-looking bytes, by definition. If you can compress those, something is very wrong.
Well, it seems to be applied per codec, not as a general compression algorithm like zip. And they probably say mobile-encoded for a reason: simple encoders have to work on low power and in real time, and random JPGs from the Internet are probably the same. From what I can gather, the algorithm basically takes a global scan of the whole media and applies an optimized variable-length transformation, making commonly used values shorter at the expense of making less commonly used values longer. Nothing you couldn't do with a proper two-pass encoding in the codec itself; the neat trick is doing it to someone else's already compressed media afterwards in a bit-reversible way. Very nice when you're a third-party host, assuming the increase in CPU time is worth it, but not so useful for everyone else.
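To make the "global scan, then shorter codes for common values" idea concrete, here is a minimal two-pass sketch using a textbook Huffman code. This is only an illustration of the general principle described in the comment, not the Dropbox code (other comments in this thread describe it as a range coder), and all function names are invented for the example.

```python
import heapq
from collections import Counter


def huffman_code(data: bytes) -> dict:
    """Pass 1: count byte frequencies over the whole input and build a prefix code."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(data: bytes, code: dict) -> str:
    """Pass 2: replace every byte with its variable-length code (as a bit string)."""
    return "".join(code[b] for b in data)


def decode(bits: str, code: dict) -> bytes:
    """Invert encode(); works because no code word is a prefix of another."""
    inverse = {c: s for s, c in code.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

Frequent bytes end up with short code words and rare ones with long code words, and decode(encode(data, code), code) returns the input unchanged, which is the "bit-reversible" property the comment mentions.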
Exactly. We need to see the Weissman score.
Try the -h param for ls, calculating is for computers.
Not really useful in this context, because it truncates significant digits.
Actually, perfect compression WILL make the output resemble random bits if looked at statistically. What OP was missing is that perfect compression is hard and that most compression features a number of compromises for CPU speed, memory requirements, seekability, resilience, etc.
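One way to see the "compressed output looks statistically random" point is to measure the entropy of the byte histogram before and after compression. The sketch below is an illustration under stated assumptions: the sample input is synthetic text drawn from a skewed 13-symbol alphabet, zlib stands in for "a compressor", and byte_entropy and demo are names made up for this example.

```python
import math
import os
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (8.0 is the maximum)."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def demo(seed: int = 0, n: int = 200_000) -> dict:
    """Compare skewed synthetic text, its zlib output, and OS-provided random bytes."""
    rng = random.Random(seed)
    weights = [12, 9, 8, 8, 7, 7, 6, 6, 6, 4, 4, 3, 18]
    text = bytes(rng.choices(b"etaoinshrdlu ", weights=weights, k=n))
    return {
        "skewed text": byte_entropy(text),
        "zlib output": byte_entropy(zlib.compress(text, 9)),
        "os.urandom": byte_entropy(os.urandom(n)),
    }


if __name__ == "__main__":
    for label, bits in demo().items():
        print(f"{label:>12}: {bits:.3f} bits/byte")
```

The zlib output scores much closer to the random bytes than to the input text. A byte histogram is only a first-order test, though, so a high score here does not prove a stream is incompressible.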
The tiny bit of the Slashdot community that is left still talks about the actual things. If this were on Reddit, it would just be a stream of lame, overused references to the Silicon Valley show. Somebody would say "This guy fucks". Somebody else would make a joke about "Optimal tip-to-tip efficiency". Then somebody would ask "Do you know what tres commas means".
Those things were hilarious when put forth by a group of comedic actors. They are incredibly lame when they are overused every single time something even comes tangentially close to referencing them.
So while this particular story still sucks...it could be a lot worse.
Bottles.
> H.264 and JPEG are supposed to output random-looking bytes, by definition.
Bullshit. JPEG, *by its definition*, after the quantization step, uses a fairly modest & inefficient compression algorithm, because it was designed to be run on embedded systems with very modest processing power.
This guy fucks.
Had to be done.
I still have more fans than freaks. WTF is wrong with you people?
when they have made Nip Alert a reality.
Lossy codecs typically have two major stages -- the lossy parts (e.g. DCT while throwing out some component frequencies, motion prediction, etc.) -- followed by lossless entropy coding (e.g. Huffman in JPEG) to further compress the resultant data.
These compression algorithms just decompress the lossless part of the process and then recompress it with a more efficient lossless algorithm. On decompression, it then recompresses with the standard algorithm. In some cases (e.g. JPEG) you can keep a copy of the Huffman table that lets you recompress the data into a bit-accurate copy of the original file (you can include a small bit of extra information to make sure any remaining metadata matches up exactly).
The MacOS compression software StuffIt did this years ago.
After reducing all this dropbox grandstanding filler and chest thumping (is that corporate policy or something? this is certainly not the first time), it all boils down to:
You took frequency-space-transformed H.264 (pre-CABAC) and wrote a better range coder for it.
Yes/No?
Still pretty impressive, but for the love of god, please use succinct _technical_ descriptions. - https://raw.githubusercontent.... - is god-awful, as it just describes the general operation of a range coder.
Beating JPEG entropy coding is not that impressive, as that's just Huffman, which is really awful. CABAC is better, but still a decade behind top-of-the-line research (I suppose you're encode.ru regulars).
Superb work, Danielreiterhorn. Amazing work, and amazing that you are providing it as open source.
Would you mind if I ask about the motivation for releasing it as open source?
When it provides 10-20% compression, it would be worth a bit of money, right? In that case, why are you keeping it under a BSD licence?
I am in awe of people who do great things without expecting anything in return. Because try as I may, I can never be truly altruistic. So, I try to pick the brains of the ones who are to really understand their motivations.
Are there any hidden selfish motivations, or is it purely altruistic? If I can understand, I will be able to understand a bit more about people. And not me alone, many others in the forum too. Will you be able to help, Danielreiterhorn?
It depends if the goal is to a) market a hip algorithm or b) store movies more efficiently.
Open source makes it easy for anyone to contribute to the algorithm.
The more people contribute, the better the code will be at compressing movies.
The better it is at compressing movies, the fewer resources it will take to store them.
This isn't a zero-sum game we're talking about: it's about making the world a more efficient place, one bit at a time.
But the bottom line is that it's a lot easier for many organizations to contribute to a code base if there are no strings attached.
Interest from an article like this can get people playing around with compression.
Maybe another 10% gain is right around the corner.
I think the poster mixed up his compression. Saying bit-exact compression is useful for cloud services is... duh, though a little late to the playing field. Any on-disk compression will be lossless by definition; otherwise you'd be screwed any time you zip a file.
Now if he found a better streaming compression for video that keeps H.264 size but ups the quality... cool! But on-disk bit-exact compression is pretty mature now. See ZFS/Btrfs. Or Stacker/DoubleSpace if you're over 35.
> The goal of H.264 and JPEG isn't minimum file size at all costs. It's also not encryption. Your premise is wrong, and even old tech can compress this stuff further than it may already be.
True, but that's obvious to you and me - which does reinforce the point that the article & Dropbox "innovation" is pretty stupid.
Not to mention JPEG and H.264 are old news - if you want to compare "new" development, JPEG2000 and H.265 are the benchmarks...
> And they probably say mobile-encoded for a reason, simple encoders have to work on low power and in real time,
Actually, the encoders are rarely limited by power or CPU cycles. The decoders are, but the great thing about lossy encoding like JPEG/H.264/H.265 is the encoders can continually be improved without affecting the decoders.
That said, the reason this article is pointless is you can't USE the results - it breaks H.264 standards so HW decoders can't handle it, and no one wants to decode some proprietary format on the fly to stream to standard H.264 decoders...
It's not pointless for dropbox, since they can store it compressed a bit more and decompress it when a user asks for it. It's also not pointless for software decoders such as VLC that have access to a bit more memory and CPU capability to deal with it.
Two points is a bit more than pointless by my count.
Along those lines, it doesn't seem that long ago that using floating point in MP3 decoding was seen as a flaw.
> Weissman Score you fucking prick!
You have a +10 Wiseguy Score.
> It's also not pointless for software decoders such as VLC that have access to a bit more memory and CPU capability to deal with it.
If you are talking flexibility of decoders and extra CPU, why make a non-compatible file based on a decade+ old codec when you could just re-encode to H.265? Same with JPEG92 vs JPEG2000.
> Along those lines, it doesn't seem that long ago that using floating point in MP3 decoding was seen as a flaw.
Totally different issue, since it wasn't about making the codec non-compatible, but how it's decoded.
So, not entirely pointless, but not nearly as interesting as the article pretends it is, which is typical of business journalists who don't really understand tech. Many of the "inefficiencies" have already been solved with more recent codecs. Retrofitting things like variable macroblock sizes and alternate compression strategies onto old formats is not particularly revolutionary...
Dropbox appears to be in the business of storing clients' existing files, not forcing them to upgrade their hardware or software to support a new standard. That's where a bit of reversible compression on top, instead of a complete re-encode, makes sense.
On a personal scale, maybe it makes sense for a user to completely re-encode all of their video files to a new standard, but I don't think many people will be doing that. On an "industrial" scale with many users it makes even less sense, so the reversible hack that saves space seems a better fit than a full, unasked-for re-encode of clients' video files.
Perfect compression is also noncomputable.