The End of Video Coding? (medium.com)
An anonymous reader writes: Netflix's engineering team has an insightful post today that looks at how the industry is handling video coding; the differences in their methodologies; and the challenges newcomers face. An excerpt, which sums up where we are:
"MPEG-2, VC1, H.263, H.264/AVC, H.265/HEVC, VP9, AV1 -- all of these standards were built on the block-based hybrid video coding structure. Attempts to veer away from this traditional model have been unsuccessful. In some cases (say, distributed video coding), it was because the technology was impractical for the prevalent use case. In most other cases, however, it is likely that not enough resources were invested in the new technology to allow for maturity.
"Unfortunately, new techniques are evaluated against the state-of-the-art codec, for which the coding tools have been refined from decades of investment. It is then easy to drop the new technology as "not at-par." Are we missing on better, more effective techniques by not allowing new tools to mature? How many redundant bits can we squeeze out if we simply stay on the paved path and iterate on the same set of encoding tools?"
Should they just adopt new and inferior solutions and hope for the best?
To me this is the "science" part of Computer Science. Do research into new algorithms and methods of video encoding, but it would be stupid to start adopting any of that into actual products or live usage until and unless it tops the more traditional methods in performance.
"People who think they know everything are very annoying to those of us who do."-Mark Twain
Video codecs are not the only example of this, there are many.
There's nothing "insightful" about saying "there may be something better out there."
The insightful thing would be to find or create it.
Netflix is one of the big beneficiaries of efficient video coding. If they want to give new approaches the time to mature, then they should pay for that. Don't complain about nobody doing something when you're the one who should be doing it.
This is one case where the actual article is well worth reading, with a ton of links off to other areas to explore, and more interesting detail than the summary presents... well worth taking a look if you are at all interested in video compression and where the state of the art is going.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
What a stupid statement.
Is the expectation we adopt crappy replacements to "allow them to mature?"
They can mature until they're as good as what we have, not replace it with something which doesn't work to give it room to grow into something which doesn't suck.
Either you have a working replacement, or you have a good idea and a demo.
"Not-at-par" means the latter -- you don't have a mature product, and nobody is going to adopt it if it can't do what they can do now. Saying "it will eventually be awesome" tells me that eventually we'll give a damn, but certainly not now.
It's bad enough I have to fight my vendors that I'm not accepting a beta-rewrite and suffering through their growing pains to get to the mature product they're trying to replace. I'm not your fucking beta tester, so please don't suggest I grab your steaming turd and live with it until you make it not suck.
Boo hoo, immature technologies which don't cover what the technology they're trying to replace aren't being allowed to blossom into something useful. Make it useful, and then come to us.
If the math says a new technique is better, it won't matter if the first implementation isn’t good. Someone will fix the implementation and then it will match the mathematically predicted performance (or the guy who did the math will fix his error).
Let's say for argumentation that a new and much more efficient video codec was just invented.
The trouble is that it will immediately be locked up behind patents, free implementations will be sued, and it'll be packed with DRM and require per-play online-permission.
Our main problem isn't technology, it's the legal clusterfuck that has glommed onto the technology landscape.
It's Video Encoding.
Can we get some tech-literate editors please?
Would this mean the end of new video codecs which bring major technical advances?
Maybe yes.
Would this mean the end of new video codecs which are used for vendor lock-in schemes?
Hell no!
H.264 was king. Now we've got H.265 and AV1 which have not entirely replaced H.264 due to compatibility purposes, but have still gained significant traction.
On the audio side, AAC replaced MP3, and Opus is set to replace AAC. Opus can generally reach the same quality as MP3 in less than half the bits!
So I don't see this stagnation they talk about. These algorithms are generally straightforward and codec devs, even if they don't have a hyper-efficient implementation yet, will be able to see the benefit -- it's just a matter of investing in their time to develop high quality code and hardware for it.
Seriously, the title and summary would have been much better and easier to understand if they had used a single extra word, "Research": "The End of Video Coding Research". The article discusses that while video coding is in use pretty much everywhere, there hasn't been much progress or change in newer standards despite lots of interest and investment. New codecs are coming out, but they are all variations of the "block-based hybrid video coding structure" of MPEG-2/H.264/VP9, etc. Netflix is one company that would benefit from newer encoding standards.
Well, there's spam egg sausage and spam, that's not got much spam in it.
Or are we forgetting that livestreaming is getting more popular? First guy to beat the x264 encoder in realtime encoding performance is going to be a big winner.
You heard me.
they're getting more power efficient, but not much faster. I'm no expert, but from what I could tell the revolution in video encoding came because client hardware got a _lot_ faster at decoding high def video. That led to new codecs to take advantage of the increased power. I remember in 2005 needing special software to decode a 1080p stream on my GTX 240 video card and Athlon x64. By 2013 my phone could do it with VLC.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
We should invest a shitton of money in order to create a new codec that everyone can use and benefits everyone... You literally just described AV1. The entire process of it "being inferior while being iterated until better" also directly describes the past few years of AV1 until recently where it started to pull ahead in the compression vs quality game compared to other leading codecs.
The "hired very large codec dev team" they were contributing to is called "AOMedia - Alliance for Open Media", and one of the potential rabbit holes that got considered and worked on was Daala by Xiph (tons of new crazy ideas, including stuff like extending blocks into lapped blocks, a perceptual vector quantisation that doesn't rely on residual coding, etc.)
At the end of the day, the first thing to come out of AOMedia, by combining work such as Xiph's Daala, Google's VP10 and Cisco's Thor, is AV-1.
It's much tamer than what it could have been, but it still incorporates some interesting ideas.
(they didn't go all the way to using the ANS entropy coders suggested more recently by experiments such as Daala, but at least replaced the usual arithmetic encoder with Daala's range encoder).
By the time AV-2 gets out, we should see some more interesting stuff.
Probably this speech was meant as a rousing speech to encourage developers to go crazy and try new stuff.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
The existing technologies are probably fairly close to the pareto curve at this point, meaning newer technologies that provide better compression will also likely consume much more power for marginal gains. Improving the performance per watt of hardware may be more important at present than finding more computationally expensive algorithms.
Captcha: discuss
We should just declare one of the current schemes as "good enough", use it long enough for all relevant patents to expire, universally implement it on all devices, and serve it by default from almost all media sources.
It would be kind of like mp3 and jpg, and it would lower everybody's stress level.
At some point, you have to start asking why you need certain quality of experience in limited environments, and what infrastructure it takes to get there.
The biggest ongoing cost for streaming movies today is CDN storage, in the sense of having enough bitrates and resolutions to be able to accommodate all target devices and connection speeds. As much as people would like to deliver an HD picture to a remote village in the Philippines over a mobile connection on a feature phone, it isn't feasible at the moment for two reasons: they don't need or care about that level of experience, and it isn't technically feasible. The goal of CDN storage is to ensure the edge delivers the content, and the industry has toyed with real-time edge transcoding/transrating to address some of these issues, but fundamentally we are dropping asymptotically to a point on visual quality for a given bitrate and amount of computing power that a codec can deliver at the playback device.
In that sense, I'm shocked that Anne's post didn't mention Netflix's own VMAF, which fuses several elementary quality metrics with a machine-learned model. But even here, the fundamental is that we are still using block-based codecs for operations simply because of the fundamental nature of most video, i.e. objects moving around on a background. I'm also shocked that Anne didn't discuss alternative coding methods like wavelet-based (e.g. JPEG 2000), but - again - these approaches have their own limitations and don't address interframe encoding in the same way that a block-based codec can. If there was a novel approach to coding psychovisually-equivalent video that would address computing power, bitrate and quality reasonably, I believe it would have been brought forward already.
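For anyone wondering what an "automated quality measurement" actually computes, here is a minimal sketch of PSNR, one of the classic full-reference metrics. This is not VMAF and not anything from Netflix's code; the function name and the flat-list-of-pixels input are assumptions made purely for illustration.

```python
import math

def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Hypothetical helper: frames are flattened lists of 8-bit sample values.
    """
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the original and the decoded samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)

# One sample out of four is off by 16 levels.
score = psnr([0, 0, 0, 0], [0, 0, 0, 16])
```

PSNR only measures pixel error, which is exactly why perceptual metrics get proposed: two encodes with the same PSNR can look quite different to a human.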
I think 5G deserves a big mention here that was lacking in Anne's post, because faster connections may solve many of the types of issues that affect perceived visual quality at low bitrates. Get more bandwidth, and you have a better experience. Hopefully 5G will proliferate quickly, but this will be tricky in the developing world where its inherently decentralized nature and the political environments will make its ubiquitous deployment a serious challenge.
In the end, we're all fighting entropy, particularly when it comes to encoding video. Our ability to perceive video is affected by an imperfect system - the human eye and brain. That's why we've made such gains in digital video since the MPEG-1 days. But the fantasies of ubiquitous HD video to everyone in the world on 100kbps connections are just that. When you're struggling to get by and don't have good health care or clean drinking water, the value of streaming high-quality video isn't there from a business perspective, much less a technical perspective. Everyone will get an experience relative to the capabilities of technology and the value it brings to them accordingly. All else is idealistic pipe dreams until otherwise proven.
"For the video codec community to innovate more quickly, and more accurately, automated video quality measurements that better reflect human perception should be utilized."
Yup because innovation is at a stand still and new technologies don't get created to solve an actual problem. Hey wait a tic, they do.
Hmmm, so what's your fucking problem again? Could it be you want adoption of your codec but your codec doesn't pass the standard testing suites? So the solution is not to improve your codec - that would be too much work - just re-write the tests to say the codec is just good enough. Problem solved.
Moron.
but it would be stupid to start adopting any of that into actual products or live usage until and unless it tops the more traditional methods in performance.
The logic behind the article is that the new techniques will never top the more traditional ones (or at least have no way to get there in the current state of affairs), because most of the resources (dev time, budget, etc.) are spent optimizing the "status-quo" codecs, and not enough is spent on the newcomers.
By the time something interesting comes up, the latest descendant of the "status-quo" will have been optimized much further.
It doesn't matter that the PhD thesis "Using Fractal Wavelets in non-Euclidean spaces to compress video" shows some promising advantages over MPEG-5: it will not get funded, because by then "MPEG-6 is out" and is even better just through minor tweaking everywhere.
Thus new ideas like that PhD thesis never get funded and explored further, and only further tweaking of what already exists gets funded.
I personally don't agree.
The most blatant argument is the list itself.
With the exception of AV-1, the list consists exclusively of block-based algorithms: MPEG-1 and its evolutions (up to HEVC) and things that attempt to do something similar while avoiding the patents (the VPx series by On2, then Google).
It completely ignores stuff like Dirac and Schroedinger:
a completely different approach to video compression (based on wavelets) that got funded, developed and is actually in production (at no less than the BBC).
It completely ignores the background behind AV-1 and how it relates to Daala.
AV-1 was designed from the ground up not as an incremental evolution of (or patent circumvention around) HEVC; it was designed to go in a different direction (if nothing else, to avoid the patented techniques of MPEG, as avoiding patent madness was the main target behind AV-1 to begin with).
It was done by AOMedia, where lots of groups poured in resources (including Netflix themselves).
Yes, on one side of the AV-1 saga, you have entities like Google that donated their work on VP10 to serve as a basis - so we're again at the "I can't believe it's not MPEG(tm)!" clones.
But among other code and technique contributions (besides Cisco's Thor, which I'm not considering for the purpose of my post), there's also Xiph, who provided their work on Daala.
There's some crazy stuff that Xiph has been doing there: stuff like replacing the usual "block"-based compression with slightly different "lapped blocks", more radical stuff like throwing away the whole idea of "coding residuals after prediction" and replacing it with "Perceptual Vector Quantization", etc.
Some of these weren't kept for AV-1, but other crazy ideas actually made it into the final product (the classic binary arithmetic coding used by the MPEG family was thrown away for integer range encoding, though they didn't go as far as using the proposed alternative, ANS - Asymmetric Numeral Systems)
Overall, incrementally improving on MPEG (MPEG 1 -> MPEG 2 -> MPEG 4 ASP -> MPEG 4 AVC/H264 -> MPEG-H HEVC/H265) gets hit hard by the law of diminishing returns. There's only so far that you can reach by incremental improvement.
Time to get some new approaches.
Even if AOMedia's AV-1 isn't that revolutionary, that's more out of practical considerations (we need a patent-free codec available as fast as possible, including quickly in hardware, so better to select things that are known to work well) than for not having tried new stuff.
And even if some of the more out-of-the-box experiments didn't end up in AV-1, they might end up in some future AV-2 (Xiph keeps experimenting with Daala).
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
It's about efficiency. That's what engineers do: find the most efficient solution based upon all the given constraints...including you.
Any new mousetrap is going to be inferior to the highly refined existing solutions, no point in building a better mouse trap.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Competitors to JPEG so far have only reached 30% less data under laboratory conditions compared to the standard JPEG encoder. That amount of improvement can however also be experienced by using more sophisticated JPEG encoders. Even if we'd reach half the file size for a realistic set of images, I doubt we would switch, as JPEG is largely good enough.
We might see a successor to JPEG for moving images yet, as those codecs slowly get good enough that the key frames matter and can even take up most of the storage space for some videos.
yes, it's dead. Use png for images, ogg theora for video and flac for sound. End of discussion.
Yes! It was things like that totally missing from the summary that made it interesting to fully read through.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
But they can die of starvation much better if they can watch Netflix on their phones! The dying experience would be awesome! So much winning!
It's the next frontier.
Instead of breaking images up into rectangles and compressing each rectangle separately (which produces block artifacts), we should just do a wavelet compression on the entire image at once for best viewing.
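As a rough illustration of the wavelet idea, here is a minimal sketch of one level of the Haar transform (the simplest wavelet) applied to a single row of pixel values. The function names and the sample row are made up for illustration; real wavelet codecs typically use longer filters and several decomposition levels, and this sketch does no quantisation or entropy coding at all.

```python
def haar_forward(row):
    """Split an even-length row into pairwise averages and pairwise details."""
    pairs = list(zip(row[0::2], row[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]  # coarse, half-resolution signal
    details = [(a - b) / 2 for a, b in pairs]   # what is needed to undo the averaging
    return averages, details

def haar_inverse(averages, details):
    """Rebuild the original row from averages and details."""
    row = []
    for avg, det in zip(averages, details):
        row.extend([avg + det, avg - det])
    return row

row = [4, 6, 10, 12, 8, 8, 0, 2]
averages, details = haar_forward(row)
restored = haar_inverse(averages, details)
```

On smooth content the details are small numbers that compress well, and because the transform runs across the whole row there are no block edges to show up as artifacts.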
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Don't they encode on the fly? I can understand having copies of the most popular formats but it seems much easier to do the oddball ones on the fly. I use Universal Media Server and CPU usage hardly registers on my old first gen i7 and it has no hardware acceleration.
love is just extroverted narcissism
Why not convince the people behind a good open source video player (vlc springs to mind) and a good converter tool (handbrake springs to mind) to support a promising new codec? Geeks start using it, and we rapidly see whether it's worth pursuing or not. This strategy has worked for other codecs.
If we sit on our hands waiting for the industry to adopt a new standard, we'll still be using mp4 when cockroaches inherit the earth.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Just pulling numbers out of a hat, starting from raw video, lossless compression perhaps drops the bitrate needed by an order of magnitude. Existing lossy algorithms drop maybe another order of magnitude. It is very likely that with a lot of work, that could drop another 50%, but it's fairly unlikely to drop another order of magnitude off of existing systems.
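To put the hat numbers into actual bitrates, here is a quick back-of-the-envelope sketch. The 1080p, 30 fps, 24 bits per pixel source is my assumption for illustration, and the "order of magnitude" factors are the same guesses as above, not measurements.

```python
# Assumed source format (not from any spec): 1080p, 30 fps, 8-bit RGB.
width, height, fps, bits_per_pixel = 1920, 1080, 30, 24

raw_bps = width * height * fps * bits_per_pixel  # uncompressed bits per second
lossless_bps = raw_bps / 10      # guess: lossless buys one order of magnitude
lossy_bps = lossless_bps / 10    # guess: lossy buys another order of magnitude
future_bps = lossy_bps * 0.5     # guess: a further 50% with a lot of work

for label, bps in [("raw", raw_bps), ("lossless", lossless_bps),
                   ("lossy", lossy_bps), ("future", future_bps)]:
    print(f"{label:>8}: {bps / 1e6:8.1f} Mbit/s")
```

Under these assumptions raw video is roughly 1.5 Gbit/s, today's lossy codecs land around 15 Mbit/s, and the hoped-for 50% would bring that to about 7.5 Mbit/s.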
Netflix should watch what it wishes for, though. Dropping another order of magnitude maybe would make things cheaper for Netflix, but that's also the kind of thing which enables new competitors to come along and usurp the current vendors.
In a word. NO. There is nothing missed.
New technology, per 1991 ATT research, doesn't have a chance of disrupting older technology without a measurable improvement of more than an order of magnitude (10X). GOOG failed upon search to locate the research. Here's the latest I can Google https://bit.ly/2tc6oJl
No.
I never understood why video specific autoencoders were not used instead of existing codecs. I understand the hardware side of this is difficult (hardware could be created that could handle autoencoders but they don't currently exist), but for laptops and cell phones, an autoencoder would likely work much more efficiently for when bandwidth was limited. Perhaps bandwidth is good enough and saving cell phone batteries is higher priority but general purpose codecs are a very heavy hammer with which to solve issues due to limited network bandwidth. Seems like mathematicians looking for a problem to solve rather than a real attempt to find the most efficient solution here.
"Those that start by burning books, will end by burning men."
It's the quantum theory of technology. New tech cannot be just a little better. It has to be a quantum leap. It has to be significant enough to overcome the inertia of an established tech and ecosystem. Or it must fulfill a specific need.
I'm also shocked that Anne didn't discuss alternative coding methods like wavelet-based (e.g. JPEG 2000), but - again - these approaches have their own limitations and don't address interframe encoding in the same way that a block-based codec can.
I mean, I guess you could use JPEG-2000 for the iFrames, but it's very seriously not designed for video. An interesting potential method would be extrapolating the wavelet to the third dimension (in this case, the time series) for videos, but your working memory would go up dramatically (which would make hardware decoders prohibitively expensive). Also, data loss in hierarchical encoding schemes is often catastrophic to the entire block. Not so bad on a single frame, pretty devastating if you just torched 64 frames of a video.
One of the issues is getting newer ideas added to existing standards or even to standards still in development. I had found that Gray coding (only 1 bit changes while counting: 000,001,011,010,110,111,...) graphical images helped with early compression and proposed that when png was being developed, but it didn't go anywhere. Mapping color space was another idea, because about 8 million of the colors in 24 bit RGB are brown or grey. 24 bit HSV was trivial to add to the analog section of VGA and would display far more shades of orange by using a few parts and dropping half the bit waste. That was a soldering iron at home type hack. Even the compression of usenet never quite got around to not rebuilding the Huffman compression tables for each message. The techniques now used for optimized jpg would have reduced the bandwidth of text groups by about half. It would have reduced the compression compute costs even more.
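For reference, the Gray code counting sequence mentioned above can be generated with the standard binary-reflected construction. A minimal sketch (the function name is just for illustration):

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

# Counting from 0 to 7 in 3-bit Gray code.
codes = [format(to_gray(i), "03b") for i in range(8)]

# Number of bits that flip between each pair of neighbouring codes.
flips = [bin(to_gray(i) ^ to_gray(i + 1)).count("1") for i in range(7)]
```

Because neighbouring values differ in exactly one bit, a smooth gradient in an image turns into bit planes with long runs, which is what simple compressors like.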
So you're saying Netflix just discovered the 'Explore vs Exploit' family of Computer Science problems..? 'How far down one path should you go when that path is presumed to be the best (Exploit), vs experimenting with alternative paths that may turn out to be better (explore)?'
The current approaches are very mathematical, looking at pixels. But what ends up in our brains is high level symbols. If an AI can get at those, then extremely tight coding becomes possible.
For example, it takes a lot of bytes to represent video of "a man walking under a tree". But that phrase takes fewer than 50 bytes. The reconstructed video does not have to have the same type of tree as the original, just some sort of tree.
That's taking it to the extreme. But if an AI can recognize the types of objects in a video, and produce a model of them, then it can simply render the model in different ways. Huge compression.
It does not matter if the rendered video is quite different from the original as long as it feels the same to watch it.
That all said, massive increases in available bandwidth make this rather pointless.
Arithmetic coders are mathematically equivalent to range coders. Is it an encoding speed increase they were looking for? Or perhaps the ease with which you can modify range coder mid-stream compared to arithmetic?
The key point is that MPEG's coders are binary coders. They code bit after bit.
On hardware, that means a part of that needs to run at a very high frequency.
It also means that you need to binarize (convert the symbols into a bit stream) and manage contexts for all these individual bits.
(CABAC = *Context adaptive* binary arithmetic coder).
Implementations may also rely on non-integer math.
Daala's entropy coder can work on any discrete list of symbols (not necessarily 1 bit).
In practice it works on values encoded in a few bits (e.g.: a 4-bit number coding any symbol from a list of up to 16).
This gives a lower frequency of operations in hardware (you run it once to get your symbol out instead of 4x bit-by-bit), and makes managing the entropy probability prediction easier (just a single list of probabilities of any of the 16 symbols coming up, instead of several bit contexts). It is also much easier to implement 100% using only integer math.
So yeah, in theory arithmetic coding is just a subtype of range encoding.
In practice, MPEG's CABAC is implemented in a way that is more costly in hardware and performance than Daala's entropy encoder.
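To make the "one step per multi-bit symbol, integer math only" point concrete, here is a toy sketch of the interval-narrowing step of a range coder over a 16-symbol alphabet. This is not Daala's or AV-1's actual code: the function name, the frequency table and the 16-bit starting range are assumptions for illustration, and renormalisation, carry handling and probability adaptation are all left out.

```python
def narrow(low, rng, counts, symbol):
    """Shrink the interval [low, low + rng) to the slice owned by `symbol`.

    `counts` is a hypothetical frequency table with one entry per symbol.
    Only integer additions, multiplications and one division are used.
    """
    total = sum(counts)
    cum_lo = sum(counts[:symbol])   # frequencies of all symbols before this one
    step = rng // total             # integer division, no floating point
    return low + step * cum_lo, step * counts[symbol]

# 16 possible symbols (a 4-bit value); symbol 3 is assumed to be very likely.
counts = [1] * 16
counts[3] = 17                      # total frequency is 32

low, rng = 0, 1 << 16
low, rng = narrow(low, rng, counts, 3)   # one call codes a whole 4-bit symbol
```

A binary coder would need four such steps (plus the binarization and per-bit contexts) to get the same 4-bit value through, which is where the hardware clock-rate argument comes from.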
(And if you squint at it, a tANS is a weird type of range encoder, except that you've flipped the bits over and work in reverse, and end up not with "slices" of varying size but with a series of numbers that are more or less likely to show up in the list. And except that you use a table instead of maths ops. So it's also a distant relative of range encoding.)
(And if you squint the other way, tANS looks like some weird cousin of Huffman, except that you use multiple tables and "carry over" the non-integer part of the entropy. i.e.: a symbol that has 2.5 bits according to Shannon will always get coded with 3 bits in Huffman (giving you an overhead of an extra .5 bits), but will get coded in 2 bits in tANS, with the extra .5 bits "carried over" into the next operation, which could then output 3 bits, giving you the exact 2.5-bit average predicted.)
(and tANS is implemented 100% with integer maths and RAM for the tables, but this reliance on RAM is why AV-1 decided to use the range encoder for hardware implementation: cheaper on silicon).
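The 2.5-bit example works out as follows. A small sketch of the arithmetic only, not an ANS implementation; the probability is picked so that the Shannon cost is exactly 2.5 bits, and rounding up to a whole bit is a simplification of what a real Huffman code would assign.

```python
import math

p = 2 ** -2.5                      # assumed symbol probability, about 0.177
ideal_bits = -math.log2(p)         # Shannon cost of this symbol: 2.5 bits

whole_bit_code = math.ceil(ideal_bits)     # 3 bits if you must round up every time
overhead = whole_bit_code - ideal_bits     # 0.5 bits wasted per occurrence

# A coder that carries the fraction over can alternate 2-bit and 3-bit outputs.
alternating = [2, 3, 2, 3]
average_bits = sum(alternating) / len(alternating)
```

So over many occurrences the carry-over scheme averages the predicted 2.5 bits, while the whole-bit code pays 3 every time.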
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
This isn't exactly the article's point, but.... the resolution war has to stop soon. Right now H.265 covers 4K just fine, and probably a little higher. Humans can't discern more than that at normal TV viewing distances. Do we really need 8K or 16K TVs? I'm still rocking my 1080p just fine.
Now before people say I sound like Gates when he said 640K is enough RAM, we're talking about a human sensory limitation. CDs are 44.1 kHz and hi-def sources go up to 96 kHz and a (honestly...) ridiculous sample rate of 192 kHz in some cases. We've hit our limit. Nobody can argue that 384 kHz is needed. When do we reach that point for video, outside of theater size screens?
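Here is a rough sketch of the arithmetic behind both limits. Everything specific in it is an assumption for illustration: a 55-inch 16:9 screen, a 2.5 m viewing distance, and the common rule of thumb that about 60 pixels per degree matches 20/20 acuity. The function names are made up.

```python
import math

def nyquist_khz(sample_rate_khz):
    """Highest audio frequency a given sample rate can represent."""
    return sample_rate_khz / 2

def pixels_per_degree(h_pixels, diagonal_in, distance_m, aspect=(16, 9)):
    """Horizontal pixels per degree of visual angle for a flat screen."""
    aw, ah = aspect
    width_m = diagonal_in * 0.0254 * aw / math.hypot(aw, ah)
    fov_deg = 2 * math.degrees(math.atan(width_m / 2 / distance_m))
    return h_pixels / fov_deg

ACUITY_PPD = 60  # assumed rule-of-thumb limit for 20/20 vision

cd_limit = nyquist_khz(44.1)                   # already above typical hearing range
ppd_1080 = pixels_per_degree(1920, 55, 2.5)
ppd_4k = pixels_per_degree(3840, 55, 2.5)

for name, ppd in [("1080p", ppd_1080), ("4K", ppd_4k)]:
    print(f"{name}: {ppd:.0f} px/deg, beyond acuity limit: {ppd > ACUITY_PPD}")
```

Under these assumptions both 1080p and 4K are past the acuity limit at couch distance; sit closer or buy a much bigger screen and the numbers change.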