The End of Video Coding? (medium.com)
An anonymous reader writes: Netflix's engineering team has an insightful post today that looks at how the industry is handling video coding; the differences in their methodologies; and the challenges new comers face. An excerpt, which sums up where we are:
"MPEG-2, VC1, H.263, H.264/AVC, H.265/HEVC, VP9, AV1 -- all of these standards were built on the block-based hybrid video coding structure. Attempts to veer away from this traditional model have been unsuccessful. In some cases (say, distributed video coding), it was because the technology was impractical for the prevalent use case. In most other cases, however, it is likely that not enough resources were invested in the new technology to allow for maturity.
"Unfortunately, new techniques are evaluated against the state-of-the-art codec, for which the coding tools have been refined from decades of investment. It is then easy to drop the new technology as "not at-par." Are we missing on better, more effective techniques by not allowing new tools to mature? How many redundant bits can we squeeze out if we simply stay on the paved path and iterate on the same set of encoding tools?"
"MPEG-2, VC1, H.263, H.264/AVC, H.265/HEVC, VP9, AV1 -- all of these standards were built on the block-based hybrid video coding structure. Attempts to veer away from this traditional model have been unsuccessful. In some cases (say, distributed video coding), it was because the technology was impractical for the prevalent use case. In most other cases, however, it is likely that not enough resources were invested in the new technology to allow for maturity.
"Unfortunately, new techniques are evaluated against the state-of-the-art codec, for which the coding tools have been refined from decades of investment. It is then easy to drop the new technology as "not at-par." Are we missing on better, more effective techniques by not allowing new tools to mature? How many redundant bits can we squeeze out if we simply stay on the paved path and iterate on the same set of encoding tools?"
At some point, you have to start asking why you need certain quality of experience in limited environments, and what infrastructure it takes to get there.
The biggest ongoing cost for streaming movies today is CDN storage, in the sense of having enough bitrates and resolutions to be able to accommodate all target devices and connection speeds. As much as people would like to deliver an HD picture to a remote village in the Philippines over a mobile connection on a feature phone, it isn't feasible at the moment for two reasons: they don't need or care about that level of experience, and it isn't technically feasible. The goal of CDN storage is to ensure the edge delivers the content, and the industry has toyed with real-time edge transcoding/transrating to address some of these issues, but fundamentally we are dropping asymptotically to a point on visual quality for a given bitrate and amount of computing power that a codec can deliver at the playback device.
In that sense, I'm shocked that Anne's post didn't mention Netflix's own VMAF, which is a composite measure of different flavors of PSNR, SSIM and some deep learning. But even here, the fundamental is that we are still using block-based codecs for operations simply because of the fundamental nature of most video, i.e. objects moving around on a background. I'm also shocked that Anne didn't discuss alternative coding methods like wavelet-based (e.g. JPEG 2000), but - again - these approaches have their own limitations and don't address interframe encoding in the same way that a block-based codec can. If there was a novel approach to coding psychovisually-equivalent video that would address computing power, bitrate and quality reasonably, I believe it would have been brought forward already.
I think 5G deserves a big mention here that was lacking in Anne's post, because faster connections may solve many of the types of issues that affect perceived visual quality at low bitrates. Get more bandwidth, and you have a better experience. Hopefully 5G will proliferate quickly, but this will be tricky in the developing world where its inherently decentralized nature and the political environments will make its ubiquitous deployment a serious challenge.
In the end, we're all fighting entropy, particularly when it comes to encoding video. Our ability to perceive video is affected by an imperfect system - the human eye and brain. That's why we've made such gains in digital video since the MPEG-1 days. But the fantasies of ubiquitous HD video to everyone in the world on 100kbps connections are just that. When you're struggling to get by and don't have good health care or clean drinking water, the value of streaming high-quality video isn't there from a business perspective, much less a technical perspective. Everyone will get an experience relative to the capabilities of technology and the value it brings to them accordingly. All else is idealistic pipe dreams until otherwise proven.