The Wretched State of GPU Transcoding
MrSeb writes "This story began as an investigation into why Cyberlink's Media Espresso software produced video files of wildly varying quality and size depending on which GPU was used for the task. It then expanded into a comparison of several alternate solutions. Our goal was to find a program that would encode at a reasonably high quality level (~1GB per hour was the target) and require a minimal level of expertise from the user. The conclusion, after weeks of work and going blind staring at enlarged images, is that the state of 'consumer' GPU transcoding is still a long, long way from prime time use. In short, it's simply not worth using the GPU to accelerate your video transcodes; it's much better to simply use Handbrake, which uses your CPU."
...since the results of OpenCL code is static across GPUs rather than being an arbitrary output.
The quick sync hardware is part of the IGP block but it is specialized hardware specifically geared towards transcoding. For example, it is not using the main GPU pipeline and shader hardware to do the transcoding.
AntiFA: An abbreviation for Anti First Amendment.
Quick Sync uses dedicated HW on the die. Intel's solution that uses their GPU is called Clear Video.
As the author of the story, that's an error that slipped past in formatting. I'm uploading the proper graph right after I hit "Reply" on this.
Actually, recent GPUs *were* meant to do exactly this type of thing, and have been marketed by Nvidia and ATI heavily for this purpose. Of course there needs to be a CPU as well. The CPU runs the operating system and application code, and offloads very specific, parallelizable work to the GPU. This sort of architecture has existed almost as long as modern CPUs have existed.
And Quick Sync is even less of a general purpose CPU solution than using a GPU. Quick Sync uses dedicated application specific hardware on the die to do its encoding.
Here's a link to the article in 1 page.
Well see, that's the thing. A GPU is better suited to some kinds of massively parallel tasks, like video encoding. After all, you're applying various matrix transforms to an image, with a bunch of funky floating point math to whittle all that transformed data down to its most significant/perceptible bits. GPUs are supposed to be really really good at this sort of thing.
My hunch is that the problems we're seeing are caused by two big issues:
1. lack of standardization across GPU processing technologies. CUDA vs OpenCL vs Quicksync, and a bunch of tag-alongs too. Each one was designed around a particular GPU architecture, so porting programs between them is non-trivial.
2. lack of expertise in GPU programming. Let's be fair here: GPUs are a drastically different architecture than any PC or embedded platform we're used to programming. While I could follow specs and write an MPEG or H.264 encoder in any high-level language in a fairly straight-forward manner, I can't even begin to envision how I would convert that linear code into a massively parallel algorithm running on hundreds of dumbed-down shader processors. It's not at all like a conventional cluster, because shaders have very limited instruction sets, little memory but extremely fast interconnects. We have a hard enough time making CPU encoders scale to 4 or 8 cores, this requires some serious out-of-the-box thinking to pull off.
Moving to a GPU virtually requires starting over from scratch. This is a set of constraints that are very foreign to the transcoding world, where the accepted trend was to use ever-increasing amounts of cheaply available CPU and memory, with extensively configurable code paths. The potential is there, but it will take time for the hardware, APIs and developer skills to converge. GPU transcoding should be seen as a novelty for now, just like CPU encoding was 15 years ago when ripping a DVD was extremely error-prone and time-consuming. If you want a quick, low quality transcode, the GPU is your friend. If you're expecting broadcast-quality encodes, you're gonna have to wait a few years for this niche to grow and mature.
-Billco, Fnarg.com
What I don't understand is how this happens. Why would the same calculation get different results on different GPUs? Are they doing the math incorrectly?
Give me Classic Slashdot or give me death!
Hint: Not all GPUs have IEEE FP compliant math. Often they break the standard, or do something else altogether just to improve performance.
has anyone tried Badaboom?
Not much point. It's been discontinued.
Breakfast served all day!
Let's be clear here: the x264 guys will never be happy. QuickSync, AMD's Video Codec Engine, and NVIDIA's NVENC all use fixed function blocks. They trade flexibility for speed; it's how you get a hardware H.264 encoder down to 2mm2. There are no buttons to press or knobs to tweak and there never will be, because most of the stuff the x264 guys want to adjust is permanently laid down in hardware. The kind of flexibility they demand can only be done in software on a CPU.
Because they're not using the same encode paths.
All 3 hardware encode paths - Intel QuickSync, AMD AVIVO, and NVIDIA's CUDA video encoder - are largely black boxes. Programs such as MediaEspresso are basically passing off a video stream to the device along with some parameters and hoping the encoder doesn't screw things up too badly. Each one is going to be optimized differently, be it for speed, more aggressive deblocking, etc. These are decisions built into the encoder and cannot be completely controlled by the application calling the hardware. And you have further complexities such as the fact that only Intel's QuickSync has a hardware CABAC encoder, while AMD and NV do it in software (and poorly since it doesn't parallelize well).
Or to put this another way, everyone has their own idea on what the best way is to encode H.264 video and what kind of speed/quality tradeoff is appropriate, and that's why everything looks different.
Because behind the scenes your "encoder" program is actually using several different encoders. Generally the encoder has to be custom written specifically for the specialized GPU hardware it is targeting.
This has largely ceased to present a problem, thanks to OpenCL.
GPU code no longer needs to run as custom-written shaders targetting 20 different platforms. One program, written in fairly straightforward C, will run on just about any modern platform. And it will do so at speeds that absolutely dwarf a CPU - The Radeon x9yy cards (for x>=5) easily crush a modern CPU at OpenCL code by a factor of a thousand. The x8yy cards still perform admirably, over three hundred to one. For NVidia, the Tesla series do well, while the GX... Well, ten to fifty times faster doesn't exactly suck...
The real problem here? Most people have really crappy GPUs. Even compared to the $100 card range, your GPU sucks ass, and hard. And you can't really blame people, because honestly, even modern IGPs will run just about anything fairly well, so why would you pay for more?
But don't blame the GPUs, or the concept in general. If you target OpenCL and the user has a halfway decent modern GPU, it will give consistent, reliable results, and will blow away your CPU many times over.
What strikes me as a bad sign is not so much that the GPU transcoding doesn't necessarily produce massive speed improvements; but that the products tested produce overtly broken output in a fair number of not-particularly-esoteric scenarios.
Expecting super-zippy magic-optimized speedups on all the architectures tested would be the mark of expecting serious maturity. Expecting a commercially released, GPU-vendor-recommended, encoding package to manage things like "Don't produce an h264 lossy-compressed file substantially larger than the DVD rip source file" and "Please don't convert a 24FPS video to 30FPS for no reason on half the architectures tested" seems much more reasonable.
I can imagine that the subtle horrors of the probably-makes-the-IEEE-cry differences in floating point implementations, or their ilk, might make producing identical encoded outputs across architectures impossible; but these packages appear to be flunking basic sanity checks, even in the parts of the program that are presumably handled on the CPU(when a substantial portion of iPhone 4S handsets are 16GB devices, letting the 'iPhone 4S' preset return a 22GB file while whistling innocently seems like a bad plan...)
but, at least in this context, speed is nearly irrelevant because it fails at the task at hand, producing high quality video.
who cares how fast it completes a task if it's failing? Nobody gives little jimmy props when he finishes the hour-long test in 5 minutes but scores a 37% on it.
Fuzzy,
You pretty much nailed my problem with the output. :P That's the reason why Arcsoft, with compatibility problems, ultimately ranked above Cyberlink. Arcsoft doesn't do very good work on the Radeon 7950 and it can't handle CUDA, but it at least gets something right. Quick Sync video is very good.
Cyberlink got nothing right anywhere. And it's the program most-often recommended to reviewers as a benchmark when we want to review GPU encoding.
I set out to test presets. Specifically, I set out to test the presets of software packages which are sold on the purported *strength* of those presets. I say so in the first paragraph:
" Our goal was to find a program that would encode at a reasonably high quality level (~1GB per hour was the target) and require a minimal level of expertise from the user."
That's why MediaCoder results weren't included.
The entire article came about because Cyberlink's iPhone 4S preset yielded files that were 1.4GB if I used CPU encoding or a GTX 580, and 188MB if I used Quick Sync. That disparity is what I noticed when I went to check encode quality for the initial IVB review.
Can you build custom profiles in CME and create outputs that avoid these problems? You can -- though some options aren't available. That, however, is not the point. If I'm going to build my own custom profiles, I can download a copy of MediaCoder for free and do it with a more powerful piece of software that offers a huge number of options.
I did a review of "Software that claims to automate the GP encode process." I did not do a review of "Can Cyberlink MediaEspresso EVER create a decent image?" Given what I set out to evaluate, my ability to tweak profiles to achieve a satisfactory result is not a valid criteria for my conclusions.
Hint: Not all GPUs have IEEE FP compliant math. Often they break the standard, or do something else altogether just to improve performance.
I can't speak for ATI, but actually all FP32 math on Nvidia architectures for many generations now has been IEEE compliant, excluding NAN and -inf +inf and exception handling cases, and except for their hardware sin, cos, log implementations, and except when using the fused multiply add instruction (though the last one you could actually get around by using special compiler intrinsics to avoid the fusing).
That while the x264 guys aren't wrong to want to keep working on a software encoder that is tweakable, there is nothing wrong with a fixed function hardware encoder for some tasks. Sometimes, speed is what you want and "good enough" is, well, good enough.
Like at work I edit instructional videos for our website (I work at a university) using Vegas. I use its internal H.264 encoders, which can be accelerated using the GPU. They are quite zippy, I can generally get a realtime or better encode, even when there is a decent amount of shit going on in the video that needs to be processed (remember that Vegas isn't for video conversion, I'm doing editing, effects, that kind of thing).
Now the result is not up to x264 quality, per bit. I could get better quality by mucking around setting up an avisynth frameserver and having x264 do the encoding using some tweaked settings for high quality. However it would be much slower.
Not worth it. I'll just encoder a reasonably high bitrate video. It is getting fed to Youtube anyhow, so there's a limit to how good it is going to look. The faster hardware assisted encode speeds are worth it.
If I was mastering a Blu-ray? Ya I might do the final encode to go off to fabrication with x264 (actually more likely an expensive commercial solution that can generate mastering compliant bitstreams). Spend the extra time to get it as quality as possible because of all the other work and because it could actually be noticable.
There is room for both approaches.