The Wretched State of GPU Transcoding
MrSeb writes "This story began as an investigation into why Cyberlink's Media Espresso software produced video files of wildly varying quality and size depending on which GPU was used for the task. It then expanded into a comparison of several alternate solutions. Our goal was to find a program that would encode at a reasonably high quality level (~1GB per hour was the target) and require a minimal level of expertise from the user. The conclusion, after weeks of work and going blind staring at enlarged images, is that the state of 'consumer' GPU transcoding is still a long, long way from prime time use. In short, it's simply not worth using the GPU to accelerate your video transcodes; it's much better to simply use Handbrake, which uses your CPU."
Yes, they are cheating. That is exactly how they are getting it to be so fast.
The GPU isn't meant to do everything. If it were, there wouldn't be a CPU. Considering the hatred that was poured on Quicksync here, and that Quicksync still produces better quality transcodes than GPUs while being substantially faster, I don't think we'll be seeing the end of CPU transcoding anytime soon.
...since the results of OpenCL code are static across GPUs rather than being an arbitrary output.
There is a screwed up graph on page two where they use the same graphic twice, and the caption describes aspects of the one that is missing. I really wanted to see the comparison too. You would think in an article of that size and scope someone would be responsible for checking layout as well as copy. It is no wonder we are losing to China. Their English may be worse, but their work ethic and attention to detail are possibly better.
Here's a link to the article in 1 page.
has anyone tried Badaboom?
They're using their grammar skills there.
What I don't understand is how this happens. Why would the same calculation get different results on different GPUs? Are they doing the math incorrectly?
The real problem is a lack of a common API for encoding regardless of GPU/CPU, which leads to vendor-specific implementations with varying degrees of quality. The most efficient way to pretty much do anything is a dedicated HW block (from both perf and power point of view), so there is no question that there is value in encoding using dedicated hardware, but the software has to catch up.
Hint: Not all GPUs have IEEE FP compliant math. Often they break the standard, or do something else altogether just to improve performance.
Because behind the scenes your "encoder" program is actually using several different encoders. Generally the encoder has to be custom written specifically for the specialized GPU hardware it is targeting.
has anyone tried Badaboom?
Not much point. It's been discontinued.
CPUs used to be like that too.
As the author:
Because 3000-word articles with PNGs at ~300K per large image and 100K per preview image aren't fun reading in a single go. There's ~1.5MB of imagery just on the third page. Pages 3-8 have about the same, and that's with the images only loaded as thumbnails.
If you've got a fast net connection, you won't care. If you don't have a fast net connection, loading 16MB of images at once isn't a lot of fun.
Visual quality comparisons are one area where you can't use low-quality JPGs. A 9-page article at ET is a real rarity, it's not something we do because we want to spam ads.
There are probably two issues here, but the calculations we're talking about are floating-point calculations. And as every programmer should know, floating-point calculations done by different CPUs or GPUs aren't guaranteed to give you identical results: http://developers.slashdot.org/story/10/05/02/1427214/what-every-programmer-should-know-about-floating-point-arithmetic
Also, we're talking about GPUs here. GPUs aren't even designed to give you IEEE standard results. Instead they're designed to give approximate results intended to be used for real time graphics display.
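One concrete way this can happen, even on hardware that follows IEEE 754 exactly: floating-point addition is not associative, so two devices that add the same numbers in a different order can legitimately disagree. A minimal Python sketch (the function names and the sample values are made up for illustration; nothing here is taken from any actual encoder):

```python
import math

def sequential_sum(values):
    """Add left to right, the way a single scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Add in a tree, the way a parallel reduction typically would."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

# Illustrative values chosen so that rounding is visible in FP64.
values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))  # 1.0
print(pairwise_sum(values))    # 0.0
print(math.fsum(values))       # 2.0 (correctly rounded reference)
```

Same inputs, same arithmetic rules, three different answers depending only on the order of the additions. Whether this is what actually drives the quality differences in the article is a separate question.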
Because they're not using the same encode paths.
All 3 hardware encode paths - Intel QuickSync, AMD AVIVO, and NVIDIA's CUDA video encoder - are largely black boxes. Programs such as MediaEspresso are basically passing off a video stream to the device along with some parameters and hoping the encoder doesn't screw things up too badly. Each one is going to be optimized differently, be it for speed, more aggressive deblocking, etc. These are decisions built into the encoder and cannot be completely controlled by the application calling the hardware. And you have further complexities such as the fact that only Intel's QuickSync has a hardware CABAC encoder, while AMD and NV do it in software (and poorly since it doesn't parallelize well).
Or to put this another way, everyone has their own idea on what the best way is to encode H.264 video and what kind of speed/quality tradeoff is appropriate, and that's why everything looks different.
Because behind the scenes your "encoder" program is actually using several different encoders. Generally the encoder has to be custom written specifically for the specialized GPU hardware it is targeting.
This has largely ceased to present a problem, thanks to OpenCL.
GPU code no longer needs to run as custom-written shaders targeting 20 different platforms. One program, written in fairly straightforward C, will run on just about any modern platform. And it will do so at speeds that absolutely dwarf a CPU - The Radeon x9yy cards (for x>=5) easily crush a modern CPU at OpenCL code by a factor of a thousand. The x8yy cards still perform admirably, over three hundred to one. For NVidia, the Tesla series do well, while the GX... Well, ten to fifty times faster doesn't exactly suck...
The real problem here? Most people have really crappy GPUs. Even compared to the $100 card range, your GPU sucks ass, and hard. And you can't really blame people, because honestly, even modern IGPs will run just about anything fairly well, so why would you pay for more?
But don't blame the GPUs, or the concept in general. If you target OpenCL and the user has a halfway decent modern GPU, it will give consistent, reliable results, and will blow away your CPU many times over.
Pity the Handbrake devs are dickwads.
1. It's not funny.
2. They make an excellent bit of software that I have been using for free for years. Unless you helped them out you can't complain.
3. The guys creating Handbrake and the guys making video encoders are not the same people, so your rant is misdirected.
4. I mailed them two suggestions for improvements, and both got implemented. Now this may be because my suggestions were the kind of things that were (a) genuine improvements and (b) interesting for the developer and therefore would have been implemented anyway, but in my experience they are responsive to the right kind of suggestions.
but, at least in this context, speed is nearly irrelevant because it fails at the task at hand, producing high quality video.
who cares how fast it completes a task if it's failing? Nobody gives little jimmy props when he finishes the hour-long test in 5 minutes but scores a 37% on it.
Nice shill paper you got there... Of course a paper made by the throughput computing lab and the Intel architecture group (both Intel corp) will advocate there's not much speedup by a GPU when compared to a CPU.
The big thing to note is that with a GPU, you have to do what you did when working with the original SSE (intel...) instruction set on regular CPUs: FP16 numbers will not have a significant amount of precision, so you must take that into account when programming with that instruction set in mind. It's not as if people haven't been performing calculations with numbers bigger than the bit width of the CPU's instructions. Modern GPUs are getting much beefier with double precision math as well.
The 5000 series Radeons (not examined in the paper) have much better DP performance than the GeForce GTX 280 that was compared. The Radeon 5970, for example, has 18x the DP GFLOPS of the i7-960, and they both went to market at the same time. For SP data, the 5970 is 46x better than the i7-960.
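To put a number on the FP16 remark above, here is a small Python sketch that rounds ordinary floats to IEEE 754 half precision using the standard `struct` module's `e` format. The helper name `to_fp16` is made up for this example, and it says nothing about which formats any particular GPU encoder actually uses:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# FP16 has an 11-bit significand, so above 2048 it cannot represent every integer.
print(to_fp16(2048.0))  # 2048.0
print(to_fp16(2049.0))  # 2048.0
print(to_fp16(0.1))     # close to 0.1, but off in the fourth significant digit
```

With only about three decimal digits to work with, any algorithm that accumulates many small values in FP16 has to be written with that limit in mind.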
who cares how fast it completes a task if it's failing? Nobody gives little jimmy props when he finishes the hour-long test in 5 minutes but scores a 37% on it.
I agree that presents something of a problem for current implementations; the concept of GPU transcoding doesn't fail, however. Only the fact that those currently pushing it have tried to show at least modest gains for everybody - meaning those with massively inappropriate hardware - has made it such an abysmal failure to date.
To repeat my earlier post, if you target an OpenCL-capable GPU, you will get consistent results; and if you target a card with a reasonable number of compute units, (58xx/59xx/68xx/69xx/tesla), you'll see performance far beyond what a modern CPU can give.
Does that make GPU transcoding the best choice for the general public at present? No! But for those with the hardware, the comparison counts as literally laughable.
That's why video professionals and TV stations rely on hardware-based transcoding, and these solutions tend to be expensive.
x264 can encode 1080p in realtime on modern Intel CPUs (Sandy Bridge, etc.) with pretty much as good a quality for the same bitrate as most hardware solutions. For non-HD, x264 just smokes hardware, as it can do better than realtime encodes at very high quality on those same CPUs.
I set out to test presets. Specifically, I set out to test the presets of software packages which are sold on the purported *strength* of those presets. I say so in the first paragraph:
" Our goal was to find a program that would encode at a reasonably high quality level (~1GB per hour was the target) and require a minimal level of expertise from the user."
That's why MediaCoder results weren't included.
The entire article came about because Cyberlink's iPhone 4S preset yielded files that were 1.4GB if I used CPU encoding or a GTX 580, and 188MB if I used Quick Sync. That disparity is what I noticed when I went to check encode quality for the initial IVB review.
Can you build custom profiles in CME and create outputs that avoid these problems? You can -- though some options aren't available. That, however, is not the point. If I'm going to build my own custom profiles, I can download a copy of MediaCoder for free and do it with a more powerful piece of software that offers a huge number of options.
I did a review of "Software that claims to automate the GPU encode process." I did not do a review of "Can Cyberlink MediaEspresso EVER create a decent image?" Given what I set out to evaluate, my ability to tweak profiles to achieve a satisfactory result is not a valid criterion for my conclusions.
No, the article says that GPU encoding software runs the gamut from outright awful to simply broken and limited. Quick Sync video is great in Arcsoft, terrible in Cyberlink, unsupported in Xilisoft, and looks decent in MediaCoder. Check the GTX 580's output in Xilisoft for plenty of proof that no, you don't need insane bitrates to create decent-looking output.
And remember that this is not necessarily lower quality! There are valid reasons for not following the complexities of IEEE floating point if you have no need for portability.
OK, so I have a Sandy Bridge K processor. What else do I need to make QS work?
I've got a first-generation Fermi-based GTX 470. Considering that, at the time, parallel compute power was the big halo selling point of the new Fermi GPU, I was very underwhelmed when I finally found some software that would actually use it. I saw speedups of only about 3x or so over my Core 2 Duo E8400 (only a 2-core!), and the quality was abysmal in comparison.
I'm not saying that GPU transcoding -shouldn't- be a better option than CPU transcoding; it completely should be. But current implementations seem to have completely ignored why we're transcoding in the first place and what our goals are. Having a faster transcode is nice, yes. Faster is always nice, but it's simply not worth the tradeoff in quality, which is the point.
I transcode Blu-ray rips to MKV at dual-pass 8000kbps with x264's "slower" setting on an Athlon II 245. The average encode takes anywhere from 36 to 48 hours. But I'm cool with that. Do it right that once, and you're set for life. They're beautiful encodes.
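For scale, the 8000kbps figure above can be compared with the article's ~1GB per hour target by converting between average bitrate and file size. A rough Python sketch, assuming decimal units (1 kbps = 1000 bits/s, 1 GB = 10^9 bytes) and ignoring audio and container overhead; the function names are made up for this example:

```python
def gb_per_hour(kbps: float) -> float:
    """Size in decimal GB of one hour of video at the given average bitrate (kilobits/s)."""
    bits = kbps * 1000 * 3600   # bits in one hour
    return bits / 8 / 1e9       # bits -> bytes -> GB

def kbps_for_gb_per_hour(gb: float) -> float:
    """Average bitrate (kilobits/s) that fills the given number of decimal GB in one hour."""
    return gb * 1e9 * 8 / 3600 / 1000

print(gb_per_hour(8000))          # about 3.6 GB per hour
print(kbps_for_gb_per_hour(1.0))  # about 2222 kbps
```

So, under those assumptions, an 8000kbps encode is roughly 3.6 times the size of what the article was aiming for, which is worth keeping in mind when comparing quality between the two.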
Even with IEEE, things aren't that simple: http://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/
Hint: Not all GPUs have IEEE FP compliant math. Often they break the standard, or do something else altogether just to improve performance.
I can't speak for ATI, but actually all FP32 math on Nvidia architectures for many generations now has been IEEE compliant, excluding NAN and -inf +inf and exception handling cases, and except for their hardware sin, cos, log implementations, and except when using the fused multiply add instruction (though the last one you could actually get around by using special compiler intrinsics to avoid the fusing).
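The fused multiply-add exception is easy to demonstrate without a GPU. A fused a*b + c rounds once, while a separate multiply followed by an add rounds twice, and the two can differ. The Python sketch below emulates the fused version with exact rational arithmetic; it works in FP64 for convenience (the same effect applies to FP32), and the helper names and inputs are invented for the example:

```python
from fractions import Fraction

def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the addition."""
    return a * b + c

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated FMA: compute a*b + c exactly, then round to a float once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Illustrative inputs: the exact product is 1 - 2**-60, which rounds to 1.0 in FP64.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mul_then_add(a, b, c))        # 0.0
print(fused_multiply_add(a, b, c))  # -2**-60, about -8.67e-19
```

Both answers are produced by IEEE-conformant operations; they just are not the same operations, so code compiled with and without fusing need not match bit for bit.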
Props to Jimmy, he got 37% right in 8.3% of the time, and even more credit since I assume not everyone could get a 100% in an hour, or what would be the point of the test.
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
That while the x264 guys aren't wrong to want to keep working on a software encoder that is tweakable, there is nothing wrong with a fixed function hardware encoder for some tasks. Sometimes, speed is what you want and "good enough" is, well, good enough.
Like at work I edit instructional videos for our website (I work at a university) using Vegas. I use its internal H.264 encoders, which can be accelerated using the GPU. They are quite zippy, I can generally get a realtime or better encode, even when there is a decent amount of shit going on in the video that needs to be processed (remember that Vegas isn't for video conversion, I'm doing editing, effects, that kind of thing).
Now the result is not up to x264 quality, per bit. I could get better quality by mucking around setting up an avisynth frameserver and having x264 do the encoding using some tweaked settings for high quality. However it would be much slower.
Not worth it. I'll just encode a reasonably high bitrate video. It is getting fed to Youtube anyhow, so there's a limit to how good it is going to look. The faster hardware assisted encode speeds are worth it.
If I was mastering a Blu-ray? Ya I might do the final encode to go off to fabrication with x264 (actually more likely an expensive commercial solution that can generate mastering compliant bitstreams). Spend the extra time to get it as quality as possible because of all the other work and because it could actually be noticeable.
There is room for both approaches.
Shill paper? I guess you prefer your papers sponsored by Nvidia, showing a device of no more than 5x memory bandwidth improvement and no more than 5x flops improvements getting 1000x performance increases? *Those are shill papers*. Today the situation is slightly changed, but not hugely.
i7-3770K makes 112 DP GFLOPS and 25.6GB/s memory bandwidth.
5970 makes 928 DP GFLOPS and 256GB/s memory bandwidth.
So they're both within 10x. The 5970 makes 4.64 SP TFLOPS, which is about 20x the SP FLOPS of the i7-3770K.
Still not going to get 100x performance increases, are you? 1000x? Pfft
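The ratios are quick to check from the figures quoted above. In this Python sketch the CPU's single-precision peak is not given in the comment, so it is assumed to be twice its double-precision figure; that assumption, and the variable names, are mine:

```python
# Peak figures as quoted in the comment above.
I7_3770K = {"dp_gflops": 112.0, "mem_gbs": 25.6}
HD_5970 = {"dp_gflops": 928.0, "sp_gflops": 4640.0, "mem_gbs": 256.0}

# Assumption (not stated in the thread): the CPU's SP peak is 2x its DP peak.
I7_3770K["sp_gflops"] = 2 * I7_3770K["dp_gflops"]

def advantage(metric: str) -> float:
    """How many times higher the GPU's peak figure is than the CPU's."""
    return HD_5970[metric] / I7_3770K[metric]

for metric in ("dp_gflops", "sp_gflops", "mem_gbs"):
    print(f"{metric}: {advantage(metric):.1f}x")
```

That works out to roughly 8x for DP, 21x for SP and 10x for memory bandwidth, all peak numbers, so real workloads would be expected to land at or below them.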
Please see Elemental Technologies GPU-accelerated H.264 transcodes.
The math units on every nVidia card made since at least late 2009, both single and double precision, are IEEE 754 compliant. The only excuse for it being wrong is that someone deliberately used the __fast non-primitive operations (sqrt/log/exp & friends), which compromise the algorithms used to compute transcendental operations. The exact extent of the compromise is detailed in the back of the nVidia CUDA guide.
I agree it would be pathetic if this were because someone passed -ffast-math or whatever it is to nvcc.