Slashdot Mirror


The Wretched State of GPU Transcoding

MrSeb writes "This story began as an investigation into why Cyberlink's Media Espresso software produced video files of wildly varying quality and size depending on which GPU was used for the task. It then expanded into a comparison of several alternate solutions. Our goal was to find a program that would encode at a reasonably high quality level (~1GB per hour was the target) and require a minimal level of expertise from the user. The conclusion, after weeks of work and going blind staring at enlarged images, is that the state of 'consumer' GPU transcoding is still a long, long way from prime time use. In short, it's simply not worth using the GPU to accelerate your video transcodes; it's much better to simply use Handbrake, which uses your CPU."

158 comments

  1. Lack of standards, quality. by Anonymous Coward · · Score: 1

    I've heard from a lot of sources that the quality of output from various GPU accelerated video encoding schemes almost invariably falls short when compared to an established, known good CPU based video encoding scheme. When the GPU encoders can match quality, will they still be fast? Are they just cheating now? What gives?

    1. Re:Lack of standards, quality. by Anonymous Coward · · Score: 2, Informative

      Yes, they are cheating. That is exactly how they are getting it to be so fast.

    2. Re:Lack of standards, quality. by DragonTHC · · Score: 2

      has anyone tried Badaboom?

      --
      They're using their grammar skills there.
    3. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      Yep, it's shit.

    4. Re:Lack of standards, quality. by Hatta · · Score: 4, Insightful

      What I don't understand is how this happens. Why would the same calculation get different results on different GPUs? Are they doing the math incorrectly?

      --
      Give me Classic Slashdot or give me death!
    5. Re:Lack of standards, quality. by cheesybagel · · Score: 5, Informative

      Hint: Not all GPUs have IEEE FP compliant math. Often they break the standard, or do something else altogether just to improve performance.

    6. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      You mean to tell me that a device with no more than 5x advantage in memory bandwidth or peak FLOPS is *not* actually capable of 100x or 1000x or even 10x speed advantage?? Say it aint so.

      Yes, they cheat. What they do is take a poorly optimized, single threaded program, and run it on the biggest, baddest CPU there is, and get a pretty sad performance number. Then they optimize the shit out of it, vectorize it, parallelize it, and then run it on their GPU to compare with. Sometimes they go so far as to completely change the algorithm to a faster or less accurate one.

      http://www.cs.utexas.edu/users/ckkim/papers/isca10_ckkim.pdf

      That's not to say there are no advantages. They do typically have significantly better peak flops and bandwidth numbers, which often can translate into a good advantage in like-for-like comparisons. But it's more like 2x. 5x in favorable cases.

    7. Re:Lack of standards, quality. by TD-Linux · · Score: 3, Interesting

      Because behind the scenes your "encoder" program is actually using several different encoders. Generally the encoder has to be custom written specifically for the specialized GPU hardware it is targeting.

    8. Re:Lack of standards, quality. by PCM2 · · Score: 4, Informative

      has anyone tried Badaboom?

      Not much point. It's been discontinued.

      --
      Breakfast served all day!
    9. Re:Lack of standards, quality. by SplashMyBandit · · Score: 2

      CPUs used to be like that too.

    10. Re:Lack of standards, quality. by grouchomarxist · · Score: 2

      There are probably two issues here, but the kind of calculations we're talking about here are floating-point calculations. And as every programmer should know floating-point calculations done by different CPUs or GPUs don't give you consistent results: http://developers.slashdot.org/story/10/05/02/1427214/what-every-programmer-should-know-about-floating-point-arithmetic

      Also, we're talking about GPUs here. GPUs aren't even designed to give you IEEE standard results. Instead they're designed to give approximate results intended to be used for real time graphics display.

    11. Re:Lack of standards, quality. by rsmith-mac · · Score: 4, Informative

      Because they're not using the same encode paths.

      All 3 hardware encode paths - Intel QuickSync, AMD AVIVO, and NVIDIA's CUDA video encoder - are largely black boxes. Programs such as MediaEspresso are basically passing off a video stream to the device along with some parameters and hoping the encoder doesn't screw things up too badly. Each one is going to be optimized differently, be it for speed, more aggressive deblocking, etc. These are decisions built into the encoder and cannot be completely controlled by the application calling the hardware. And you have further complexities such as the fact that only Intel's QuickSync has a hardware CABAC encoder, while AMD and NV do it in software (and poorly since it doesn't parallelize well).

      Or to put this another way, everyone has their own idea on what the best way is to encode H.264 video and what kind of speed/quality tradeoff is appropriate, and that's why everything looks different.

    12. Re:Lack of standards, quality. by pla · · Score: 4, Informative

      Because behind the scenes your "encoder" program is actually using several different encoders. Generally the encoder has to be custom written specifically for the specialized GPU hardware it is targeting.

      This has largely ceased to present a problem, thanks to OpenCL.

      GPU code no longer needs to run as custom-written shaders targeting 20 different platforms. One program, written in fairly straightforward C, will run on just about any modern platform. And it will do so at speeds that absolutely dwarf a CPU - The Radeon x9yy cards (for x>=5) easily crush a modern CPU at OpenCL code by a factor of a thousand. The x8yy cards still perform admirably, over three hundred to one. For NVidia, the Tesla series do well, while the GX... Well, ten to fifty times faster doesn't exactly suck...


      The real problem here? Most people have really crappy GPUs. Even compared to the $100 card range, your GPU sucks ass, and hard. And you can't really blame people, because honestly, even modern IGPs will run just about anything fairly well, so why would you pay for more?


      But don't blame the GPUs, or the concept in general. If you target OpenCL and the user has a halfway decent modern GPU, it will give consistent, reliable results, and will blow away your CPU many times over.

    13. Re:Lack of standards, quality. by Skarecrow77 · · Score: 5, Insightful

      but, at least in this context, speed is nearly irrelevant because it fails at the task at hand, producing high quality video.

      who cares how fast it completes a task if it's failing? Nobody gives little jimmy props when he finishes the hour-long test in 5 minutes but scores a 37% on it.

    14. Re:Lack of standards, quality. by Anonymous Coward · · Score: 2, Insightful

      Nice shill paper you got there... Of course a paper made by the throughput computing lab and the Intel architecture group (both Intel corp) will advocate there's not much speedup by a GPU when compared to a CPU.

      The big thing to note is that with a GPU, you have to do what you did when working with the original SSE (intel...) instruction set on regular CPUs: FP16 numbers will not have a significant amount of precision, so you must take that into account when programming with that instruction set in mind. It's not as if people haven't been performing calculations with numbers bigger than the bit width of the CPU's instructions. Modern GPUs are getting much beefier with double precision math as well.

      The 5000 series Radeons (not examined in the paper) have much better DP performance than the GeForce GTX 280 that was compared. The Radeon 5970 for example has 18x the DP GFLOPS that the i7-960 has, and they both went to market at the same time. For SP data, the 5970 is 46x better than the i7-960.

    15. Re:Lack of standards, quality. by pla · · Score: 3, Informative

      who cares how fast it completes a task if it's failing? Nobody gives little jimmy props when he finishes the hour-long test in 5 minutes but scores a 37% on it.

      I agree that presents something of a problem for current implementations; the concept of GPU transcoding doesn't fail, however. Only the fact that those currently pushing it have tried to show at least modest gains for everybody - meaning those with massively inappropriate hardware - has made it such an abysmal failure to date.

      To repeat my earlier post, if you target an OpenCL-capable GPU, you will get consistent results; and if you target a card with a reasonable number of compute units, (58xx/59xx/68xx/69xx/tesla), you'll see performance far beyond what a modern CPU can give.

      Does that make GPU transcoding the best choice for the general public at present? No! But for those with the hardware, the comparison counts as literally laughable.

    16. Re:Lack of standards, quality. by Darinbob · · Score: 3, Informative

      And remember that this is not necessarily lower quality! There are valid reasons for not following the complexities of IEEE floating point if you have no need for portability.

    17. Re:Lack of standards, quality. by Skarecrow77 · · Score: 2

      I've got a first generation fermi-based GTX 470. Considering that, at the time, the parallel compute power was the big halo selling point of the new fermi gpu, I was very underwhelmed when I finally found some software that would actually use it. I saw speedups of only about 3x or so above and beyond my core 2 duo (only a 2-core!) e8400, and the quality was abysmal in comparison

      I'm not saying that GPU transcoding -shouldn't- be a better option than cpu transcoding, it completely should be, but current implementations seem to have completely ignored why we're transcoding in the first place and what our goals are. having a faster transcode is nice, yes. faster is always nice, but it's simply not worth the tradeoff in quality, which is the point.

      I transcode blu-ray rips to mkv at dual-pass 8000kbps with x264's "slower" setting on an athlon II 245. the average encode takes anywhere from 36 to 48 hours. But I'm cool with that. do it right that once, and you're set for life. They're beautiful encodes.
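
      For scale, the bitrate arithmetic here can be sketched as follows. This is a rough illustration only: it assumes decimal units (1 GB = 10^9 bytes, 1 kbps = 1000 bits per second) and ignores audio tracks and container overhead, and the helper names are made up for this example.

```python
SECONDS_PER_HOUR = 3600


def gb_per_hour(video_kbps):
    """Approximate video size in GB for one hour at a constant average bitrate."""
    bytes_per_second = video_kbps * 1000 / 8
    return bytes_per_second * SECONDS_PER_HOUR / 1e9


def kbps_for_gb_per_hour(target_gb):
    """Average video bitrate in kbps needed to hit a size-per-hour target."""
    bits_per_hour = target_gb * 1e9 * 8
    return bits_per_hour / SECONDS_PER_HOUR / 1000


if __name__ == "__main__":
    print(f"8000 kbps is about {gb_per_hour(8000):.1f} GB per hour")
    print(f"1 GB per hour is about {kbps_for_gb_per_hour(1.0):.0f} kbps")
```

      By this estimate an 8000kbps encode comes to roughly 3.6 GB per hour of video, while the ~1GB per hour target from the story works out to a little over 2200 kbps.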

    18. Re:Lack of standards, quality. by ville · · Score: 2
    19. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      Even with IEEE FP compliant math you get poor quality transcodes, so this can't be the issue.

    20. Re:Lack of standards, quality. by parlancex · · Score: 5, Informative

      Hint: Not all GPUs have IEEE FP compliant math. Often they break the standard, or do something else altogether just to improve performance.

      I can't speak for ATI, but actually all FP32 math on Nvidia architectures for many generations now has been IEEE compliant, excluding NAN and -inf +inf and exception handling cases, and except for their hardware sin, cos, log implementations, and except when using the fused multiply add instruction (though the last one you could actually get around by using special compiler intrinsics to avoid the fusing).
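
      The fused multiply-add exception can be illustrated on a CPU. The sketch below is an emulation, not GPU code: it assumes a fused multiply-add computes a*b + c exactly and rounds once at the end, and it models that with Python's exact Fraction arithmetic. The function names and the input values are chosen purely for this example.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Separate operations: the product is rounded, then the sum is rounded."""
    return a * b + c


def emulated_fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs picked so that the exact product 1 - 2**-54 is not a double
# and gets rounded to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

separate = multiply_then_add(a, b, c)
fused = emulated_fused_multiply_add(a, b, c)

if __name__ == "__main__":
    print("separate:", separate)
    print("fused:   ", fused)
```

      Both answers are legitimate outcomes of correctly rounded arithmetic; they differ only in how many rounding steps were taken, which is why fusing can change results between two otherwise compliant devices.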

    21. Re:Lack of standards, quality. by Surt · · Score: 2

      Props to Jimmy, he got 37% right in 8.3% of the time, and even more credit since I assume not everyone could get a 100% in an hour, or what would be the point of the test.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    22. Re:Lack of standards, quality. by ldobehardcore · · Score: 1

      I think of it with a different analogy. Instead of little Jimmy and the test, I prefer the BK Lounge metaphor:

      Burger King manages to hand you "food" in 11 seconds, compared to 20 minutes at Shari's (or wherever, insert a middle-of-the-road place here).
      The "food" is consistently inedible at BK, whereas the food at Shari's or wherever won't make you sick after bite 2.

      The GPGPU is shitty at video transcoding, but boy howdy it's fast. And that's completely beside the point.
      What good is a "burger" in 11 seconds if you can't keep it down? What good is a 24fps video transcoded at 450fps when the end result is nearly universally unwatchable?

      --
      Hectice, baby, Mercator says hello to you
    23. Re:Lack of standards, quality. by Anonymous Coward · · Score: 2, Interesting

      Shill paper? I guess you prefer your papers sponsored by Nvidia, showing a device of no more than 5x memory bandwidth improvement and no more than 5x flops improvements getting 1000x performance increases? *Those are shill papers*. Today the situation is slightly changed, but not hugely.

      i7-3770K makes 112 DP GFLOPS and 25.6GB/s memory bandwidth.
      5970 makes 928 DP GFLOPS and 256GB/s memory bandwidth.

      So they're both within 10x. It makes 4.64 SP TFLOPS, which is about 20x the SP FLOPS.

      Still not going to get 100x performance increases, are you? 1000x? Pfft
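
      The ratios behind those figures can be checked with a few lines. This is only a sketch using the peak numbers quoted in this comment; the CPU single precision peak is not given above, so the code assumes it is twice the double precision peak, which is an assumption made for this example.

```python
# Peak figures as quoted in the comment.
cpu = {"dp_gflops": 112.0, "bandwidth_gb_s": 25.6}  # i7-3770K
gpu = {"dp_gflops": 928.0, "sp_gflops": 4640.0, "bandwidth_gb_s": 256.0}  # Radeon 5970

# Assumption (not stated in the comment): CPU SP peak is 2x its DP peak.
cpu_sp_gflops = 2 * cpu["dp_gflops"]

dp_ratio = gpu["dp_gflops"] / cpu["dp_gflops"]
bandwidth_ratio = gpu["bandwidth_gb_s"] / cpu["bandwidth_gb_s"]
sp_ratio = gpu["sp_gflops"] / cpu_sp_gflops

if __name__ == "__main__":
    print(f"DP FLOPS ratio:  {dp_ratio:.1f}x")
    print(f"Bandwidth ratio: {bandwidth_ratio:.1f}x")
    print(f"SP FLOPS ratio:  {sp_ratio:.1f}x")
```

      Under that assumption the peak advantage is roughly 8x in double precision, 10x in memory bandwidth, and about 20x in single precision.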

    24. Re:Lack of standards, quality. by Anonymous Coward · · Score: 1

      Also, the radeon requires about 3x the amount of power to do this as the Intel CPU alone, not including the CPU which will be required to drive the GPU.

      Total cost will also be significantly higher for the GPU.

      They're simply way over-hyped. Anybody who fell for the rash of 100x, 200x, 500x, 1000x claims was simply deluding themselves.

      They struggle to give even a single order of magnitude increase, most of the time.

    25. Re:Lack of standards, quality. by tyrione · · Score: 1

      Thank you. I'm personally getting sick and tired of badly written articles on Parallel Programming discussing CUDA and having to wade through crap before one sharp post discusses OpenCL. OpenCL 1.2 is very robust and we'll be seeing OpenCL 2.0 this August.

    26. Re:Lack of standards, quality. by The+Master+Control+P · · Score: 2

      The math units on every nVidia card made since at least late 2009, both single and double precision, are ieee754 compliant. The only excuse for it being wrong is that someone deliberately used the __fast non-primitive operations (sqrt/log/exp & friends), which compromise the algorithms used to compute transcendental operations. The exact extent of the compromise is detailed in the back of the nVidia CUDA guide.

      I agree it would be pathetic if this were because someone passed -ffast-math or whatever it is to nvcc.

    27. Re:Lack of standards, quality. by The+Master+Control+P · · Score: 1

      Every GPU from nVidia for 3 full hardware generations (Since compute architecture 1.3 - 2009 at least, possibly earlier) has had IEEE754 compliant fp32 and fp64 math. I imagine ATI has been compliant for as long also.

      It is true that code can be compiled using libraries that deliberately compromise the algorithms for transcendental functions to make them faster, but that's 100% the programmer's fault.

    28. Re:Lack of standards, quality. by The+Master+Control+P · · Score: 1

      It's not just about theoretical FLOPS and main memory bandwidth.

      A properly written GPU program is ideally never relying on main memory except to keep a buffer filled - it feeds all its FPUs from shared memory, which can deliver an aggregate bandwidth of TBps to about 750KB (total) of space in a good card.

    29. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      Obviously properly written CPU programs don't go through various complex hoops to achieve cache obliviousness.

    30. Re:Lack of standards, quality. by Anonymous Coward · · Score: 1

      No, it largely is about memory bandwidth.

      Caches on the card and caches on CPUs obviously are very important too, but that is hardly where the bottlenecks are on this kind of number crunching job, and the caches tend to be matched pretty well with compute capabilities.

      Actually if it *was* about theoretical FLOPS and memory bandwidth, GPUs would do far better than they do in practice, because far more threads and vectorization are required to do the work (Amdahl's law), and because data additionally has to be transferred into and out of GPU local memory.

      In practice, CPUs come far closer to their theoretical limits than GPUs do. As you see in the paper linked, GPU has around 5x advantage over CPU, but averages just 2.5x speedup. A GPU with 10x advantage today, would be lucky to average a 5x speedup on the average HPC code that is suited to GPU use. There are some very special cases where GPU texture units can be used for more computation than the usual total FLOPS number suggests. So in rare cases you might see larger speedup than the mem and flops suggests. But 100x, 1000x? no.
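
      Amdahl's law, mentioned above, puts a ceiling on such claims. A minimal sketch follows; the parallel fractions and raw speedups are illustrative values, not measurements from the linked paper, and the model ignores the cost of moving data to and from the card.

```python
def amdahl_speedup(parallel_fraction, accelerated_speedup):
    """Overall speedup when only part of the runtime is accelerated.

    parallel_fraction: share of the original runtime that benefits (0..1).
    accelerated_speedup: how much faster that share runs on the accelerator.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accelerated_speedup <= 0:
        raise ValueError("accelerated_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerated_speedup)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        for raw in (5, 100, 1000):
            overall = amdahl_speedup(fraction, raw)
            print(f"{fraction:.0%} accelerated by {raw}x -> {overall:.2f}x overall")
```

      With 90% of the runtime accelerated, even a device that is 1000x faster on that part yields an overall speedup just under 10x.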

    31. Re:Lack of standards, quality. by TheBogBrushZone · · Score: 1

      I used the first version of Badaboom which was pretty good until I upgraded to an unsupported nVidia GPU and then had to wait several months for a new, almost abandonware version that would cost me again. The focus was definitely on speed rather than video quality. PavTube is a good replacement since it supports Blu-ray as well as DVD though there isn't much of a speed difference apparent between GPU transcoding on a GTX580 and CPU transcoding using a 3.2GHz i7.

      --
      And behold, a command prompt and he who sat upon it, his name was shutdown and -h 3:11 followed with him
    32. Re:Lack of standards, quality. by oneofthose · · Score: 1

      They are doing the math correctly. Floating point operations are commutative: a + b = b + a. But they are not associative: (a + b) + c != a + (b + c) in general. This leads to different results on GPUs where algorithms are parallelized in different ways. Let's assume that in your algorithm you have to calculate the sum of the elements of a vector v = [v1, v2, v3, ..., vn]. A CPU might calculate rho = v1 + v2 + v3 + ... + vn. GPU1 might calculate in parallel: rho1 = v1 + v2, rho2 = v3 + v4, ... and finally rho = rho1 + rho2 + ... + rhon. Depending on the GPU, the way such operations are parallelized differs because the number of floating point processors differs. So the hardware modifies the formula in a subtle way that leads to varying results. Great references for this are "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" and of course "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
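
      The reordering effect described above can be reproduced on a CPU. The following is a small sketch, not a model of any particular GPU: the pairwise function stands in for a tree-shaped parallel reduction, and the input values are picked deliberately to exaggerate the rounding differences.

```python
import math


def sequential_sum(values):
    """Add left to right, the way a simple single-threaded loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add neighbours in pairs, then pairs of pairs (a tree-shaped reduction)."""
    values = list(values)
    if not values:
        return 0.0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carries over to the next round
        values = paired
    return values[0]


# At magnitude 1e16 neighbouring doubles are 2 apart, so adding 1.0 gets rounded away.
data = [1e16, 1.0, -1e16, 1.0]

if __name__ == "__main__":
    print("sequential:", sequential_sum(data))
    print("pairwise:  ", pairwise_sum(data))
    print("exact:     ", math.fsum(data))
    print("associative?", (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
```

      The two summation orders and the exactly rounded `math.fsum` give three different answers for the same four numbers. Real video data is far less extreme, but the same mechanism produces small device-to-device differences.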

    33. Re:Lack of standards, quality. by smash · · Score: 0

      Ahh but OpenCL is apple-developed, and apple is bad at the moment on slashdot.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    34. Re:Lack of standards, quality. by beelsebob · · Score: 1

      Because it's not the same calculation. All they must do is succeed in outputting h264 compliant files; there are many h264 compliant files that look something like the original source video. Different methods of encoding produce results closer to or further from the original, at higher or lower file sizes.

    35. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      Totally agree with you on terrible quality food and data (except I'm a BK fan, so I've trained my stomach to tolerate it :P).

      I do have a shades-of-gray counterpoint to your point though.

      I've had a few situations where time's of the essence. With poor planning or "surprises" coming up while being on call, being able to make a 90% unwatchable transcoded file in less time than it takes to take a shower and change clothes makes the difference between having something 10% watchable and having nothing at all.

      It's been my experience that, when starving for food or data, "inedible" becomes much less of a concern than having nothing at all.

    36. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      I'm sorry to break it to you, but OpenCL is not as simple as you make it out to be. It is specifically *not* a solution to have the same kernel work everywhere... well, yes, your kernel will most likely run on a different OpenCL implementation if you don't hit a bug anywhere along the way. The real kicker is that it just won't perform. You need to be very aware of the specific target hardware to write OpenCL kernels that are fast. So you're still writing a new implementation for each device that you want to support. Don't even dare to think that nVidia and ATI chips like the same kinds of optimizations - they don't.

    37. Re:Lack of standards, quality. by StuartHankins · · Score: 1

      Maybe I'm missing something... but if it's taking you 36 to 48 hours to transcode a single video, and assuming you can justify dedicating a system to that purpose for such an extended time, wouldn't the power savings you'd get by purchasing a much faster system be worthwhile? I'm guessing that the power draw for a process which takes so long to complete is substantial, and that you aren't intending to transcode only a couple of videos.

      One source of comparative CPU benchmarks is here http://www.cpubenchmark.net/cpu_lookup.php?cpu=AMD+Athlon+II+X2+245

    38. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      Burger King is not quality food, but won't make you physically sick after bite 2 unless you're already VERY sick.

      Your claim makes you seem foolish and prone to hyperbole.

    39. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      "who cares how fast it completes a task if it's failing?"

      I'd guess most people.
      Nobody has time and when you want to encode your dvd so you can watch it on your iphone, quality is the last thing you will be thinking of. Speed would be the first. When will it be done? My train leaves in half an hour. Hurry up stupid software..

    40. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      So it's compliant except for the parts where it's not?

    41. Re:Lack of standards, quality. by Skarecrow77 · · Score: 1

      Between my wife and I, we have 7 computers. dedicating one to transcoding isn't a big deal. it's also our file server, so it's on 24/7 anyway.

      I transcode 2-5 videos at a time usually. once every few weeks/months.

    42. Re:Lack of standards, quality. by Anonymous Coward · · Score: 0

      I think sick could be taken as "disgusted, going to puke" sick as if one was biting into a burger with a patty made of pure uncooked feces, as opposed to eating under-cooked ground beef that has salmonella, leading to sickness tomorrow. Apologies in advance for any hyperbole

    43. Re:Lack of standards, quality. by pla · · Score: 1

      I'm sorry to break it to you, but OpenCL is not as simple as you make it out to be.

      Sorry to break this to you, but I write OpenCL, and yes, it really does work that simply.

      Now, I will readily admit that getting every last drop of performance out of a given GPU still requires hand-tuning your code to target the specifics of its architecture; But you could say the same for CPUs, as well.


      Don't even dare to think that nVidia and ATI chips like the same kinds of optimizations - they don't.

      Very true - Yet irrelevant. At 50% efficient (and in practice, you - or at least, I - can do comfortably better than that), whether you run on an HD5870 or a GTX480, you will make the CPU look like a relic from the 1980s by comparison.

  2. And the moral of the Story is... by CajunArson · · Score: 3, Informative

    The GPU isn't meant to do everything. If it were, there wouldn't be a CPU. Considering the hatred that was poured on Quicksync here, and that Quicksync still produces better quality Transcodes than GPUs while being substantially faster, I don't think we'll be seeing the end of CPU transcoding anytime soon.

    --
    AntiFA: An abbreviation for Anti First Amendment.
    1. Re:And the moral of the Story is... by Verunks · · Score: 1

      correct me if I'm wrong but doesn't quicksync use the integrated gpu of sandy/ivy bridge cpus?

    2. Re:And the moral of the Story is... by CajunArson · · Score: 5, Informative

      The quick sync hardware is part of the IGP block but it is specialized hardware specifically geared towards transcoding. For example, it is not using the main GPU pipeline and shader hardware to do the transcoding.

      --
      AntiFA: An abbreviation for Anti First Amendment.
    3. Re:And the moral of the Story is... by Dahamma · · Score: 4, Informative

      Quick Sync uses dedicated HW on the die. Intel's solution that uses their GPU is called Clear Video.

    4. Re:And the moral of the Story is... by Dahamma · · Score: 4, Insightful

      Actually, recent GPUs *were* meant to do exactly this type of thing, and have been marketed by Nvidia and ATI heavily for this purpose. Of course there needs to be a CPU as well. The CPU runs the operating system and application code, and offloads very specific, parallelizable work to the GPU. This sort of architecture has existed almost as long as modern CPUs have existed.

      And Quick Sync is even less of a general purpose CPU solution than using a GPU. Quick Sync uses dedicated application specific hardware on the die to do its encoding.

    5. Re:And the moral of the Story is... by Anonymous Coward · · Score: 0

      What's the point if it still sucks? It seems that while better, it is still inadequate.
      Sucking faster, is still sucking.

    6. Re:And the moral of the Story is... by PopeRatzo · · Score: 3, Interesting

      The GPU isn't meant to do everything.

      But since "Graphics Processing" is part of their name, wouldn't you expect them to at least do that?

      Especially considering the price of high-end GPUs is getting up there compared to high-end CPUs.

      --
      You are welcome on my lawn.
    7. Re:And the moral of the Story is... by Anonymous Coward · · Score: 1

      Today's "graphics processing units" are essentially designed to render a large number of triangles on a screen in a highly efficient way. If any other graphics operation is thrown at them, they may simply not be designed to execute it well. Just because it has "graphics" in the name it doesn't mean that it can handle every graphics technology perfectly well.

    8. Re:And the moral of the Story is... by billcopc · · Score: 5, Insightful

      Well see, that's the thing. A GPU is better suited to some kinds of massively parallel tasks, like video encoding. After all, you're applying various matrix transforms to an image, with a bunch of funky floating point math to whittle all that transformed data down to its most significant/perceptible bits. GPUs are supposed to be really really good at this sort of thing.

      My hunch is that the problems we're seeing are caused by two big issues:

      1. lack of standardization across GPU processing technologies. CUDA vs OpenCL vs Quicksync, and a bunch of tag-alongs too. Each one was designed around a particular GPU architecture, so porting programs between them is non-trivial.

      2. lack of expertise in GPU programming. Let's be fair here: GPUs are a drastically different architecture than any PC or embedded platform we're used to programming. While I could follow specs and write an MPEG or H.264 encoder in any high-level language in a fairly straight-forward manner, I can't even begin to envision how I would convert that linear code into a massively parallel algorithm running on hundreds of dumbed-down shader processors. It's not at all like a conventional cluster, because shaders have very limited instruction sets, little memory but extremely fast interconnects. We have a hard enough time making CPU encoders scale to 4 or 8 cores, this requires some serious out-of-the-box thinking to pull off.

      Moving to a GPU virtually requires starting over from scratch. This is a set of constraints that are very foreign to the transcoding world, where the accepted trend was to use ever-increasing amounts of cheaply available CPU and memory, with extensively configurable code paths. The potential is there, but it will take time for the hardware, APIs and developer skills to converge. GPU transcoding should be seen as a novelty for now, just like CPU encoding was 15 years ago when ripping a DVD was extremely error-prone and time-consuming. If you want a quick, low quality transcode, the GPU is your friend. If you're expecting broadcast-quality encodes, you're gonna have to wait a few years for this niche to grow and mature.

      --
      -Billco, Fnarg.com
    9. Re:And the moral of the Story is... by BLKMGK · · Score: 2

      Yeah now go look at the scorn that was heaped on the Intel rep when he approached the x264 guys way late in development. Had they been smart enough to come forward sooner we might have gotten accelerated instructions the x264 guys would have used - not so now it seems. :-(

      --
      Build it, Drive it, Improve it! Hybridz.org
    10. Re:And the moral of the Story is... by Nemyst · · Score: 3, Interesting

      You mean yesterday's, surely. Rasterizers still are required obviously but GPUs nowadays are very much shader-based and not so much polygon-centric (we're far from T&L). They're built to efficiently process short but otherwise arbitrary floating-point operation sequences in extremely parallel scenarii.

    11. Re:And the moral of the Story is... by Anonymous Coward · · Score: 0

      Your mom sucks faster and it seems to work for her.

    12. Re:And the moral of the Story is... by Bengie · · Score: 1

      It does and it shows Intel's GPU may not be the fastest in all areas, but they're quite well rounded as they are a few factors faster than $300+ GPUs.

    13. Re:And the moral of the Story is... by rsmith-mac · · Score: 4, Informative

      Let's be clear here: the x264 guys will never be happy. QuickSync, AMD's Video Codec Engine, and NVIDIA's NVENC all use fixed function blocks. They trade flexibility for speed; it's how you get a hardware H.264 encoder down to 2 mm². There are no buttons to press or knobs to tweak and there never will be, because most of the stuff the x264 guys want to adjust is permanently laid down in hardware. The kind of flexibility they demand can only be done in software on a CPU.

    14. Re:And the moral of the Story is... by rsmith-mac · · Score: 1

      For example, it is not using the main GPU pipeline and shader hardware to do the transcoding

      No, but it is using it for post-processing such as deinterlacing, noise reduction, etc. The shader pipeline is still involved whenever you need to decode something, be it for QuickSync or for just playing back a video on a PC.*

      *Consequently this is why Intel can't quite match AMD or NV in video playback quality; they lack the shader performance to do as much resource intensive processing

    15. Re:And the moral of the Story is... by Ranguvar · · Score: 2

      Except even when you compare the fixed function H.264 encoders to x264 at those exact settings, x264 still dominates.

    16. Re:And the moral of the Story is... by rsmith-mac · · Score: 2

      That's my point though. Fixed function encoders won't be able to match x264 because of a lack of flexibility. They can't be optimized for specific niches, they need to be generalist in order to be decent at everything since the hardware can't be changed.

    17. Re:And the moral of the Story is... by fuzzyfuzzyfungus · · Score: 4, Insightful

      What strikes me as a bad sign is not so much that the GPU transcoding doesn't necessarily produce massive speed improvements; but that the products tested produce overtly broken output in a fair number of not-particularly-esoteric scenarios.

      Expecting super-zippy magic-optimized speedups on all the architectures tested would be the mark of expecting serious maturity. Expecting a commercially released, GPU-vendor-recommended, encoding package to manage things like "Don't produce an h264 lossy-compressed file substantially larger than the DVD rip source file" and "Please don't convert a 24FPS video to 30FPS for no reason on half the architectures tested" seems much more reasonable.

      I can imagine that the subtle horrors of the probably-makes-the-IEEE-cry differences in floating point implementations, or their ilk, might make producing identical encoded outputs across architectures impossible; but these packages appear to be flunking basic sanity checks, even in the parts of the program that are presumably handled on the CPU (when a substantial portion of iPhone 4S handsets are 16GB devices, letting the 'iPhone 4S' preset return a 22GB file while whistling innocently seems like a bad plan...)
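
      One plausible source of the small cross-architecture differences mentioned above is that floating-point addition is not associative, so hardware that sums the same values in a different order can round differently. A minimal Python sketch of the effect follows; it is an illustration only and is not taken from any of the packages tested.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
# The values are arbitrary and chosen only to make the effect visible.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, but not zero
```

      Differences of this size would explain bit-level mismatches between encodes, not a 22GB file or a changed frame rate.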

    18. Re:And the moral of the Story is... by nabsltd · · Score: 3, Interesting

      No, but it is using it for post-processing such as deinterlacing, noise reduction, etc.

      I use the GPU to do FFT noise reduction before some encodes, and it's essentially "free" as it's faster than the 8 threads used by x264 for encoding.

    19. Re:And the moral of the Story is... by Dputiger · · Score: 4, Interesting

      Fuzzy,

      You pretty much nailed my problem with the output. :P That's the reason why Arcsoft, with compatibility problems, ultimately ranked above Cyberlink. Arcsoft doesn't do very good work on the Radeon 7950 and it can't handle CUDA, but it at least gets something right. Quick Sync video is very good.

      Cyberlink got nothing right anywhere. And it's the program most-often recommended to reviewers as a benchmark when we want to review GPU encoding.

    20. Re:And the moral of the Story is... by Anonymous Coward · · Score: 0

      Armchair guessing at its best. I would like to subscribe to your newsletter.

    21. Re:And the moral of the Story is... by CityZen · · Score: 1

      No, the moral is that having new hardware is worthless without good software. Just because someone writes some new code that uses the new hardware doesn't make that code any better than the polished code that runs on the old hardware. This applies to much more than just transcoding on PCs.

    22. Re:And the moral of the Story is... by Anonymous Coward · · Score: 0

      Video encoding isn't massively parallel (unless you do it stupidly or in corner cases), so there goes your whole argument.

    23. Re:And the moral of the Story is... by simonloach · · Score: 2

      A GPU is better suited to some kinds of massively parallel tasks, like video encoding. After all, you're applying various matrix transforms to an image, with a bunch of funky floating point math to whittle all that transformed data down to its most significant/perceptible bits. GPUs are supposed to be really really good at this sort of thing.

      And there's your problem.

      An h.264 encoder takes a frame of video and splits it up into 16x16 pixel macroblocks. Each macroblock is heavily dependent on those surrounding it (spatially and temporally). For an intra block, a prediction of the content of the current block is made using the decoded content of the top and left blocks. For inter blocks, a previous frame is used as a reference. The decoder has no idea what the original source file looked like, so any predictions made in the encoding process must be from the decoded frames. This leads to massive data dependencies in the encoder which cause a cascade of blocks that need to be encoded before the current block can be.
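
      The dependency cascade can be sketched with a simplified model. Assume, purely for illustration, that each macroblock waits only for its top and left neighbours; real encoders also depend on other neighbours and on reference frames, so actual parallelism is lower. Under that assumption, blocks on the same anti-diagonal are independent and form one "wave". The function name `wavefront_schedule` is hypothetical.

```python
def wavefront_schedule(rows, cols):
    """Group macroblocks into waves, assuming block (r, c) waits for (r-1, c) and (r, c-1)."""
    waves = [[] for _ in range(rows + cols - 1)]
    for r in range(rows):
        for c in range(cols):
            # Every block on anti-diagonal r + c has both dependencies in earlier waves.
            waves[r + c].append((r, c))
    return waves

# 1920x1080 padded to 1920x1088 gives 120 x 68 macroblocks of 16x16 pixels.
waves = wavefront_schedule(rows=68, cols=120)
print(len(waves))                  # 187 sequential steps per frame
print(max(len(w) for w in waves))  # at most 68 blocks ready at the same time
```

      Even in this optimistic model a 1080p frame never offers more than 68 independent blocks at once, far fewer than the thousands of threads a GPU needs to stay busy.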

      Many, many, people have come onto #x264dev and tried implementing GPU accelerated encoding, some of them with impressive backgrounds. All of them left once they realised how difficult this problem is.

    24. Re:And the moral of the Story is... by Anonymous Coward · · Score: 0

      If you grab an Intel Core Quad (3, almost 4 generations old) you can encode DVD quality at hundreds of frames per second.

    25. Re:And the moral of the Story is... by BLKMGK · · Score: 1

      Actually they seemed to have some ideas for functions that were bound and could be accelerated. However Intel contacted them having apparently already decided what instructions they were going to accelerate and they weren't useful. Additionally, as I recall, shortly after contacting the development team Intel sort of let on that these guys were somehow on board when in fact they were really only just being contacted and weren't. Things didn't go well after that, and I couldn't find any more contact on the mailing list between the devs and the guy who claimed to be from Intel.

      I do a HUGE amount of x264 encoding at home. I'd LOVE to be able to accelerate it and in fact did my own testing of the various GPU accelerators a while back. I didn't have as great an issue with the quality as I did the lack of flexibility and control over the encoders and thus gave up on it. In fact some of the GPU encoders would ONLY leverage the GPU and my CPU was nearly as fast! When both were used together the GPU encoders were certainly quicker but with so little flexibility I decided against using them. I had high hopes that the new generation i7 stuff with accelerated instructions would help but I guess not :-(

      --
      Build it, Drive it, Improve it! Hybridz.org
  3. Or just use an OpenCL-powered encoder... by carlhaagen · · Score: 4, Interesting

    ...since the results of OpenCL code are static across GPUs rather than being an arbitrary output.

    1. Re:Or just use an OpenCL-powered encoder... by Mia'cova · · Score: 4, Informative

      Only the more modern GPUs support it. And of those, there are still different levels of support. Even if it's supported, you would probably get much better perf on an nvidia card by using cuda for example. So in today's world, you can't just use an OpenCL-powered encoder, it depends on what hardware you have.

    2. Re:Or just use an OpenCL-powered encoder... by carlhaagen · · Score: 1

      I'm sorry, but you don't seem to have a clue as to what you are talking about. I don't understand why your response was upvoted, as it's highly fallacious - If a person is already sitting on a GPU modern enough to support the direct GPU-powered transcoding solutions offered by the manufacturer, then said GPU is also OpenCL-capable.

      ATI's GPUs have been OpenCL 1.0 (and up) capable for 4 years now. OpenCL 1.0 opened up already on the HD4xxx series released in 2008. Nvidia was a bit behind as they were still dabbling with CUDA - which has never really taken off - and didn't offer OpenCL capability until two years ago, in early 2010.

    3. Re:Or just use an OpenCL-powered encoder... by Anonymous Coward · · Score: 0

      My experience with rendering bitcoin is that openCL speed scales reasonably well with price in the last few generations of ati cards.

  4. The summary of the summary is by BitZtream · · Score: 0

    that Cyberlink's software is pretty damn shitty.

    I've done a little bit of playing around with GPU encoding myself and it's not real hard to turn out something faster than your CPU on the GPU with identical quality. Getting varied quality from different cards means you're doing something VERY wrong.

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    1. Re:The summary of the summary is by PCM2 · · Score: 1

      Getting varied quality from different cards means you're doing something VERY wrong.

      Maybe it means you're good at programming one GPU but you're not as good at programming the other. Or if another person did the code for the other GPU, maybe the other person doesn't code as well as you do.

      But if all these chips have different instruction sets and APIs, it sounds kinda like saying, "If your program runs slower on iOS than it does on Android, you're doing something very wrong." Maybe. The point is that things were supposed to be getting easier, but apparently they're not.

      --
      Breakfast served all day!
    2. Re:The summary of the summary is by Anonymous Coward · · Score: 0

      You're right about mediaspresso. It's absolute CRAP. I got it free with a powerdvd purchase and used it to try a few encodes for my transformer prime. It's slow and the output is terrible. This was on an 8 core Xeon system with a radeon HD card. I switched to Pavtube and got much better performance and results.

  5. Does anyone have editors anymore? by wbr1 · · Score: 2

    There is a screwed up graph on page two where they use the same graphic twice, and the caption describes aspects of the one that is missing. I really wanted to see the comparison too. You would think in an article of that size and scope someone would be responsible for checking layout as well as copy. It is no wonder we are losing to china. Their English may be worse, but their work ethic and attention to detail is possibly better.

    --
    Silence is a state of mime.
    1. Re:Does anyone have editors anymore? by Dputiger · · Score: 5, Informative

      As the author of the story, that's an error that slipped past in formatting. I'm uploading the proper graph right after I hit "Reply" on this.

    2. Re:Does anyone have editors anymore? by wbr1 · · Score: 1

      Much obliged then sir. You can also delete my comment on your story at your site then! I posted there as well, not knowing you were watching /. Maybe we do have a chance after all (as long as crusty old cynics like me don't depress everyone too much).

      --
      Silence is a state of mime.
    3. Re:Does anyone have editors anymore? by Anonymous Coward · · Score: 0

      There is a screwed up graph on page two where they use the same graphic twice, and the caption describes aspects of the one that is missing. I really wanted to see the comparison too. You would think in an article of that size and scope someone would be responsible for checking layout as well as copy. It is no wonder we are losing to china. Their English may be worse, but their work ethic and attention to detail is possibly better.

      I would have uploaded the correct graph, but I was too busy playing the role of a demanding consumer, decrying the decay of American civilization.

  6. GPUs will be great once we ... by kbrafford · · Score: 1

    I think that the real benefit of GPUs for transcoding will be seen once people start making new as-yet unimagined encoding schemes that are designed to do data parallel tasks that wouldn't even be considered on a traditional CPU.

    1. Re:GPUs will be great once we ... by PopeRatzo · · Score: 1

      I think that the real benefit of GPUs for transcoding will be seen once people start making new as-yet unimagined encoding schemes that are designed to do data parallel tasks that wouldn't even be considered on a traditional CPU.

      Maybe by then, "traditional" CPUs will be different from the ones we have right now.

      --
      You are welcome on my lawn.
    2. Re:GPUs will be great once we ... by Hentes · · Score: 1

      Encoding should be trivial to parallelize, you just cut up a movie to a sequence of n clips and encode them independently.

    3. Re:GPUs will be great once we ... by nabsltd · · Score: 1

      Encoding should be trivial to parallelize, you just cut up a movie to a sequence of n clips and encode them independently.

      Because the structure of modern codecs is based on Groups of Pictures (GOP), you'd have to run two passes on the video, with the first pass determining where the keyframes go. Although this is commonly done by people who don't have a good understanding of video encoding, the more efficient way is to just run a single pass using a constant quality (which is not the same as a constant quantizer). Then, on that single pass, you parallelize the operations on each frame. This also results in less disk thrashing and more hits on cached data.

      From what I have found, though, most of the computation time in an encode is used up by either filters before the encode (grain removal, etc.) or by increasing the range and quality of motion search, which do parallelize well, but I don't think a GPU would help much.

    4. Re:GPUs will be great once we ... by Anonymous Coward · · Score: 0

      Although this is commonly done by people who don't have a good understanding of video encoding, the more efficient way is to just run a single pass using a constant quality (which is not the same as a constant quantizer).

      Single pass has always produced garbage results for me unless you knock the bitrate WAY up, but why even bother recoding, then? 2-pass lets you determine your final filesize (IMO, the reason one recodes) and herd the bandwidth where it's needed.

    5. Re:GPUs will be great once we ... by ldobehardcore · · Score: 1

      What about slice-based parallel processing?
      Correct me if I'm wrong, (I wouldn't be surprised if it turns out I am...) but doesn't x264 have an option to do slice-based parallel processing? As I understand it, if there are 4 running threads, each frame is chopped into four quadrants with a little edge room buffer in each slice, then independently encoded, then glued back together at the other end? That's how I remember that option being described in some forum or other. Not the standard multi-threading, but the slice-based option.

      --
      Hectice, baby, Mercator says hello to you
    6. Re:GPUs will be great once we ... by pthisis · · Score: 1

      2-pass lets you determine your final filesize (IMO, the reason one recodes)

      That's a weird IMO. Certainly 2-pass is the best way to go if you care about exact filesize, but that's not what most people care about. They care about having video that will play back so they can watch and hear it, and the primary reason anyone I know recodes is to convert to a format that they can actually play (generally for ipad/smartphone or ps3 playback).

      --
      rage, rage against the dying of the light
    7. Re:GPUs will be great once we ... by Anonymous Coward · · Score: 0

      Well, yes, that is important also, but if your portable media player is capable of playing it, you're still not going to make a 50G blu ray rip for it. You're going to make it fit on the storage media you're using. Hence, controlling file size, using parameters that conform to the device's limits.

    8. Re:GPUs will be great once we ... by jkflying · · Score: 1

      The primary reason for not doing it this way is that it muddies your cache. Working on one slice at a time means you have many more cache hits.

      --
      Help I am stuck in a signature factory!
    9. Re:GPUs will be great once we ... by Anonymous Coward · · Score: 0

      There's a more concrete benefit: Imagine you're encoding for a certain bandwidth _or_ for a certain final size, while using VBR (as one typically does). 2-pass lets the encoder decide on a bandwidth/time allocation that puts more bits into the more demanding scenes, while still ending up at the targeted average rate.
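
      The allocation idea can be sketched as follows. This is a deliberately simplified model, not x264's actual rate control: it assumes equal-length scenes and a single hypothetical "complexity" score per scene gathered in the first pass, and it splits the fixed bit budget in proportion to that score.

```python
def allocate_bits(complexities, scene_seconds, target_kbps):
    """Split a fixed bit budget across equal-length scenes in proportion to complexity."""
    total_bits = target_kbps * 1000 * scene_seconds * len(complexities)
    total_complexity = sum(complexities)
    return [total_bits * c / total_complexity for c in complexities]

# Hypothetical first-pass scores for four 60-second scenes, 2000 kbit/s average target.
scores = [1, 3, 1, 5]
bits = allocate_bits(scores, scene_seconds=60, target_kbps=2000)
per_scene_kbps = [b / 60 / 1000 for b in bits]
print(per_scene_kbps)  # demanding scenes get more, the average stays at the target
```

      The point of the sketch is only that the total is pinned by the target while the distribution follows the content, which a single fixed-bitrate pass cannot do.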

    10. Re:GPUs will be great once we ... by Anonymous Coward · · Score: 0

      Now to make this efficient on the GPU you just have to find a way to fit the context for those 10000 clips you encode in parallel into the lowest-level cache of the GPU.
      You have a few kB for each frame basically.
      Oh, and you need to layout the memory accesses so that groups of 16 of those 10000 clips read and write mostly the same memory locations.
      Of course splitting the video into 10000 clips also means that depending on the overall length of the video your clips are less than 1 second, so you can't use scene change detection to place the keyframes intelligently or in general use a keyframe interval of more than about 20 frames.
      So you didn't even get beyond the concept stage and already it is clear the encoder would suck.
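
      The clip-length arithmetic is easy to check. The runtime and frame rate below are assumptions chosen for illustration (a two-hour film at 24 frames per second); the 10000 clips come from the comment above.

```python
def frames_per_clip(runtime_minutes, fps, clips):
    """Average number of frames each clip gets when a film is cut into equal clips."""
    total_frames = runtime_minutes * 60 * fps
    return total_frames / clips

per_clip = frames_per_clip(runtime_minutes=120, fps=24, clips=10_000)
print(per_clip)       # roughly 17 frames per clip
print(per_clip / 24)  # well under one second of video per clip
```

      With clips that short, every clip must start with its own keyframe, which is where the "keyframe interval of no more than about 20 frames" limit comes from.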

    11. Re:GPUs will be great once we ... by nabsltd · · Score: 1

      Single pass has always produced garbage results for me unless you knock the bitrate WAY up, but why even bother recoding, then?

      Two-pass x264 with an average bitrate just has the first pass compute the CRF value that will give you the target bitrate. After just a few encodes of similar material, all the rest of your encodes can use one-pass with the CRF set to near the value that was computed for the two-pass encodes.

      One of the things this does is that every movie gets exactly the bitrate it deserves, so easy to compress (less movement, less film grain, etc.) movies get a lot lower bitrate at the same CRF value. This is how I end up with some Blu-Ray rips at 2Mbps, despite using the same "high quality" settings on every encode. Some movies just don't need as much bitrate. The lowest bitrates come from clean traditional animation (since you only have 12 pictures per second, and easy to determine movement), but the latest Harry Potter movies are just so freaking dark that they also end up at really low average bitrates.

      Now, if you need to fit to absolute file size because of limitations like file system (4GB file size limit for FAT32) or device (CD-R, etc.), then two-pass might be the right way to go, but for something like a phone with relatively large storage compared to the file size once you re-size the source, you're still better off with a single-pass CRF encode, where 800x480/24p will take up between 500MB and 1GB per hour of movie for very good quality. If that's too big for your taste, change the CRF value to lower the quality. Once you find a CRF value that works for you, stick with it for encodes for the same target.
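
      The size figures translate to average bitrates with simple arithmetic. The sketch below assumes 1 MB = 10^6 bytes and treats the whole file (video plus audio) as one stream; the function names are hypothetical.

```python
def average_kbps(size_mb, duration_s):
    """Average total bitrate in kbit/s for a file of size_mb megabytes."""
    return size_mb * 8_000 / duration_s

def file_size_mb(kbps, duration_s):
    """File size in megabytes for a given average bitrate and duration."""
    return kbps * duration_s / 8_000

print(average_kbps(500, 3600))       # about 1111 kbit/s for 500MB per hour
print(average_kbps(1000, 3600))      # about 2222 kbit/s for 1GB per hour
print(file_size_mb(2000, 2 * 3600))  # 1800.0 MB for a two-hour film at 2Mbps
```

      So "500MB to 1GB per hour" corresponds to roughly 1.1 to 2.2Mbps on average, which is in line with the 2Mbps rips mentioned above.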

    12. Re:GPUs will be great once we ... by nabsltd · · Score: 1

      What about slice-based parallel processing?

      According to the wiki, you're still better off using normal multithreading even if you are using slices (as you must if you are encoding for Blu-Ray).

    13. Re:GPUs will be great once we ... by kbrafford · · Score: 1

      >Encoding should be trivial to parallelize, you just cut up a movie to
      >a sequence of n clips and encode them independently.

      That works great for multiple CPUs, but would be a disaster for GPU architectures.

  7. 9 Pages??? by Anonymous Coward · · Score: 0

    It sounded like an interesting read. However, I didn't get past the summary. Why would you split it into 9 pages?

    1. Re:9 Pages??? by Dputiger · · Score: 3, Informative

      As the author:

      Because 3000-word articles with PNGs at ~300K per large image and 100K per preview image aren't fun reading in a single go. There's ~1.5MB of imagery just on the third page . Pages 3-8 have about the same, and that's with the images only loaded as thumbnails.

      If you've got a fast net connection, you won't care. If you don't have a fast net connection, loading 16MB of images at once isn't a lot of fun.

      Visual quality comparisons are one area where you can't use low-quality JPGs. A 9-page article at ET is a real rarity, it's not something we do because we want to spam ads.

    2. Re:9 Pages??? by bill_mcgonigle · · Score: 1

      If you've got a fast net connection, you won't care. If you don't have a fast net connection, loading 16MB of images at once isn't a lot of fun.

      Speaking of which, can anybody recommend a software package that cleanly implements that "load images upon scrolling near the active viewport" that I see on some sites? It seems like a nice way to do things.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  8. Single Page Version of Article by Anonymous Coward · · Score: 5, Informative

    Here's a link to the article in 1 page.

  9. Wretched State of Reviews by Anonymous Coward · · Score: 0

    I'm confused as to how a review of transcoding applications that utilize GPUs and are user friendly doesn't include DVDFab??? DVDFab is user-friendly, supports CUDA, DXVA, Intel Quick Sync and Software (CPU) encoding which supports the CoreAVC codec. DVDFab is available for Windows and Mac OSX. Perhaps it wasn't selected because there isn't a Linux version...

    Using CUDA with DVDFab and 2-pass encoding, I get consistently excellent results and my high quality encoding time of a Blu-ray (for backup purposes) is between 90 and 120 minutes. 1-pass encoding is faster. These results have been consistent.

    1. Re:Wretched State of Reviews by Lunix+Nutcase · · Score: 1

      Because it was a review of the actual GPU encoders themselves, not various frontends to those GPU encoders.

    2. Re:Wretched State of Reviews by swalve · · Score: 1

      The problem is that the processor (GPU in this case) shouldn't make a difference as to the results of the calculations. Sure, a shittier GPU is going to have a shittier picture when forced to run at a certain framerate beyond its capabilities, but when used as a processor for a process that isn't time constrained, it should just take longer. Instead, the same input is giving different results depending on which brand of GPU it is run on.

    3. Re:Wretched State of Reviews by Dputiger · · Score: 1

      Simple reason: Because DVD Fab never came up. I Googled several variations on the term and asked Nvidia, Intel, and AMD for their own recommendations as far as products were concerned. Cyberlink and Arcsoft were recommended by multiple sources. Badaboom, I knew about and was familiar with. Xilisoft and MediaCoder were added as a result of additional research.

      I never came across DVD Fab. That's not a judgment on its quality or output.

    4. Re:Wretched State of Reviews by FullCircle · · Score: 1

      The review and summary are giving mixed signals then, as I had the same reaction to the article.

      If this is a review of the encoders and not the front ends, then why is Handbrake specifically pointed out for ease of use?

      Handbrake is only a front end to an encoder that can easily give similar or vastly worse results if you don't know how to use it.

      --
      If tyranny and oppression come to this land, it will be in the guise of fighting a foreign enemy. - James Madison
    5. Re:Wretched State of Reviews by Dputiger · · Score: 1

      That's a distinction that the average user doesn't make. At the end of the day, I don't care if the front-end secretly passes the video to a collection of manatees who perform FFT calculations using colored balls they pick out of a pit. The criterion was a piece of software with easy-to-use presets that produces decent-quality video after I push "Ok."

      If Program X does that, and Program Y doesn't, then Program X wins. The reason *why* is interesting and pertinent, but the question wasn't "Why do two different front-ends give different results using the same encoder?"

    6. Re:Wretched State of Reviews by FullCircle · · Score: 1

      Mine was a reply to "Because it was a review of the actual GPU encoders themselves, not various frontends to those GPU encoders." which you and I both seem to disagree with.

      That said, I do believe that there are better GPU assisted applications than those tested, such as DVDFab mentioned above.

      I'd be very interested to see how it compares using this methodology, but testing every available application could become a full time job.

      I have no affiliation with DVDFab, but it comes to mind as a decent encoder well before any of the ones tested.

      --
      If tyranny and oppression come to this land, it will be in the guise of fighting a foreign enemy. - James Madison
    7. Re:Wretched State of Reviews by jkflying · · Score: 1

      Just FYI, I had lots of problems on DVDFab using Intel-based GPU acceleration, such as temporal misalignment frames and lots of juddering; the video seemed to speed up and slow down around 2-3x per second. I ended up leaving DVDFab entirely and switching to Handbrake to take advantage of the queue features. DVDFab has some nice features for breaking the encryption on DVDs though, so may be worth keeping around for that now that Handbrake has removed support for it.

      --
      Help I am stuck in a signature factory!
  10. The real problem... by Sulik · · Score: 2

    The real problem is a lack of a common API for encoding regardless of GPU/CPU, which leads to vendor-specific implementations with varying degrees of quality. The most efficient way to pretty much do anything is a dedicated HW block (from both perf and power point of view), so there is no question that there is value in encoding using dedicated hardware, but the software has to catch up.

    --
    Help! I am a self-aware entity trapped in an abstract function!
  11. Explain to me how its the GPU's fault by catmistake · · Score: 1

    that encoders inexplicably insist on codecs and wrappers that predate the Millennium? The problem with transcoding is that it exists at all. Strongarm the holdout encoders into using h264 or mp4v with mp4 wrappers, and transcoding will be like... well, like anything no one does anymore.

    1. Re:Explain to me how its the GPU's fault by nabsltd · · Score: 1

      The problem with transcoding is that it exists at all. Strongarm the holdout encoders into using h264 or mp4v with mp4 wrappers, and transcoding will be like... well, like anything no one does anymore.

      There will always be transcoding, since you can't fit the 20GB H.264 stream from a Blu-Ray on a phone. And, why would you want to? Resize the 1920x1080 to 800x480 or so, and it will look great on every phone.

      For tablets or other devices with more resolution, you still don't need all the bits that most Blu-Ray encodes use. Most are essentially constant bit rate around 25-30Mbps. For movies that are essentially "talking heads" (courtroom dramas like A Few Good Men and Presumed Innocent are the best examples), most of those bits aren't actually adding anything to the picture quality. Even action movies can easily get by with 10Mbps average on a full-resolution transcode, as long as the action scenes get enough bits.

    2. Re:Explain to me how its the GPU's fault by catmistake · · Score: 1

      There will always be transcoding, since you can't fit the 20GB H.264 stream from a Blu-Ray on a phone.

      You are thinking about this all wrong. You think you own that movie... but you don't, you own a license. That license entitles you to transcode... if you want to go ahead and do work that, chances are, has already been done, and is constantly being done for you by others that create far better quality transcodes. The obtuse talk about how great their hardware is, and how fast they can rip their movies... but the astute keep all their movies backed up on the Internet in every format and resolution imaginable.

    3. Re:Explain to me how its the GPU's fault by jedidiah · · Score: 1

      > You are thinking about this all wrong. You think you own that movie... but you don't, you own a license.

      Your attempt to spread that pro-corporate propaganda simply won't work here. We know better.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    4. Re:Explain to me how its the GPU's fault by nabsltd · · Score: 1

      if you want to go ahead and do work that, chances are, has already been done, and is constantly being done for you by others that create far better quality transcodes.

      In general, there are two kinds of rips available on torrents: too large so that quality stays high, and really small to fit on phones or similar devices. It's very hard to find something in-between, where the encode is close to transparent, but as small as possible. Most rips still use two-pass average bitrate mode, which is basically "find the CRF value that gives me this bitrate", which is the wrong way to maintain quality. The right way is to pick a CRF value and let x264 figure out the bitrate needed for that quality.

      Although recent rips are getting the word on how to use x264 correctly, there's an awful lot out there that don't use the right "tune", don't set the correct colorspace, and generally tinker too much with settings that don't matter, while ignoring the ones that do. The forums at doom(9|10).org are a great wealth of information, much of it direct from the x264 developers.

      In addition, things like the green tint on "Fellowship of the Ring" can be removed if you do the transcode yourself. And, let's not talk about missing audio or subtitle tracks, glitches in the movie, or the fact that I own at least 10 movies on DVD and Blu-Ray that just aren't available anywhere on the Internet.

    5. Re:Explain to me how its the GPU's fault by Anonymous Coward · · Score: 0

      In general, there are two kinds of rips available on torrents: too large so that quality stays high, and really small to fit on phones or similar devices.

      This guy.

      He's a third kind... Insane 720p quality, tiny file sizes... fits on phones. I've been transcoding for at least 10 years (and everything but ffmpeg sucks). For the life of me, I can't figure out how YIFY does it... maybe if I had a huge 80" television I'd see a difference, but every time I compare the 4gb and 10gb rips to YIFY's 500MB rips (on 32-42" screens), I can't tell the difference. (please note: If you need or want the dickpull that surround sound is, then don't bother... YIFY is always in stereo.) Enjoy!

    6. Re:Explain to me how its the GPU's fault by nabsltd · · Score: 1

      I've been transcoding for at least 10 years (and everything but ffmpeg sucks).

      For H.264 encoding, ffmpeg just spawns x264, but you don't have the ability to control all x264 options through ffmpeg, so using x264 directly is generally better.

      For the life of me, I can't figure out how YIFY does it...

      Lots of pre-filtering to remove hard-to-encode detail. Even a relatively gentle temporal smoother can cut bitrates by half on very grainy material, and not affect quality much unless you freeze-frame.
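
      The effect of a temporal smoother can be sketched with a toy example. This is a plain recursive blend on made-up pixel values, not the filter any particular release group uses; real pre-filters are usually motion-compensated. The sum of frame-to-frame differences is used here as a crude stand-in for how much the encoder has to spend on grain.

```python
def temporal_smooth(frames, strength=0.5):
    """Blend each frame with the previous smoothed frame (simple recursive filter)."""
    smoothed = [frames[0]]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([strength * p + (1 - strength) * x for p, x in zip(prev, frame)])
    return smoothed

def frame_to_frame_change(frames):
    """Sum of absolute differences between consecutive frames."""
    return sum(abs(a - b) for f0, f1 in zip(frames, frames[1:]) for a, b in zip(f0, f1))

# Four tiny "frames" of a static scene with grain-like flicker (hypothetical values).
noisy = [[100, 102, 98], [103, 99, 101], [97, 101, 100], [102, 98, 99]]
print(frame_to_frame_change(noisy))                   # 27
print(frame_to_frame_change(temporal_smooth(noisy)))  # 10.375
```

      On a still frame the smoothed pixels stay within a couple of levels of the originals, which is why the loss is hard to spot unless you freeze-frame.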

  12. Intel QuickSync is the true winner by TPoise · · Score: 1

    So basically the article says GPU rendering is bad, but QuickSync is good enough for prime time.

    Duh. QS is made to do a very specific task (encoding/decoding video) and it can do it super fast at decent quality rates. There's always the tradeoff of quality vs. encoding time. With QS, I can rip an entire 50GB Blu-Ray in 12 minutes to a 1080p MKV @ 8000kbps. It takes about 16 hours doing the same task with a normal x264 encoder such as Handbrake even though the quality is a little bit better. Is it worth waiting around 16 hours for me? Nope.

    With enough bitrate, anything looks good. The key is to just bump up the bitrate in MediaCoder when using QuickSync for encoding to something very high.

    1. Re:Intel QuickSync is the true winner by nabsltd · · Score: 1

      With QS, I can rip an entire 50GB Blu-Ray in 12 minutes to a 1080p MKV @ 8000kbps. It takes about 16 hours doing the same task with a normal x264 encoder such as Handbrake even though the quality is a little bit better.

      Even using the "slower" preset on x264, 1080p encodes take about 3 times as long as the movie, so no more than 8 hours. This is on a slower CPU (since you have QuickSync) than you use, and the encodes end up at about 4Mbps.

      If I used a less-intensive preset, I would get encodes at about the same bitrate as yours, but taking just a little more than the running time of the movie to do it. QuickSync may be even faster, but 3 hours to encode most movies is good enough.

      With enough bitrate, anything looks good.

      In general, this is true, but very poor encoders can still screw up a high bitrate encode. That's why I use x264...it's going to give me the absolute best quality and lowest bitrate for the amount of encoding time. Since I only encode my movies one time, taking 6-8 hours to do it and getting bitrates as low as 2Mbps for 1080p with high picture quality is worth the extra time. It's also nice to be able to use the full power of AVISynth during the encode. My Blu-Ray rip of "A New Hope" doesn't have the useless "Jabba in the hangar" scene, and I'm working on getting Han shooting first.

    2. Re:Intel QuickSync is the true winner by Dputiger · · Score: 2, Informative

      No, the article says that GPU encoding software runs the gamut from outright awful to simply broken and limited. Quick Sync video is great in Arcsoft, terrible in Cyberlink, unsupported in Xilisoft, and looks decent in MediaCoder. Check the GTX 580's output in Xilisoft for plenty of proof that no, you don't need insane bitrates to create decent-looking output.

    3. Re:Intel QuickSync is the true winner by spire3661 · · Score: 2

      OK, so I have a Sandy Bridge K processor. What else do I need to make QS work?

      --
      Good-bye
    4. Re:Intel QuickSync is the true winner by Anonymous Coward · · Score: 0

      Unfortunately the GPU in Sandy Bridge is disabled if you have a discrete video card installed. The only way to use it is to install Lucid Virtu. With it you can use both the Sandy Bridge GPU and the discrete card. The only downside is that if your motherboard is not officially supported, you can use the software for only 30 days. Official site: http://www.lucidlogix.com/product-virtu-gpu.html

  13. Hardware transcoding, not GPU by mapuche · · Score: 1

    That's why video professionals and TV stations rely on hardware-based transcoding, and these solutions tend to be expensive. There should be many systems that encode H.264 video really fast, something like this: http://www.blackmagic-design.com/products/teranex/

    1. Re:Hardware transcoding, not GPU by nabsltd · · Score: 3, Interesting

      That's why video professionals and TV stations rely on hardware-based transcoding, and these solutions tend to be expensive.

      x264 can encode 1080p in realtime on modern Intel CPUs (Sandy Bridge, etc.) with pretty much as good a quality for the same bitrate as most hardware solutions. For non-HD, x264 just smokes hardware, as it can do better than realtime encodes at very high quality on those same CPUs.

    2. Re:Hardware transcoding, not GPU by jedidiah · · Score: 1

      Fascinating. The fact that hardware based transcoding is a disaster is why "professionals" use hardware based transcoding?

      That simply makes no sense.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    3. Re:Hardware transcoding, not GPU by mapuche · · Score: 1

      So you think doing transcoding using a GPU preceded dedicated encoding systems? or what are you talking about?

    4. Re:Hardware transcoding, not GPU by gl4ss · · Score: 1

      that it's just all hw.

      the point of the article however is that the state of hw encoding for consumers using their already bought equipment is a mess.

      kinda like video decoding using hw was a mess too for so many years and still is on some levels. (vlc being the equivalent of handbrake there).

      a company providing shit in both cases? cyberlink - the pinnacle of how to be the number one shipping sw solution for something and being in the bottom 5% of actual use!

      --
      world was created 5 seconds before this post as it is.
    5. Re:Hardware transcoding, not GPU by Anonymous Coward · · Score: 1

      My hardware solution is a 100Mbit broadband connection, and it handily beats any transcoder I've tried.

    6. Re:Hardware transcoding, not GPU by EdZ · · Score: 1

      Hardware-based transcoding is not necessarily better (see: the dire state of the BBC's terrestrial HD broadcasts compared to the earlier 'test' HD broadcasts, and their stonewalling whenever people call them out on it and explain how to improve it). Hardware transcoders are used because:
      1) They're guaranteed real-time, so you can pipe video through and just factor in a set time delay
      2) They're designed to be robust, so you don't need to worry about overheating, or the encoder choking on a certain bit of video

      If you're not working on real-time video, you're always going to get better results with x264 (and competence) than with a hardware box, or a hardware solution like QuickSync or the like. Hell, if you turn the quality settings down so the x264 output looks as bad as QuickSync, the two work at roughly the same speed anyway.

  14. Agreed- it's not meant to do everything by Anonymous Coward · · Score: 0

    after all, it is a GRAPHICS processing unit, and it's designed for a very specific sub section of computing.. known as GRAPHICS PROCESSING.

    if I wanted some jack-of-all-trades type computation- I'd use something a little more common like a CPU..

    Transcoding- hey! that's GRAPHICS PROCESSING isn't it? gosh- I hope my GPU can do me some of that!

  15. Re:Welp by gnasher719 · · Score: 3, Interesting

    Pity the Handbrake devs are dickwads.

    1. It's not funny.

    2. They make an excellent bit of software that I have been using for free for years. Unless you helped them out you can't complain.

    3. The guys creating Handbrake and the guys making video encoders are not the same people, so your rant is misdirected.

    4. I mailed them two suggestions for improvements, and both got implemented. Now this may be because my suggestions were the kind of things that were (a) genuine improvements and (b) interesting for the developer and therefore would have been implemented anyway, but in my experience they are responsive to the right kind of suggestions.

  16. How about DVDFab? by artor3 · · Score: 1

    I use DVDFab to rip DVDs using my GPU, and it positively flies. Most 2 hr movies take around 10 minutes to convert to H.264. It doesn't support VBR, but outside of that I've never had trouble with it. The resulting video quality is quite good as well (except with files that need deinterlacing, but that's always a problem). I think the person who wrote the article just didn't try the right programs.

    1. Re:How about DVDFab? by PhrostyMcByte · · Score: 1

      Have you tried x264 with --preset veryfast? My experience is that x264 is able to match a GPU encoder's speed while still giving significantly better quality. I'd only bother with a GPU encoder if I had a terrible CPU (netbook, phone?).
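
      For anyone who wants to try it, here is a rough Python sketch that just builds the command line; it doesn't run anything. --preset, --crf and -o are real x264 options, but the file names and the CRF value of 23 are placeholders I picked, and x264_command is a made-up helper.

```python
import shlex


def x264_command(source, output, preset="veryfast", crf=23):
    """Build an x264 CLI invocation as an argument list (nothing is executed)."""
    return ["x264", "--preset", preset, "--crf", str(crf), "-o", output, source]


if __name__ == "__main__":
    # Placeholder file names; swap in your own source and target.
    cmd = x264_command("input.y4m", "output.mkv")
    print(shlex.join(cmd))
```

      Time that against the GPU encoder on the same clip, then try a slower preset if you have headroom.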

  17. Re:Incompetent Author? by Dputiger · · Score: 5, Informative

    I set out to test presets. Specifically, I set out to test the presets of software packages which are sold on the purported *strength* of those presets. I say so in the first paragraph:

    " Our goal was to find a program that would encode at a reasonably high quality level (~1GB per hour was the target) and require a minimal level of expertise from the user."

    That's why MediaCoder results weren't included.

    The entire article came about because Cyberlink's iPhone 4S preset yielded files that were 1.4GB if I used CPU encoding or a GTX 580, and 188MB if I used Quick Sync. That disparity is what I noticed when I went to check encode quality for the initial IVB review.

    Can you build custom profiles in CME and create outputs that avoid these problems? You can -- though some options aren't available. That, however, is not the point. If I'm going to build my own custom profiles, I can download a copy of MediaCoder for free and do it with a more powerful piece of software that offers a huge number of options.

    I did a review of "Software that claims to automate the GPU encode process." I did not do a review of "Can Cyberlink MediaEspresso EVER create a decent image?" Given what I set out to evaluate, my ability to tweak profiles to achieve a satisfactory result is not a valid criterion for my conclusions.

  18. Also let's be clear by Sycraft-fu · · Score: 4, Informative

    That while the x264 guys aren't wrong to want to keep working on a software encoder that is tweakable, there is nothing wrong with a fixed function hardware encoder for some tasks. Sometimes, speed is what you want and "good enough" is, well, good enough.

    Like at work I edit instructional videos for our website (I work at a university) using Vegas. I use its internal H.264 encoders, which can be accelerated using the GPU. They are quite zippy, I can generally get a realtime or better encode, even when there is a decent amount of shit going on in the video that needs to be processed (remember that Vegas isn't for video conversion, I'm doing editing, effects, that kind of thing).

    Now the result is not up to x264 quality, per bit. I could get better quality by mucking around setting up an avisynth frameserver and having x264 do the encoding using some tweaked settings for high quality. However it would be much slower.

    Not worth it. I'll just encode a reasonably high bitrate video. It is getting fed to Youtube anyhow, so there's a limit to how good it is going to look. The faster hardware assisted encode speeds are worth it.

    If I was mastering a Blu-ray? Ya I might do the final encode to go off to fabrication with x264 (actually more likely an expensive commercial solution that can generate mastering compliant bitstreams). Spend the extra time to get it as quality as possible because of all the other work and because it could actually be noticeable.

    There is room for both approaches.

    1. Re:Also let's be clear by Anonymous Coward · · Score: 0

      Like at work I edit instructional videos for our website (I work at a university) using Vegas. I use its internal H.264 encoders, which can be accelerated using the GPU. They are quite zippy, I can generally get a realtime or better encode, even when there is a decent amount of shit going on in the video that needs to be processed (remember that Vegas isn't for video conversion, I'm doing editing, effects, that kind of thing).

      Now the result is not up to x264 quality, per bit. I could get better quality by mucking around setting up an avisynth frameserver and having x264 do the encoding using some tweaked settings for high quality. However it would be much slower.

      Are you sure about x264 being much slower? You should try running it with the superfast or maybe ultrafast presets and see how fast it goes. Depending on your CPU/GPU you might get better performance and quality.

      If I was mastering a Blu-ray? Ya I might do the final encode to go off to fabrication with x264 (actually more likely an expensive commercial solution that can generate mastering compliant bitstreams). Spend the extra time to get it as quality as possible because of all the other work and because it could actually be noticeable.

      x264 is actually able to generate Blu-ray compliant bitstreams and still beats the commercial encoders quality-wise.

    2. Re:Also let's be clear by Ranguvar · · Score: 1

      But that's exactly what we're saying -- x264 can match the speed of GPU encoders or QuickSync, and still provide better quality.
      Or, it can provide the same quality, faster.

      Sure, GPU/QuickSync encoders will have a niche once they can be faster than x264. But they're still not. They have no niche at the moment.
      I'm not a fanboy of x264, it's simple fact.

    3. Re:Also let's be clear by Sycraft-fu · · Score: 1

      I take issue with your fact. I say you have not tested it against Sony's encoder built in to Vegas (which isn't QuickSync, it is a hybrid CPU, CUDA, and OpenCL encoder). If you want me to believe that x264 is faster and better you need to show me a test that shows:

      1) Settings with x264 that produces a render rate equal to or greater than the rate that the Sony AVC encoder does.

      2) That the quality is superior, in picture and video comparisons.

      3) That the resulting file size is roughly the same (within 10% say).

      4) That the difference is enough to make it worth my while to fuck with an external render chain through avisynth and not just use the built in workflow.

      In particular I'm interested in 1920x1080@30p output at about 20Mbps. That's what I use for Youtube uploads.

      I'm not interested in hypotheticals or tests of things you like. I'm interested in when I hit "render" how long it takes to get a result.

      I think maybe you are confusing consumer GPU transcoding, where someone is fiddling with reducing the video size for a mobile or something, with actual output from an NLE.
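
      To make the bar concrete, the checks above could be written down roughly like this. It's only a sketch: EncodeResult and beats are names invented here, "quality" stands for whatever higher-is-better score you trust (a metric or a viewing panel), and min_quality_gain is my stand-in for the subjective "worth my while" test in 4). The numbers in the example are made up, not measurements.

```python
from dataclasses import dataclass


@dataclass
class EncodeResult:
    fps: float       # render rate in frames per second
    size_mb: float   # output file size in MB
    quality: float   # any score where higher means better


def beats(candidate, baseline, size_tolerance=0.10, min_quality_gain=0.0):
    """True if candidate is at least as fast, better looking, and about the same size."""
    fast_enough = candidate.fps >= baseline.fps
    better_quality = candidate.quality - baseline.quality > min_quality_gain
    similar_size = abs(candidate.size_mb - baseline.size_mb) <= size_tolerance * baseline.size_mb
    return fast_enough and better_quality and similar_size


if __name__ == "__main__":
    # Hypothetical figures purely to exercise the function.
    builtin = EncodeResult(fps=30.0, size_mb=1500.0, quality=0.95)
    external = EncodeResult(fps=32.0, size_mb=1560.0, quality=0.97)
    print(beats(external, builtin))
```

      Raise min_quality_gain to whatever difference would justify the extra hassle of an external render chain.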

  19. Please see real transcoders by TheSync · · Score: 2

    Please see Elemental Technologies GPU-accelerated H.264 transcodes.

    1. Re:Please see real transcoders by MartinSchou · · Score: 1

      Considering the article ruled out some software, because it was considered too difficult to use, I suspect Elemental Technologies' software would be ruled out due to cost.

  20. Handbrake ? by Anonymous Coward · · Score: 0

    Am I the only one to find this software the most unintuitive tool ever created ?

    1. Re:Handbrake ? by chrispitude · · Score: 1

      I found it pretty terrible too. After uninstalling it, I came across Avidemux which is much easier (for me) to use. I've been using that since.

      http://fixounet.free.fr/avidemux/

  21. Re:Agreed- it's not meant to do everything by Anonymous Coward · · Score: 0

    Except that you want it to do compression. As those encoders prove, it does the graphics just fine, you just don't get any compression out of it. And it is not called a "compression" processing unit.
    Compression is about the worst thing you can do on a GPU since for really good results it ends up with massive data dependencies and is very difficult to parallelize.
    The only thing worse is decompression (well, at least the lossless part), which is why it is handled by special purpose hardware and not at all the graphics part of the GPU.

  22. What do you expect? by Anonymous Coward · · Score: 0

    When you have a bunch of morons who cannot make a consistent naming scheme for their graphics cards, do you really think the qualities will be good either?
    I'd be surprised if the damn things even had the same transistor count (same version) with a design that makes most of the processor parallel that isn't required to be done sequentially for the sake of saving money on what would normally be low yields.

    I seriously had to use features to find my card in a list 3 days ago because there were 4 cards with the same codenames and titles. Fuck ATI.

  23. Re: by kurkosdr · · Score: 1

    Use OpenCL and not the H.264-specific APIs the vendor provides? Yes, GPU vendors cheat, I've seen pictures. Now, how about x264 supporting OpenCL?

  24. Poor software, not poor GPUs by Arakageeta · · Score: 1

    Surprise, surprise, I have the feeling that most of you haven't actually read the article. The article is not arguing that GPUs are inherently flawed. Also, the article is not an NVIDIA-vs-AMD competition. Rather, the author tests software on each platform. It's the software that is bad, not the GPUs themselves. For instance, the NVIDIA GPU does quite well with Arcsoft and Xilisoft; this wouldn't be possible if GPUs were somehow broken for transcoding. After all, as others have pointed out here, floating point support is actually quite good on modern GPUs.

    Still, poor software shouldn't come as too much of a surprise. While CUDA and OpenCL certainly make GPU-based computing easier, it is still a relatively new technology that only a few programmers know how to use efficiently. I'm also not sure that the market pressure is there yet from consumers for efficient GPU-based applications (how many of them actually know what a GPU is?).

  25. Whoa there. You're plainly wrong. by Arakageeta · · Score: 1

    CUDA was released, supported by NVIDIA GPUs, in early 2007. The first OpenCL specification was not released until late 2008 (OpenCL has not been around for 4 years, as you claim). As for which is more popular, I'm afraid that you have this backwards too. The dominant market force for GPU computing is supercomputing. How many of the top 5 supercomputers use AMD GPUs? Zero. How many use NVIDIA GPUs? Three. And they're all using CUDA because it's more feature rich---it can do fancy things like direct memory copies between InfiniBand interconnects and GPU memory.

    FYI: OpenCL on NVIDIA is implemented on top of CUDA, so you're still using CUDA if you're using OpenCL on NVIDIA.

  26. Why not an FPGA or CPLD? by Anonymous Coward · · Score: 0

    Why not use gate arrays and logic devices?

    http://www.altera.com/literature/wp/wp-brdcst0306.pdf
    http://www.xilinx.com/support/documentation/topicaudiovideoimageprocess.htm

    1. Re:Why not an FPGA or CPLD? by Dputiger · · Score: 1

      Sure! Send me one and I'll test it. :)

  27. DVDFab and NVidia GTX 560 - Finally! by Petersko · · Score: 1

    After waiting and trying and waiting and trying and waiting and trying... finally conversion to 6GB mkv with full DTS works reliably. I converted my library of 600+ blu rays over the last few weeks.

    Using the GPU I get about 70fps, and I've watched about 15 of the movies without noticing any problems at all.
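
    To put 70fps in perspective, here is a quick back-of-the-envelope Python sketch. It assumes film-rate source material at 24000/1001 (about 23.976) fps and a two-hour movie; neither figure is from my logs, and the function names are just for illustration.

```python
FILM_FPS = 24000 / 1001  # about 23.976 fps, assumed source frame rate


def realtime_factor(encode_fps, source_fps=FILM_FPS):
    """How many times faster than playback the encode runs."""
    return encode_fps / source_fps


def encode_minutes(movie_minutes, encode_fps, source_fps=FILM_FPS):
    """Wall-clock minutes needed to encode a movie of the given length."""
    return movie_minutes / realtime_factor(encode_fps, source_fps)


if __name__ == "__main__":
    print(f"{realtime_factor(70):.2f}x realtime")
    print(f"{encode_minutes(120, 70):.0f} minutes for a two-hour movie")
```

    Under those assumptions that is a bit under 3x realtime, or roughly 41 minutes per two-hour movie.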

    I flat out gave up with trying to support my fricking PS3.

  28. Yeesh - not 600! by Petersko · · Score: 1

    I do not own 600 blu rays. That was supposed to be 200+.

  29. assume 200W power consumption by Chirs · · Score: 1

    Around here running that 24x7 would cost ~$200 a year. You'd need to run it for several years to pay for the cost of a new system.
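
    The arithmetic, as a small Python sketch. The 200W figure is the assumption from the subject line; I'm reading the ~$200 as a yearly cost, and the electricity price is backed out from that rather than taken from a bill. Function names are made up.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_kwh(watts):
    """Energy used in a year by a constant load, in kWh."""
    return watts / 1000 * HOURS_PER_YEAR


def annual_cost(watts, price_per_kwh):
    """Yearly running cost for a constant load at a flat price per kWh."""
    return annual_kwh(watts) * price_per_kwh


if __name__ == "__main__":
    kwh = annual_kwh(200)
    print(f"{kwh:.0f} kWh per year")
    # Price implied by a ~$200 yearly cost (an inference, not a quoted rate)
    print(f"${200 / kwh:.3f} per kWh")
```

    That works out to about 1752 kWh a year, so ~$200 implies a rate of a little over 11 cents per kWh.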

  30. Re:Whoa there. You're plainly wrong. by carlhaagen · · Score: 1

    I'm not sure which year you are discussing in, but the situation of the article refers to how the available options stand TODAY, not as they stood in 2008 or 2007, when not even direct GPU transcoding was available in a functional form. If you have a 4-year-old HD4xxx series GPU, you can run OpenCL 1.0/1.1 software on it. Period. I don't see the point of you mentioning supercomputer clusters running CUDA in this discussion. Are these clusters available to us for transcoding video on our GPUs? Not likely. Take a look at how much OpenCL software is available compared to how much CUDA software is available, and you will see which "camp" is the popular one. Hint: it's not CUDA.

  31. Really depends by LostMyBeaver · · Score: 1

    CABAC doesn't scale well in massively threaded environments, that is true. However, there are ways to avoid the issues involved, and this really isn't the issue either. It's not the CABAC so much as the bit stream writing for the most part. CABAC scales fine if you parallelize it across slices. Of course no modern encoders make use of multiple slices per field/frame, so it's more a question of whether latency is an issue. You can run parallel CABAC encoders by buffering frames.
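
    To illustrate the slice idea: the entropy coder's state is reset at the start of every slice, so slices can be coded independently. A toy Python sketch, with heavy caveats: encode_slice is a stand-in that only labels its row range, cutting slices on whole macroblock rows is a simplification, and the helper names are invented. The one hard number is that a 1080p frame is coded as 1088 lines, i.e. 68 rows of 16x16 macroblocks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(mb_rows, slices):
    """Split macroblock rows into contiguous ranges of near-equal size."""
    base, extra = divmod(mb_rows, slices)
    ranges, start = [], 0
    for i in range(slices):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def encode_slice(row_range):
    """Stand-in for entropy-coding one slice; shares no state with other slices."""
    start, end = row_range
    return f"slice[{start}:{end}]"


def encode_frame(mb_rows, slices):
    """Code every slice of one frame in parallel; results stay in slice order."""
    with ThreadPoolExecutor(max_workers=slices) as pool:
        return list(pool.map(encode_slice, split_rows(mb_rows, slices)))


if __name__ == "__main__":
    print(encode_frame(68, 4))  # 68 macroblock rows in a 1080p frame
```

    The price is compression efficiency, since each slice starts over with fresh coder state and can't predict across its boundary.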

    The real problem, especially when dealing with an NVidia vs. ATI issue, is that while floating point performance on these two GPUs rocks, the NVidia chips have piss poor support for shift/rotate and other bit-level operations on internal registers, which makes reading and writing bit streams utterly painful at best. The CABAC code obviously takes a pretty severe hit from this. A solution to this problem is a single shared table across parallel threads for all 8 bit position states. Though, this will likely still suck since there will be huge numbers of mutexes on the table for the lookup and the table is just too large to duplicate for each core. But on the NVidia, binary manipulation operations seriously are lacking where ATI has had those sorted out for a while. This is also why doing hash brute force cracking on an NVidia appears much slower than on an ATI.
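
    For reference, this is the kind of operation I mean. Hashes such as MD5 and SHA-1 do 32-bit rotates in every round; where the hardware has no single rotate instruction, each one has to be built from two shifts and an OR, roughly as below. Plain Python for illustration only, not GPU code, and rotl32 is just a name I picked.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits using two shifts and an OR."""
    x &= MASK32
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


if __name__ == "__main__":
    print(hex(rotl32(0x12345678, 8)))
```

    Multiply those extra instructions by every round of every hash candidate and the gap adds up.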

    I personally use NVidia for games and ATI for computing.

  32. Re:Welp by Anonymous Coward · · Score: 0

    Funny/relevant or not, the AC's complaint is largely accurate, in my opinion. For a support forum, the Handbrake forums are an incredibly hostile environment, with the devs often being the worst offenders. Yes, they've made a great piece of software, but I don't see why that excuses their rude behavior. I can't imagine the devs interact with people in the real world like that, and I don't see why they should interact with people like that on the Internet.