Which Open Source Video Apps Use SMP Effectively?
ydrol writes "After building my new Core 2 Quad Q6600 PC, I was ready to unleash video conversion activity the likes of which I had not seen before. However, I was disappointed to discover that a lot of the conversion tools either don't use SMP at all, or don't balance the workload evenly across processors, or require ugly hacks to use SMP (e.g. invoking distributed encoding options). I get the impression that open source projects are a bit slow on the uptake here? Which open source video conversion apps take full native advantage of SMP? (And before you ask, no, I don't want to pick up the code and add SMP support myself, thanks.)"
Use the -threads switch.
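For anyone unsure where that switch goes, here is a minimal sketch that builds an ffmpeg command line with -threads set. The helper name build_ffmpeg_cmd and the file names are made up for illustration, and how well the thread count is honored depends on the codec and the ffmpeg build.

```python
import os

def build_ffmpeg_cmd(src, dst, threads=None):
    """Return an ffmpeg argument list that requests `threads` worker threads.

    build_ffmpeg_cmd is a hypothetical helper; only the -i and -threads
    options are taken from ffmpeg itself.
    """
    if threads is None:
        # Default to the number of logical cores the OS reports.
        threads = os.cpu_count() or 1
    return ["ffmpeg", "-i", src, "-threads", str(threads), dst]

if __name__ == "__main__":
    # Prints the command instead of running it, so no ffmpeg install is needed.
    print(" ".join(build_ffmpeg_cmd("input.avi", "output.mp4", threads=4)))
```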
Interested in open source engine management for your Subaru?
transcode uses separate processes for everything.
My blog
x264 uses slices and scales pretty well across multiple cores. I use it on Windows via megui, but you could easily use it in Linux as well. You could use mencoder to pipe out raw video to a fifo and use x264 to do the actual conversion, for instance.
...makes excellent use of multiple cores. It is however Mac-only. Interestingly, what it does is split a file into chunks and spawns multiple ffmpeg processes to do the conversion. Which is to say, perhaps you can do some (relatively simple) scripting with ffmpeg that will do the job.
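A rough sketch of that kind of scripting, assuming the total duration is already known: split the running time into equal chunks and start one ffmpeg process per chunk. The function names and output file names are invented for the example, this is not how VisualHub itself is implemented, and joining the encoded pieces afterwards is left out.

```python
import subprocess

def chunk_ranges(duration_s, n_chunks):
    """Split [0, duration_s) into n_chunks equal (start, length) pairs in seconds."""
    base = duration_s / n_chunks
    return [(i * base, base) for i in range(n_chunks)]

def chunk_commands(src, duration_s, n_chunks):
    """Build one ffmpeg command per chunk, using -ss (seek) and -t (duration)."""
    cmds = []
    for i, (start, length) in enumerate(chunk_ranges(duration_s, n_chunks)):
        out = f"chunk_{i:02d}.mp4"  # hypothetical output naming scheme
        cmds.append(["ffmpeg", "-ss", str(start), "-i", src, "-t", str(length), out])
    return cmds

def run_all(cmds):
    """Start every command at once, then wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    return [p.wait() for p in procs]
```

Calling run_all(chunk_commands("movie.avi", 6000, 4)) would start four encoders at once. Chunk boundaries may not be frame-accurate, depending on the source and the ffmpeg version.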
The secret to creativity is knowing how to hide your sources. - Albert Einstein
x264 and AviSynth can make pretty decent use of threads. Check out meGUI.
x264 via meGUI from Doom9 is what I use to compress HD-DVD and BD movies - also on a quad core. I have some tutorials posted out and about on how I'm doing it. Near as I can tell you cannot dupe the process on Linux due to the crypto - Slysoft's AnyDVD-HD is needed.
Playback - I use XBMC for Linux. It is also SMP enabled using the ffmpeg cabac patch. The developers of this project have been VERY aggressive about taking cutting-edge improvements to the likes of ffmpeg and incorporating them into the code. Since Linux has no video acceleration of H.264, SMP really helps on high bitrate video!
Build it, Drive it, Improve it! Hybridz.org
I'm still not sure where this idea that "multi-threaded programming is hard" comes from. It's not. It seems that most people are just afraid of it because they're not familiar with it.
Or perhaps I just overestimate the mental capacity of most programmers? Having looked at a lot of code, there may be merit to that theory.
Cyrano de Maniac
don't balance the workload evenly across processors
Why is balancing the load evenly important, as long as one thread is not bottlenecking the others? Loading a particular core or set of cores might even be beneficial depending on the cache implementation, especially when other applications are also contending for CPU time.
Sure, a nice even load distribution might be an indicator for good design, but it doesn't have to apply in every case. I don't think software should be designed so you can be pleased with the aesthetics of the charts in task manager.
Since his suggestion was to do some scripting that does essentially what VisualHub does using ffmpeg I'm not sure I see how he missed the Open Source requirement.
Handbrake has always used both of the cores on my system for transcoding.
OP is asking for open source tools. You cited a commercial one that doesn't provide source.
VisualHub (the front-end app) may be closed, but ffmpeg is LGPL.
And the GP was suggesting using ffmpeg, not VisualHub.
How can I believe you when you tell me what I don't want to hear?
The problem with MPEG encoding and decoding is that the data itself is not well suited to multi-threaded analysis.
Multi-threading is most efficient when it is applied to discrete data sets that have little or no dependency on each other.
For example, suppose I have a table with four columns -- three holding input values (A, B, and C) and one holding an output value (X). If the data in a given row of the table has nothing to do with the data in any other row, multi-threading works efficiently, because none of the threads are waiting for data from any of the other threads. If I want to process multiple rows at once, I simply spawn additional threads.
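The independent-rows case can be sketched like this. The formula for X is a placeholder, since the comment does not say how X is derived from A, B, and C, and the names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_x(row):
    """Compute the output X for one row from its inputs A, B, C.

    The formula is a placeholder; any function of a single row would do.
    """
    a, b, c = row
    return a * b + c

def process_table(rows, workers=4):
    """Process every row independently; no row waits on another row's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though rows run concurrently.
        return list(pool.map(compute_x, rows))

if __name__ == "__main__":
    print(process_table([(1, 2, 3), (4, 5, 6)]))
```

In CPython, threads will not speed up pure CPU-bound work because of the global interpreter lock; ProcessPoolExecutor has the same map() interface and would be the usual swap there.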
On the other hand, for data such as MPEG video, the composition of the next frame is equal to the composition of the current frame, plus some delta transformation - the changed pixels.
This introduces a dependency which precludes efficient multi-threaded processing, because each succeeding frame depends on the output of the calculations used to generate the prior frame. Even if more than one core is dedicated to processing the video stream, one core would wind up waiting on another, because the output from the first core would be used as the input to the second.
Running multiple instances of the same code concurrently in multiple threads is simple. Even running mutually exclusive parts of the same code concurrently in separate threads is easy. Converting complex serial algorithms to effectively utilize multiple cores is generally not simple. And writing code that can scale and balance across n cores/threads is extremely hard. There are all sorts of synchronization issues to deal with, scheduling issues, data transport issues, etc., and it becomes increasingly hard to debug code the more cores/threads you throw in. I think the stigma is justified.
Huh? I am using AGK and my CPU never does anything. It is always waiting for I/O. I must be doing something wrong...
Amen
If you truly understand the problem domain you are operating in, parallelism becomes readily apparent. Implementing it isn't difficult even on old code, again, if you truly understand where the parallelism exists.
And told him how it uses an open source program in an easily-replicatable way.
Actually, the MPEG stream resets itself every n frames or so (n is often a number like 8, but can vary depending on the video content). These are called keyframes (I frames), and the delta frames (called P and B frames) are generated against them. Because of this, it is really easy to apply parallel processing to video encoding.
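To make the keyframe point concrete, here is a small sketch that groups a sequence of frame types into runs that each begin at a keyframe, written as 'I' in standard MPEG terminology. Each run could in principle be handed to a different core. The string representation is an assumption for the example; a real stream would be parsed from the container.

```python
def split_into_gops(frame_types):
    """Group a frame-type string such as 'IPPBIPP' into runs starting at each 'I'.

    Frames inside one run depend only on frames in that same run (a
    simplification), so different runs can be processed independently.
    """
    gops = []
    for t in frame_types:
        if t == "I" or not gops:
            gops.append(t)   # a keyframe opens a new group
        else:
            gops[-1] += t    # delta frames join the current group
    return gops

if __name__ == "__main__":
    print(split_into_gops("IPPBIPPIP"))
```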
Is there anything out there that can play a high-bitrate obese .mkv Blu-ray backup rip efficiently on 2 or 4 cores?
The core transform in MPEG is the DCT (discrete cosine transform). If this is parallelizable, then MPEG encoding/decoding should be, although there is no way a general-purpose processor can beat an ASIC in silicon.
You don't say if you're running on Windows or Linux or something else. If you are running on Windows, the latest versions of VirtualDub have made big improvements to SMT/SMP encoding.
VirtualDub home
VirtualDub 1.8.1 announcement
VirtualDub downloads
Make sure you grab 1.8.3 - 1.8.1 was pretty good, but had a few teething problems. 1.8.2 has a major regression which is fixed in 1.8.3. The comments in the 1.8.1 announcement contain a few important tips for using the new features (some of which I posted BTW).
The two major new features that would be of interest to you are:
1. You can run all VirtualDub processing in one thread, and the codec in another. This works very well in conjunction with a multi-threaded codec - this one change improved my CPU utilisation from approx 75% to 95% on my dual-core machines - with an equivalent increase in encoding performance.
2. VD now has simple support for distributed encoding. You can use a shared queue across either multiple instances of VD on a single machine, or across multiple machines (must use UNC paths for multiple machines). Each instance of VD will pick the next job in the queue when it finishes its current job. Instances can be started in slave mode (in which case they will automatically start processing the queue).
I use 3 machines for encoding (all dual-core). With VD 1.8.x I start VD on two of the machines in slave mode, and one in master mode. I add jobs to the queue on the master instance, and the other two instances immediately pick up the new jobs and start encoding. When I've added all the jobs, I then start the master instance working on the job queue.
To achieve a similar effect on your quad-core, start two instances of VD on the same machine - one slave, the other master.
It's not perfect (if you've only got one job, you won't use your maximum capacity) but it has greatly simplified my transcoding tasks, and reduced the time to transcode large numbers of files.
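The shared-queue idea (each instance picks the next job when it finishes its current one) can be sketched in a few lines. This is a generic illustration, not VirtualDub's actual mechanism; run_job_queue and the encode callable are hypothetical stand-ins.

```python
import queue
import threading

def run_job_queue(jobs, n_workers, encode):
    """Let n_workers pull jobs from one shared queue until it is empty.

    Returns {job: (worker_name, result)} so you can see who encoded what.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = pending.get_nowait()  # pick the next job, if any
            except queue.Empty:
                return                      # queue drained, this worker stops
            out = encode(job)
            with lock:
                results[job] = (name, out)

    threads = [threading.Thread(target=worker, args=(f"instance-{i}",))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

As with the real thing, a single job keeps only one worker busy; the gain shows up when the queue holds at least as many jobs as there are workers.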
I've noticed a lot of talk about commandline options, but not the nice guis that use them. Avidemux is open source, cross-platform, gives you a decent interface, and uses multithreaded libraries like ffmpeg and x264 on the backend to do the encoding, so it generally makes optimal use of your multicore system.
-- sudo.ca
Multi core no....
But unix apps have been running on multi-processor systems for years, and geeks have had access to such systems for years too. I did video encoding in 2000 on a quad CPU AlphaServer and a dual CPU SPARC, but I just did as someone else suggested and ran multiple encodes simultaneously.
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
As posted elsewhere, it is difficult to divide a project up that is really pretty linear. Instead, you should try to do more jobs at once. Encode four videos at once.
How the hell is this modded interesting (as opposed to informative)?
Do people really not know this stuff (thus making it interesting to them)?
For the gp and the others who still don't get it.
Multi-threaded programming (getting your shit to run in separate threads) is easy, now.
Multi-threaded / distributed algorithms (getting your shit to do some coherent, useful shit while scaling well) are not easy at all.
If you do a lot of H.264 conversion, look into picking up a hardware encoder. There's the Turbo.264; it's Mac-only, but I'm fairly sure it's a rebranded PC device. Plug into a USB port, and it speeds up H.264 encoding -- even on single-core systems. Imagine that with your quad-core. It's not a free solution, but if you find yourself doing a *lot* of encodes, it may be worth your money.
Yep.. you understood your problem domain, and easily recognized where parallelism existed. Then you stated your solution like a practical intelligent person, not like some moron trying to claim that everything is always simple because he is so damn smart that he transcodes all his videos using a neural interface to his own brain while he sleeps. It was simple you know, because brains are massively parallel, and can kick the shit out of your PC when it comes to overall processing power.
It's refreshing to see that, rather than having us all answer questions and think about it, only to THEN find out he doesn't want to do any work.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Exactly. Too many people assume that any given programmer can write any given program. What isn't generally realized (at least by the masses) is that programming really is about acquiring expertise in a particular domain and then solving problems in that domain through the use of computer programs. Generally some of the most effective programs I've seen have been written, on their first pass, by a person with intimate domain knowledge, and mediocre programming/computer knowledge. The program then becomes a standout when someone with intense programming and computer architecture knowledge improves the code from there (they need not be a subject domain expert, but it helps).
I do take issue with sexconker assuming that I "just don't get it". Heh. If s/he only knew. Whatever, no biggie. I do agree that distributed algorithms are generally more difficult to implement/design than non-distributed, but that's not exactly the same thing as serial versus parallel algorithms (non-distributed generally involves access to data through a common address space, distributed doesn't, though even those pseudo-definitions come up a bit short).
Again and again I read in industry rags and on various web sites that multi-threaded programming is hard, and nobody knows how to do it, and that it's difficult to debug, and all that. I believe what they're really saying is "The set of programmers who are accustomed to multi-threaded programming/debugging is (relatively) small, and thus applications aren't going to make good use of the shift to multicore CPU packages." Familiarity with a skill, and the supply of labor familiar with said skill, is distinct from it being easy or hard.
Anyway, I stand by my belief that parallel programming is not as difficult as most people are led to believe. Some problems don't lend themselves well to parallel solutions, or don't merit the added complexity, but many many of them do. In ten years time I predict that most computer programming education will assume the use of threading, and that anyone who isn't competent with threading will severely limit their own job prospects.
Cyrano de Maniac
And FWIW I have contributed patches in the past to both the avidemux AND nzbget projects, and they have been accepted, but these have addressed more trivial aspects of the software.
As can be demonstrated in Windows XP. In theory, you should be able to run two tasks at once, right?
So open up Notepad and set that process to 'Realtime' and watch as one core will max out and the other core is completely idle while Windows becomes nearly completely unresponsive ( even if you set Notepad to the second core ).
At least, this is what it did when I tried it, naturally YMMV.
The disappearing pencil trick. Let me show you it.
As other commenters have said, decoding video is not, per se, a trivially parallelized algorithm. Especially for modern codecs with lots of temporal encoding. MJPEG would be easily parallelized, but you'd have to be dealing with fairly ancient sources...MediaComposer 1 for instance.
However, there are different classes of "video app" that are good targets for parallelization. Real world video editing for instance: consider multiple streams of video with overlays, rotations, effects etc. Video and audio decoding can happen in parallel, you can pipeline the effects stages so that each effect is handed off to another core. Modern video editing systems do this with aplomb.
I'm from the commercial end of this so, I can't comment much on open source alternatives. But I will say that a lot of the algorithms in certain products are highly tuned to the particular CPU type.
And they're smart enough to distribute work across only as many cores as actually exist.
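Capping the worker count at the number of cores is a one-liner in most languages. A minimal sketch, with worker_count as a made-up helper name:

```python
import os

def worker_count(requested):
    """Never start more workers than there are cores, and never fewer than one."""
    cores = os.cpu_count() or 1  # cpu_count() can return None on odd platforms
    return max(1, min(requested, cores))

if __name__ == "__main__":
    print(worker_count(64))
```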
Finally. Don't forget that optimization is hard. You have to consider the speed of the hard drive, the cost of sharing data between threads and cpu caches and a bunch of other real constraints. Any half decent cpu of the last five years or so can easily decode most video faster than it can be read and written to disk. So long as this is true, you won't get any benefit from parallelization.
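The disk-bound argument is simple arithmetic: end-to-end throughput is capped by the slower of the disk and the CPU. The numbers below are invented purely to show the shape of the argument and are not measurements.

```python
def pipeline_throughput(disk_mb_s, cpu_mb_s_per_core, cores):
    """Overall rate is limited by whichever stage is slower."""
    return min(disk_mb_s, cpu_mb_s_per_core * cores)

if __name__ == "__main__":
    # Illustrative figures only: a 60 MB/s disk and 80 MB/s of decode per core.
    for cores in (1, 2, 4):
        print(cores, pipeline_throughput(60, 80, cores))
```

With those made-up figures the result is 60 MB/s for 1, 2, and 4 cores alike, which is the point being made: once one core outruns the disk, extra cores add nothing.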
Any video re-encode is gonna involve a bunch of steps. Run 'em together on different cores. There ya go.
Once you actually learn something about this black art (somewhat of one in Linux, anyway), you'll find you can use all your cores no problem.
The version of Cinelerra from heroinewarrior.com uses SMP. It's highly dependent on the supporting libraries & who implemented the feature. In the worst case, use renderfarm mode & nodes for each processor. Sometimes the libraries work in SMP mode & sometimes they don't. Sometimes the feature was intended for everyone to use on any number of processors & sometimes it was written for 1 person's cheap single processor.
Now I'm a bit curious.
Given that all of the "usual suspects" of encoding apps support SMP on almost every platform, and have done so for quite some time, what was this guy using that didn't support it?
ffmpeg and x264 are just about the only players in town these days.
-- If you try to fail and succeed, which have you done? - Uli's moose
Devede is a really good gui and adds a lot of functionality. And uses mencoder, which as of 1.0rc2 implements SMP quite nicely. I've been using it the last few days for family videos, on an AMD64 X2, and it is working flawlessly using both cores.
Hey, I know exactly what you mean - and what I said was directed at the parent, not to you... sorry if that came across wrong!
As posted elsewhere, it is difficult to divide a project up that is really pretty linear. Instead, you should try to do more jobs at once. Encode four videos at once.
Do you mean split a 100 minute movie into four 25 minute chunks, encode one chunk on each core, and concatenate them? Great idea.
Cheers. I also found these Acidrip patches. PS In case anyone missed it, I really meant to ask about the front end GUI/script tools rather than the engines. PPS I'm actually using Mandriva.
I get the impression that open source projects are a bit slow on the uptake here?
Open source isn't alone in this regard. Many closed source applications also lag behind. Obviously there are exceptions but many apps just haven't caught up to multi-cores, whether that be just 2 (which is ancient tech by now) or 8 cores in a single system.
this nation, under God, shall have a new birth of freedom. -- Lincoln, Gettysburg Address
But Mac users have been living with SMP since 2001
Just for reference:
UNIX System V R4-MP 1993
Windows NT 1993
OS/2 2.11 1993
Linux 2.0 1996
I may have missed something, but in light of the article here: http://tech.slashdot.org/article.pl?sid=08/05/31/1633214, and the wealth of information being offered in this topic, if you are willing to re-make something like ffmpeg to take advantage of the processing capability of your video card you may achieve tremendous efficiency for your task. (My test blew up from mismanaging memory, but before it did it dynamically allocated 22 or 23 threads. The results were uncertain because the system crashed before logging the current status. This is just a concept-learning test written off the cuff in Java, so a real engineered system should be able to do something significant. I will probably get back to it when my current workload slows down.) I'm assuming that if you're doing video work you don't have a lame video card, but the video card should be mostly idle during the conversion process.
"The mind works quicker than you think!"
While I agree with you in principle... some algorithms are NOT SIMPLE. Yes, we can write great code that handles "single instance" things like fetch web pages or dump report. We can write "distributed" solution systems using threads - like chess simulations, raytracers and nuclear physics.
But if an algorithm has linear dependencies for forward state then threading it is much, much more difficult.
I said no... but I missed and it came out yes.
There are two main FOSS video converter apps you need to be aware of:
winff - great gui for FFMPEG, makes batch conversion simple. Soon to appear in Debian/Ubuntu repos. vlc also includes a more basic video conversion wizard gui, but can't compete with winff. win and lin versions available.
DeVeDe - Already mentioned here I see. Makes DVD and SVCD/VCD/*CD creation under linux simple. Again, another cross platform FOSS app.
Both these apps have SMP support, but only as good as their respective ffmpeg and mencoder backends.
The new QT4 version of KDEnlive is a total re-write of the app and is said to be SMP friendly but has yet to have a proper release.
Multi-threaded programming is getting to be like artificial intelligence. People flip out about how hard it is, and when you point out mundane, useful, easy kinds of multi-threading, they say, "Well, that's not really what I was talking about." If multi-threading only means scalable performance-critical code, or code with lots of fine-grained locking, or code written with no language or library support, then hell yeah, it's hard. Multi-threaded programming is full of hard problems, but you can get plenty of work done without ever facing up to them.
There is a MPlayer fork called MPlayerXP. The purpose of the fork is to make a multithreaded version of MPlayer.
http://mplayerxp.sourceforge.net/
Many of today's video codecs compress data by only storing the differences between frames. As such they do not lend themselves well to that type of splitting up.
But in practice, how much space are you going to lose by inserting only three extra keyframes into a 100 minute film? Look at the three keyframes that a four-core encoder would insert, then compare that to how many cuts in a film already need a keyframe after them. If you're worried that this will insert too many extra keyframes once encoding scales up to dozens of cores, you could just have one core finding cuts and the rest encoding each interval between cuts.
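The "one core finds cuts, the rest encode the intervals" idea could look roughly like this. Comparing per-frame average brightness is a crude stand-in for real scene detection, and the threshold and function names are assumptions for the example.

```python
def find_cuts(frame_means, threshold):
    """Return frame indexes where brightness jumps by more than `threshold`.

    frame_means holds one average-brightness value per frame; real cut
    detection is more elaborate than this.
    """
    return [i for i in range(1, len(frame_means))
            if abs(frame_means[i] - frame_means[i - 1]) > threshold]

def intervals_between_cuts(n_frames, cuts):
    """Turn cut positions into half-open (start, end) frame intervals."""
    bounds = [0] + list(cuts) + [n_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

if __name__ == "__main__":
    means = [10, 11, 10, 90, 91, 92, 20, 21]
    cuts = find_cuts(means, threshold=30)
    print(cuts, intervals_between_cuts(len(means), cuts))
```

Each interval starts where a keyframe would be needed anyway, so it can be handed to whichever core is free.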
Virtualdub is free, open source and is quite capable of running with several processors.
As for encoding, I'm not yet interested in x264 because of the weak processor (only a D805 dual core ) but I am using the XVID experimental build from Koepi.info (http://www.koepi.info) which has SMP support.
It maxes out my two cores and you can specify in the configuration how many threads it should use.
As for decoding, I'm using Media Player Classic Home Cinema, which has a DXVA codec built in (hardware decoding).
On 1080p videos that the video card can decode in hardware (ATI 4850), the CPU usage is about 10-12%. If it's not possible to decode in hardware, the decoding is passed to CoreAVC, which uses about 45% of CPU (and yes, it's SMP enabled).
He's not even right about Apple either, the first MP Mac was the PowerMac 9500, released in 1995 with the SMP option released in 1996.
Source
Menzoberranzan Networks
On a dual-CPU system, you will see 100% CPU usage on both when using dvdrip/transcode. I would love to see how it looks on a quad-core system.
I actually at first thought you meant open source video editing tools, at which point, I was going to say good luck.
I have had similar experiences trying to find multi-core 64 bit video encoders / converters / editors. The problem actually is not usually the application, but the codec. The codec has to be written to take advantage of multi-core systems and 64 bit extensions, not just the program using the codecs. I think XVid is one of the few codecs that actually has this ability, but if I remember right, the 64 bit code project is not as active as the main project and is usually several versions behind. I actually finally went back to the 32 bit codec as it was less buggy.
While using the 64-bit version of Vista, the workload seems to get balanced out between the cores as well. I am not sure if that is due to how the processor works, how the OS works, or how the software works, but all the video editing tools I have used seem to balance the workload over both cores rather well. This is true when using Adobe Premiere, Canopus Procoder, XMpeg, and other encoders / decoders. So while it may not be OPTIMIZED for it, it certainly does take advantage of it. Of course, your mileage may vary.
Converting complex serial algorithms to effectively utilize multiple cores is generally not simple.
Fortunately, most algorithms that end-users care about don't fall into that category.
Video encoding certainly doesn't fall into that category. It is almost trivial to split a video up into sections of length (total length/number of cores), and then concatenate the encoded sections after you're done. Transcoding is even easier. Core 0 decodes, Core 1 encodes.... Etc.
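The "Core 0 decodes, Core 1 encodes" arrangement is a two-stage producer/consumer pipeline. A minimal sketch with a bounded queue between the stages; decode and encode are placeholder callables, not real codec calls, and which core each thread lands on is left to the OS scheduler.

```python
import queue
import threading

_DONE = object()  # sentinel telling the encoder that no more frames are coming

def run_pipeline(packets, decode, encode, depth=8):
    """Decode in one thread and encode in another, linked by a bounded queue."""
    frames = queue.Queue(maxsize=depth)  # bounded so the decoder cannot run far ahead
    out = []

    def decoder():
        for p in packets:
            frames.put(decode(p))  # blocks when the encoder falls behind
        frames.put(_DONE)

    def encoder():
        while True:
            f = frames.get()
            if f is _DONE:
                return
            out.append(encode(f))

    threads = [threading.Thread(target=decoder), threading.Thread(target=encoder)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The pipeline only runs as fast as its slower stage, so this alone will not keep four cores evenly busy.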
Synchronization, scheduling, and data transport issues are largely the same as multi-threaded programming on a single core. The problems are well understood.
Multi-threaded programming is hard because most programmers don't understand the theory. They only learned about the tools.
Not at all, I literally mean four different movies. So four 100 minute movies, one for each core.
How many external DVD-ROM drives would one have to buy to make this effective? I don't think a typical tower PC case for the home market can hold four DVD-ROM drives.
SMP enabled H.264 decoder, works for me! @3GHZ I can play most anything no problem. I do tend to transcode my BD and HD-DVD rips though to save space.
I think it kind of misses the point of SMP though ( starting two instances I mean ). Isn't it supposed to be transparent to the end user?