Which Open Source Video Apps Use SMP Effectively?

Posted by kdawson on Wednesday July 23, 2008 @08:54AM from the on-the-one-core-on-the-other-core dept.

ydrol writes "After building my new Core 2 Quad Q6600 PC, I was ready to unleash video conversion activity the likes of which I had not seen before. However, I was disappointed to discover that a lot of the conversion tools either don't use SMP at all, or don't balance the workload evenly across processors, or require ugly hacks to use SMP (e.g. invoking distributed encoding options). I get the impression that open source projects are a bit slow on the uptake here? Which open source video conversion apps take full native advantage of SMP? (And before you ask, no, I don't want to pick up the code and add SMP support myself, thanks.)"

ffmpeg by bconway · 2008-07-23 08:55 · Score: 5, Informative

Use the -threads switch.
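
A minimal sketch of what that looks like when the call is scripted, assuming ffmpeg is on the PATH. The file names are placeholders and `build_ffmpeg_cmd` is a hypothetical helper, not part of ffmpeg. Whether `-threads` actually speeds things up depends on the codec in use.

```python
import os

def build_ffmpeg_cmd(src, dst, threads=None):
    """Return an ffmpeg argv list that asks the encoder for `threads` threads."""
    if threads is None:
        threads = os.cpu_count() or 1
    # -threads comes after -i so that it applies to the output (encoder) side.
    return ["ffmpeg", "-i", src, "-threads", str(threads), dst]

cmd = build_ffmpeg_cmd("input.avi", "output.mpg", threads=4)
print(" ".join(cmd))  # ffmpeg -i input.avi -threads 4 output.mpg
```

The list can be handed to `subprocess.run(cmd)` as-is.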

--
Interested in open source engine management for your Subaru?
1. Re:ffmpeg by morgan_greywolf · 2008-07-23 08:58 · Score: 5, Informative
  
  Similarly, mencoder supports threads=# where # is something between 1 and 8.
  
  --
  My blog
2. Re:ffmpeg by Albanach · 2008-07-23 09:03 · Score: 4, Insightful
  
  Or just convert 2 videos at once, or 4 for a quad core etc. They did suggest they have lots to convert, and it's a pretty easy way to get all available cores working hard.
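
A rough sketch of the "one conversion per core" idea, assuming ffmpeg is installed. `convert_one` and `convert_all` are made-up names for illustration, and the ffmpeg command is a bare placeholder with no encoding options.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def convert_one(src):
    """Convert one file with a single ffmpeg process (placeholder command)."""
    dst = os.path.splitext(src)[0] + ".mpg"
    return subprocess.run(["ffmpeg", "-i", src, dst]).returncode

def convert_all(files, jobs=None, convert=convert_one):
    """Run `convert` on every file, at most `jobs` at a time."""
    jobs = jobs or os.cpu_count() or 1
    # Threads are fine here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(convert, files))
```

Calling `convert_all(["a.avi", "b.avi", "c.avi", "d.avi"], jobs=4)` would keep four single-threaded encodes running at once on a quad core.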
3. Re:ffmpeg by sp332 · 2008-07-23 09:19 · Score: 3, Informative
  
  And it may or may not be useful to actually run more than one thread per kernel. It depends on the encoder and application how many threads you shall run, so the best is to test with 1, 2 and 4 threads per kernel.
  Isn't that per-core, not per-kernel?
4. Re:ffmpeg by mweather · 2008-07-23 09:22 · Score: 5, Informative
  
  Apple computers ARE PCs. They coined the damn term.
5. Re:ffmpeg by m0rph3us0 · 2008-07-23 09:47 · Score: 4, Informative
  
  No, it doesn't. The only time you want to use multi-threading in a single-CPU environment is when asynchronous methods for IO are unavailable, or when the code would be too difficult to re-architect to use asynchronous IO. If the application is seriously IO-bound, threads can even make the situation worse by causing random IO patterns.
  Ideally, the number of threads a program uses should be no more than the number of processors available. Otherwise, you are wasting time context switching instead of processing.
6. Re:ffmpeg by m0rph3us0 · 2008-07-23 10:06 · Score: 4, Insightful
  
  On a two processor system this would result in multi-threading being off.
7. Re:ffmpeg by ydrol · 2008-07-23 10:26 · Score: 4, Informative
  
  Darn, I forgot a minor detail in my question. I was really asking about the various front-end apps (dvd::rip, k9copy, acidrip etc). I got the impression that none of them notice they are running on an SMP platform and pass the necessary switches to the backend by default.
  Some may argue this is a good thing, but for the time being SMP is the way forward for faster processing, as MHz has maxed out in consumer PCs. So when people start buying octo-core CPUs, they don't expect software to run at 1/8th speed by default.
  I was also being a bit lazy. I could have checked up on each app in turn, but I asked /. instead.
8. Re:ffmpeg by Tanktalus · 2008-07-23 10:29 · Score: 5, Interesting
  
  That sounds like a lot of work... I just used make:
  
  %.mpg: %.avi
  	tovid -ntsc -dvd -noask -ffmpeg -in "$<" -out "$(basename $@)"
  
  all: $(subst .avi,.mpg,$(wildcard */*.avi))
  
  Then I just ran "make -j4". All four processors working like mad, with a minimum of effort.
  (You may need to change the wildcard for your own scenario.)
9. Re:ffmpeg by Anonymous Coward · 2008-07-23 11:11 · Score: 3, Informative
  
  If thread 1 is doing work while thread 2 is blocked (io, semaphores, etc), then multithreading will be faster.
10. Re:ffmpeg by VGPowerlord · 2008-07-23 11:16 · Score: 4, Informative
  
  True, but in most contexts, "PC" is the shortened form of IBM-compatible PC (which is really outdated), and usually just stands for Windows these days.
  
  --
  GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
11. Re:ffmpeg by maglor_83 · 2008-07-23 11:35 · Score: 5, Funny
  
  On a single core system this would result in not being able to run anything!
12. Re:ffmpeg by hedwards · 2008-07-23 12:08 · Score: 5, Insightful
  
  Apple has spent a lot of time and money convincing everybody that they don't sell PCs, they sell Macs. I'm not sure what the point of arguing with both the general public as well as Apple is.
  At this point, the term PC does not include Apple computers. It's a change to the definition which happens when the vast majority of people decide amongst themselves that the definition should change.
  In terms of the topic at hand, most video apps really should be capable of using multiple cores; tasks of this sort are quite easy to finish in parallel, either by doing every n frames or by subdividing the image into a number of regions which can be completed separately and joined at the end before writing the frame to disk.
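
The "subdivide the image into regions" approach can be sketched roughly as below. This is only an illustration of the decomposition: the frame is a plain list of pixel rows, `process_strip` is a stand-in for real work, and in CPython pure-Python threads will not actually run faster because of the GIL (a real encoder would do this in native code).

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(frame, parts):
    """Split a frame (list of pixel rows) into up to `parts` horizontal strips."""
    step = max(1, -(-len(frame) // parts))  # ceiling division
    return [frame[i:i + step] for i in range(0, len(frame), step)]

def process_strip(strip):
    """Stand-in for real per-region work: invert 8-bit pixel values."""
    return [[255 - p for p in row] for row in strip]

def process_frame(frame, parts=4):
    """Process each strip separately, then join them in the original order."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        strips = list(pool.map(process_strip, split_rows(frame, parts)))
    return [row for strip in strips for row in strip]
```

This only works so simply when regions do not depend on each other; real codecs have to deal with prediction across region boundaries.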
13. Re:ffmpeg by 3vi1 · 2008-07-23 12:30 · Score: 5, Insightful
  
  No - HP did (for their calculators), way before there "was" an Apple.
  Also, I don't even think Apple marketing would agree with you - or they wouldn't have "I'm a Mac... and I'm a PC" adverts.
14. Re:ffmpeg by networkBoy · 2008-07-23 12:57 · Score: 3, Informative
  
  I hit I/O throttling when I do the following:
  * rip 2 dvds (two DVDR Drives)
  * transcoding previous DVD rips to XVID
  * Moving completed rips to server over 1 Gbps Ethernet link.
  At this point I can see CPU load start to drop as PCI bus I/O saturates.
  At no point do I hit disk I/O or memory limits.
  Disks are non-RAID non-striped, but rips are to separate disks (thus DVDA rips to HDA DVDB to HDB) and server upload pulls from whatever disk is not currently transcoding (transcode file on HDA, when done start transcode on HDB and move file from HDA).
  -nB
  
  --
  whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
15. Re:ffmpeg by MadnessASAP · 2008-07-23 14:07 · Score: 4, Informative
  
  If I may offer a suggestion: I'm not too sure what your setup is, but on mine I have 2 DVD drives, each on a separate IDE bus, and 2 SATA drives (also on separate buses). I rip from the DVD to drive 1 and encode from drive 1 to drive 2. Of course it all depends on a variety of factors, but that arrangement certainly helped.
  
  --
  I may agree with what you say, but I will defend to the death your right to face the consequences of saying it.
16. Re:ffmpeg by Nikker · 2008-07-23 14:18 · Score: 4, Informative
  
  Running multiple cores with an IDE interface is going to kill you regardless, because you are only encoding in memory, not really storing much there. Basically you have a cap of about 40MB/s for anything larger than about 40MB.
  
  --
  A loop, by its nature, continues. If that didn't make sense, start reading this sentence again.
17. Re:ffmpeg by ksheff · 2008-07-23 15:21 · Score: 5, Informative
  
  That's the point. If the xvid encoder is single threaded, then to keep all the cores busy, one must run multiple instances of ffmpeg with each one encoding a different file. For the given Makefile, that is what make will do when the -j switch is used.
  
  --
  the good ground has been paved over by suicidal maniacs
18. Re:ffmpeg by QuoteMstr · 2008-07-23 17:21 · Score: 3, Informative
  
  You're still missing the OP's point. Let me spell it out for you:
  Say you have four videos to encode, and four cores.
  1) You can either use one core at a time and encode one video at a time. Let's say that takes time T.
  2) You can encode one video at a time, but use all four cores while doing it. Your total time is T/4.
  3) You can encode four videos at a time, one on each core. Your total time is T/4.
  The OP was advocating strategy #3. It's a fine approach.
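
The three strategies can be put into a small idealised model. This is a back-of-the-envelope sketch: it assumes perfect scaling for strategy 2 and ignores I/O, and `wall_time` is a made-up helper.

```python
import math

def wall_time(n_videos, cores, per_video, strategy):
    """Idealised wall-clock time; ignores I/O and threading overhead."""
    if strategy == 1:   # one video at a time, on one core
        return n_videos * per_video
    if strategy == 2:   # one video at a time, all cores, perfect scaling
        return n_videos * per_video / cores
    if strategy == 3:   # one single-threaded encode per core
        return math.ceil(n_videos / cores) * per_video
    raise ValueError("strategy must be 1, 2 or 3")

# Four videos, four cores, 10 minutes per video on one core.
print([wall_time(4, 4, 10.0, s) for s in (1, 2, 3)])  # [40.0, 10.0, 10.0]
```

Under this model the two parallel strategies only tie when the number of videos is a multiple of the core count: with five videos, strategy 2 gives 12.5 and strategy 3 gives 20.0.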
transcode, of course! by morgan_greywolf · 2008-07-23 08:55 · Score: 5, Informative

transcode uses separate processes for everything.

--
My blog
x264 by Anonymous Coward · 2008-07-23 08:58 · Score: 3, Insightful

x264 uses slices and scales pretty well across multiple cores. I use it on Windows via megui, but you could easily use it in Linux as well. You could use mencoder to pipe out raw video to a fifo and use x264 to do the actual conversion, for instance.
VisualHub... by e4g4 · 2008-07-23 08:59 · Score: 3, Informative

...makes excellent use of multiple cores. It is however Mac-only. Interestingly, what it does is split a file into chunks and spawns multiple ffmpeg processes to do the conversion. Which is to say, perhaps you can do some (relatively simple) scripting with ffmpeg that will do the job.
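
A sketch of the scripting side of that idea, under stated assumptions: how VisualHub actually cuts and rejoins files is not described here, so this just plans equal-length time ranges using ffmpeg's `-ss` (seek) and `-t` (duration) options. `chunk_commands` and the output naming are hypothetical, and joining the chunks afterwards is left out.

```python
def chunk_commands(src, duration, chunks):
    """Return one ffmpeg argv per chunk, covering [0, duration) seconds."""
    length = duration / chunks
    cmds = []
    for i in range(chunks):
        start = i * length
        cmds.append([
            "ffmpeg",
            "-ss", f"{start:.3f}",   # seek to the start of this chunk
            "-t", f"{length:.3f}",   # read only this chunk's duration
            "-i", src,
            f"chunk_{i:03d}.mpg",    # hypothetical output name
        ])
    return cmds

for cmd in chunk_commands("input.avi", duration=120, chunks=4):
    print(" ".join(cmd))
```

Each command can then be run as its own process, one per core.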

--
The secret to creativity is knowing how to hide your sources. - Albert Einstein
Beat me to it! by BLKMGK · 2008-07-23 09:05 · Score: 4, Informative

x264 via meGUI from Doom9 is what I use to compress HD-DVD and BD movies - also on a quad core. I have some tutorials posted out and about on how I'm doing it. Near as I can tell you cannot dupe the process on Linux due to the crypto - Slysoft's AnyDVD-HD is needed.
Playback - I use XBMC for Linux. It is also SMP enabled using the ffmpeg cabac patch. The developers of this project have been VERY aggressive at taking cutting edge improvements to the likes of ffmpeg and incorporating them into the code. Since Linux has no video acceleration of H.264, SMP really helps on high bitrate video!

--
Build it, Drive it, Improve it! Hybridz.org
Load balancing: Why? by DigitAl56K · 2008-07-23 09:09 · Score: 4, Insightful

don't balance the workload evenly across processors
Why is balancing the load evenly important, as long as one thread is not bottlenecking the others? Loading a particular core or set of cores might even be beneficial depending on the cache implementation, especially when other applications are also contending for CPU time.
Sure, a nice even load distribution might be an indicator for good design, but it doesn't have to apply in every case. I don't think software should be designed so you can be pleased with the aesthetics of the charts in task manager.
1. Re:Load balancing: Why? by DigitAl56K · 2008-07-23 09:34 · Score: 4, Insightful
  
  It's still possible to load all cores 100%.
  A video decoder that I'm working with, for example, currently uses only as many threads as necessary for real-time playback. So for example if one core can do the job only one core is used. If the decoder looks like it might start falling behind more threads are given work to do. Ultimately, if your system is failing to keep up all cores will be fully leveraged.
  However, so long as only some cores are required the others are 100% available to other processes, including their cache (if it's independent). I'm not sure how power management is implemented but perhaps it's even possible for the unused cores to do power saving, leading to longer battery life for laptops/notebooks, etc.
  
  the idea is to make maximal use of your available resources, right?
  No, the idea is to make the best use of your resources. I'm not trying to say that load balancing is wrong. I'm just saying that processes that don't appear to be balanced are not necessarily poorly designed or operating incorrectly.
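
The "only as many threads as necessary" policy described above might look something like this. The decoder's real heuristic isn't given, so the rule here (add one worker whenever a frame takes longer than its time budget) is purely an assumption, as is the name `next_thread_count`.

```python
def next_thread_count(current, decode_ms, frame_budget_ms, max_threads):
    """Grow the worker count only when decoding falls behind real time."""
    if decode_ms > frame_budget_ms and current < max_threads:
        return current + 1
    return current

# 25 fps playback leaves a 40 ms budget per frame.
print(next_thread_count(1, decode_ms=50, frame_budget_ms=40, max_threads=4))  # 2
```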
Handbrake by vfs · 2008-07-23 09:18 · Score: 5, Informative

Handbrake has always used both of the cores on my system for transcoding.
1. Re:Handbrake by catmistake · 2008-07-23 09:57 · Score: 4, Informative
  
  that's because Handbrake uses ffmpeg
  
  --
  The Admin and the Engineer
Re:Which part of Open Source didn't you get? by pushing-robot · 2008-07-23 09:20 · Score: 4, Informative

OP is asking for open source tools. You cited a commercial one that doesn't provide source.
VisualHub (the front-end app) may be closed, but ffmpeg is LGPL.
And the GP was suggesting using ffmpeg, not VisualHub.

--
How can I believe you when you tell me what I don't want to hear?
F(next) = F(current) + Delta(F(current:next)) by Lumenary7204 · 2008-07-23 09:21 · Score: 5, Insightful

The problem with MPEG encoding and decoding is that the data itself is not well suited to multi-threaded analysis.
Multi-threading is most efficient when it is applied to discrete data sets that have little or no dependency on each other.
For example, suppose I have a table with four columns -- three holding input values (A, B, and C) and one holding an output value (X). If the data in a given row of the table has nothing to do with the data in any other row, multi-threading works efficiently, because none of the threads are waiting for data from any of the other threads. If I want to process multiple rows at once, I simply spawn additional threads.
On the other hand, for data such as MPEG video, the composition of the next frame is equal to the composition of the current frame, plus some delta transformation - the changed pixels.
This introduces a dependency which precludes efficient multi-threaded processing, because each succeeding frame depends on the output of the calculations used to generate the prior frame. Even if more than one core is dedicated to processing the video stream, one core would wind up waiting on another, because the output from the first core would be used as the input to the second.
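
As a toy model of the F(next) = F(current) + Delta formula from the title (frames as flat lists of pixel values, deltas as per-pixel differences; real MPEG prediction is far more involved), the dependency chain looks like this. `apply_delta` and `reconstruct` are illustrative names.

```python
from itertools import accumulate

def apply_delta(frame, delta):
    """Next frame = current frame plus per-pixel changes."""
    return [p + d for p, d in zip(frame, delta)]

def reconstruct(first_frame, deltas):
    """Rebuild every frame in order; frame k cannot be built before frame k-1."""
    return list(accumulate(deltas, apply_delta, initial=first_frame))

print(reconstruct([1, 2], [[1, 0], [0, -1]]))  # [[1, 2], [2, 2], [2, 1]]
```

Each step consumes the previous step's output, which is the serial dependency being described.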
1. Re:F(next) = F(current) + Delta(F(current:next)) by Omega996 · 2008-07-23 09:36 · Score: 4, Insightful
  
  Theoretically, couldn't an encoder scan the data stream for keyframes, chunk the data from keyframe to the next keyframe, and then queue up the keyframe+delta information for multiple cores? That way, each core has something to do that isn't dependent upon the completion of something else.
  I'd think that n-1 cores/threads/whatever to process the chunked data, and the last core/thread/whatever to handle overhead and i/o scheduling would run pretty nicely on a multi-core machine.
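
The chunking step might be sketched like this, assuming the frame types are already known and that nothing references frames across a keyframe. `split_into_gops` is a made-up name.

```python
def split_into_gops(frame_types):
    """Group frame indices into chunks that each start at a keyframe ('I')."""
    gops = []
    for index, kind in enumerate(frame_types):
        if kind == "I" or not gops:
            gops.append([])   # a keyframe opens a new independent chunk
        gops[-1].append(index)
    return gops

print(split_into_gops("IPPBIPPIP"))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8]]
```

Each inner list could then be queued for a different core.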
2. Re:F(next) = F(current) + Delta(F(current:next)) by John+Betonschaar · 2008-07-23 09:41 · Score: 4, Insightful
  
  You could of course split each frame in slices, and process these in parallel. Or skip the video N frames between each core, with N being the number of frames between MPEG keyframes. Or have core 1 do the luma and core 2 and 3 the chroma channels. Or pipeline the whole thing and have core 1 do the DCT, core 2 the dequant etc. and have core 3 reconstruct the output reference frame while core 1 already starts the next frame.
  Plenty of ways to parallelize decoding, and even more for encoding...
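
The pipelining variant can be sketched with one worker thread per stage connected by queues. The stages here are arbitrary placeholder functions rather than a real DCT or dequantiser, `run_pipeline` is a hypothetical helper, and there is no error handling (a stage that raises would stall the pipeline).

```python
import queue
import threading

def run_pipeline(frames, stages):
    """Pass each frame through `stages`, one worker thread per stage."""
    done = object()  # sentinel marking the end of the stream
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is done:
                outbox.put(done)   # pass the end marker downstream
                return
            outbox.put(stage(item))

    threads = [threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(done)

    results = []
    while True:
        item = queues[-1].get()
        if item is done:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

While stage 2 works on frame 1, stage 1 can already start on frame 2; order is preserved because each stage has a single worker and the queues are FIFO.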
Re:Simple... by j00r0m4nc3r · 2008-07-23 09:21 · Score: 5, Informative

Running multiple instances of the same code concurrently in multiple threads is simple. Even running mutually exclusive parts of the same code concurrently in separate threads is easy. Converting complex serial algorithms to effectively utilize multiple cores is generally not simple. And writing code that can scale and balance across n cores/threads is extremely hard. There are all sorts of synchronization issues to deal with, scheduling issues, data transport issues, etc., and it becomes increasingly hard to debug code the more cores/threads you throw in. I think the stigma is justified.
keyframes by Anonymous Coward · 2008-07-23 09:29 · Score: 5, Informative

Actually, the MPEG stream resets itself every n frames or so (n is often a number like 8, but can vary depending on the video content). These are called keyframes (I-frames), and the delta frames (P- and B-frames) are generated against them. Because of this, it is really easy to apply parallel processing to video encoding.
1. Re:keyframes by DigitAl56K · 2008-07-23 10:14 · Score: 4, Informative
  
  Actually, the MPEG stream resets itself every n frames or so (n is often a number like 8, but can vary depending on the video content).
  That is not true for MPEG-4 unless you have specifically constrained the I/IDR interval to an extremely short interval, and doing so severely impacts the efficiency of the encoder because I-frames are extremely expensive compared to other types.
  Keyframes are usually inserted when temporal prediction fails for some percentage of blocks, or using some RD evaluation based on the cost of encoding the frame. Therefore unless the encoder has reached the maximum key interval the I frame position requires that motion estimation is performed, and thus you can't know in advance where to start a new GOP.
  In H.264, due to multiple references, you would certainly have issues to contend with, since long references might cross I-frame boundaries (which is why there is the distinction of "IDR" frames), so threading at the I-frame level would certainly not be possible.
  Granted, for MPEG1&2 encoders threading at keyframes is a possibility, although still not one I'd personally favor.
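
To illustrate why the keyframe position isn't known up front, here is a deliberately simplified decision rule of the kind described above. The threshold and maximum interval are illustrative values, not taken from any particular encoder, and `should_insert_keyframe` is a hypothetical function.

```python
def should_insert_keyframe(intra_blocks, total_blocks, frames_since_key,
                           intra_threshold=0.4, max_interval=250):
    """Decide on a keyframe after motion estimation has run on the frame."""
    if frames_since_key >= max_interval:
        return True   # forced keyframe: maximum key interval reached
    # Otherwise it depends on how many blocks temporal prediction failed for.
    return intra_blocks / total_blocks > intra_threshold

print(should_insert_keyframe(50, 100, frames_since_key=10))  # True
print(should_insert_keyframe(10, 100, frames_since_key=10))  # False
```

The `intra_blocks` input only exists once motion estimation has been performed on that frame, which is the ordering problem for threading at GOP boundaries.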
2. Re:keyframes by TwinkieStix · 2008-07-23 11:26 · Score: 3, Informative
  
  This may be true for sending entire frames to threads, but in MPEG-4, frames are broken up into chunks. Motion vectors are created that allow these chunks to move about the image from frame to frame. Other filters are used to remove blockiness, compress the image, do motion detection and macroblock detection, and do various other tasks. MPEG-4, especially H.264, can be easily multi-threaded:
  http://ietisy.oxfordjournals.org/cgi/content/abstract/E88-D/7/1623
  http://adsabs.harvard.edu/abs/2004SPIE.5308..384L
  http://www.electronicsweekly.com/Articles/2007/05/02/41296/aspex-targets-parallel-processor-at-blu-ray-dvd.htm
  When doing a two-pass encode, this is even easier because the keyframes are discovered on the first (faster) pass, so (if encoding couldn't already be threaded) it could be, by taking advantage of the known keyframe markers in at least the second pass. But that's not necessary. I use handbrake to create H.264 videos under Linux all the time on my dual core machine, and both processors stay between 80%-90% utilization from start to finish regardless of the number of passes.
Windows? VirtualDub 1.8.x + ffdshow-tryouts by tdelaney · 2008-07-23 09:44 · Score: 3, Informative

You don't say if you're running on Windows or Linux or something else. If you are running on Windows, the latest versions of VirtualDub have made big improvements to SMT/SMP encoding.
VirtualDub home
VirtualDub 1.8.1 announcement
VirtualDub downloads
Make sure you grab 1.8.3 - 1.8.1 was pretty good, but had a few teething problems. 1.8.2 has a major regression which is fixed in 1.8.3. The comments in the 1.8.1 announcement contain a few important tips for using the new features (some of which I posted BTW).
The two major new features that would be of interest to you are:
1. You can run all VirtualDub processing in one thread, and the codec in another. This works very well in conjunction with a multi-threaded codec - this one change improved my CPU utilisation from approx 75% to 95% on my dual-core machines - with an equivalent increase in encoding performance.
2. VD now has simple support for distributed encoding. You can use a shared queue across either multiple instances of VD on a single machine, or across multiple machines (must use UNC paths for multiple machines). Each instance of VD will pick the next job in the queue when it finishes its current job. Instances can be started in slave mode (in which case they will automatically start processing the queue).
I use 3 machines for encoding (all dual-core). With VD 1.8.x I start VD on two of the machines in slave mode, and one in master mode. I add jobs to the queue on the master instance, and the other two instances immediately pick up the new jobs and start encoding. When I've added all the jobs, I then start the master instance working on the job queue.
To achieve a similar effect on your quad-core, start two instances of VD on the same machine - one slave, the other master.
It's not perfect (if you've only got one job, you won't use your maximum capacity) but it has greatly simplified my transcoding tasks, and reduced the time to transcode large numbers of files.
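
The "each instance picks the next job when it finishes its current one" behaviour is a plain shared work queue. The sketch below models it with threads inside one process, which is an assumption made for brevity: VirtualDub itself uses separate instances, possibly on separate machines. `run_workers` and `encode` are hypothetical names.

```python
import queue
import threading

def run_workers(jobs, n_workers, encode):
    """Each worker takes the next job from the shared queue when it is free."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    finished = []            # (worker_id, job, result) tuples
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return       # queue drained: this worker is done
            result = encode(job)
            with lock:
                finished.append((worker_id, job, result))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished
```

With a single job in the queue only one worker ever gets work, which matches the limitation mentioned above.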
avidemux by Unit3 · 2008-07-23 09:46 · Score: 5, Informative

I've noticed a lot of talk about commandline options, but not the nice guis that use them. Avidemux is open source, cross-platform, gives you a decent interface, and uses multithreaded libraries like ffmpeg and x264 on the backend to do the encoding, so it generally makes optimal use of your multicore system.

--
-- sudo.ca
Re:Simple... by Cyrano+de+Maniac · 2008-07-23 11:20 · Score: 3, Insightful

Exactly. Too many people assume that any given programmer can write any given program. What isn't generally realized (at least by the masses) is that programming really is about acquiring expertise in a particular domain and then solving problems in that domain through the use of computer programs. Generally some of the most effective programs I've seen have been written, on their first pass, by a person with intimate domain knowledge, and mediocre programming/computer knowledge. The program then becomes a standout when someone with intense programming and computer architecture knowledge improves the code from there (they need not be a subject domain expert, but it helps).
I do take issue with sexconker assuming that I "just don't get it". Heh. If s/he only knew. Whatever, no biggie. I do agree that distributed algorithms are generally more difficult to implement/design than non-distributed, but that's not exactly the same thing as serial versus parallel algorithms (non-distributed generally involves access to data through a common address space, distributed doesn't, though even those pseudo-definitions come up a bit short).
Again and again I read in industry rags and on various web sites that multi-threaded programming is hard, and nobody knows how to do it, and that it's difficult to debug, and all that. I believe what they're really saying is "The set of programmers who are accustomed to multi-threaded programming/debugging is (relatively) small, and thus applications aren't going to make good use of the shift to multicore CPU packages." Familiarity with a skill, and the supply of labor familiar with said skill, is distinct from it being easy or hard.
Anyway, I stand by my belief that parallel programming is not as difficult as most people are led to believe. Some problems don't lend themselves well to parallel solutions, or don't merit the added complexity, but many many of them do. In ten years time I predict that most computer programming education will assume the use of threading, and that anyone who isn't competent with threading will severely limit their own job prospects.

--
Cyrano de Maniac
Not as simple as you would think by sjf · 2008-07-23 11:51 · Score: 4, Insightful

As other commenters have said, decoding video is not, per se, a trivially parallelized algorithm. Especially for modern codecs with lots of temporal encoding. MJPEG would be easily parallelized, but you'd have to be dealing with fairly ancient sources...MediaComposer 1 for instance.
However, there are different classes of "video app" that are good targets for parallelization. Real world video editing for instance: consider multiple streams of video with overlays, rotations, effects etc. Video and audio decoding can happen in parallel, you can pipeline the effects stages so that each effect is handed off to another core. Modern video editing systems do this with aplomb.
I'm from the commercial end of this so, I can't comment much on open source alternatives. But I will say that a lot of the algorithms in certain products are highly tuned to the particular CPU type.
And they're smart enough to distribute work across only as many cores as actually exist.
Finally, don't forget that optimization is hard. You have to consider the speed of the hard drive, the cost of sharing data between threads and CPU caches and a bunch of other real constraints. Any half decent CPU of the last five years or so can easily decode most video faster than it can be read and written to disk. So long as this is true, you won't get any benefit from parallelization.
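
A back-of-the-envelope model of that last point, assuming disk I/O and computation overlap perfectly and the compute part scales ideally across cores (both optimistic). `encode_wall_time` is a made-up helper and the numbers are invented.

```python
def encode_wall_time(cpu_seconds, io_seconds, cores):
    """Idealised estimate: the slower of disk I/O and parallelised compute wins."""
    return max(io_seconds, cpu_seconds / cores)

# I/O-bound job: extra cores change nothing.
print(encode_wall_time(60, 100, 1), encode_wall_time(60, 100, 4))    # 100 100
# CPU-bound job: extra cores help until the disk becomes the limit.
print(encode_wall_time(400, 100, 1), encode_wall_time(400, 100, 4))  # 400.0 100
```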
Re:Use Mac OS X... by Anonymous Coward · 2008-07-23 16:44 · Score: 3, Informative

But Mac users have been living with SMP since 2001

Just for reference:
UNIX System V R4-MP 1993
Windows NT 1993
OS/2 2.11 1993
Linux 2.0 1996