NVIDIA Shaking Up the Parallel Programming World
An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding migratory uses into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a gaming engine and there must be coordination between the location of the characters in the 3D world, coupled to their movements, coupled to the audio. All of that has to be synchronized. What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"
where's the MIT CS guys when you need them?
Wow, I bet nobody on slashdot knew that!
The article sums up the hurdles of parallel programming and says that NVIDIA's CUDA is doing something to solve them, but it doesn't say what. Even the short Wikipedia entry at http://en.wikipedia.org/wiki/CUDA tells more about it.
The topic is rather interesting, especially for game developers, among whom I sometimes lurk, but what's the point of simplifying descriptions and problems to the point of being meaningless and useless?
This is just hype. It is well known that for real high-performance applications CUDA is compute-bound, i.e. a lot of bandwidth is wasted. CUDA is just another platform for niche applications, never to compete with commodity processors.
I wonder what the hell I've been doing when I needed multiple threads to have a consistent view of state.
Oh wait, it's called SYNCHRONISATION.
I'm sure it's exciting to the OP and all, but hell, this is basic CS101 shit.
So make it all synchronize to the lowest fps, the video of course. We are talking about one game object after all.
In a real application, the audio/video must be calculated for many objects, with video at a fixed 30 or 60 fps and audio at a fixed sample rate, perhaps CD quality at 44100 samples per second but likely less.
This synchronization is not an unsolved problem. Every slice of game time has a budget: the $SampleRate frames of audio in the slice, divided among the game objects producing audio, and the triangles to draw versus the number of triangles possible.
You take the lowest amount possible per slice of game time (here a second) and call that the target amount. You don't put more game objects in your environment at a time than you have resources for; the AI can know this secret detail and not show up.
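The budgeting described above comes down to a little arithmetic. Here is a rough sketch in Python; the function name `slice_budget`, the triangle throughput and the object count are made-up values for illustration, and only the 44100 Hz and 60 fps figures come from the comment.

```python
def slice_budget(sample_rate_hz, triangles_per_second, slices_per_second, n_objects):
    """Split one time slice's audio samples and triangles evenly across game objects.

    All inputs are hypothetical; integer division keeps the budget conservative.
    """
    samples = sample_rate_hz // slices_per_second
    triangles = triangles_per_second // slices_per_second
    return {
        "samples_per_slice": samples,
        "samples_per_object": samples // n_objects,
        "triangles_per_object": triangles // n_objects,
    }


# 44100 Hz audio, an assumed 6 million triangles/s, 60 slices per second, 20 objects.
budget = slice_budget(44100, 6_000_000, 60, 20)
print(budget)
```

With these assumed numbers each slice holds 735 audio samples, so each of the 20 objects gets 36 samples and 5000 triangles per slice.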
How does having more than one processor core ever help a game object be expressed better? There is only AI left. Use the extra cores to display more objects, give the objects better AI, or give better sounding/looking objects.
Many moons ago, when most slashdotters were nippers, a British company named INMOS provided an extensible hardware and software platform that solved the problem of parallelism, in many ways similar to CUDA.
Ironically, some of the first demos I saw using transputers were raytracing demos.
The problem of parallelism and the solutions available are quite old (more than 20 years), but it's only now that limits are being reached that we see the true need for it. But the true pioneers are not NVIDIA, because there were others long before them.
> The concept of writing individual programs which run on multiple cores is called multi-threading
What the hell has happened to you, dear slashdot? I told you visual basic was bad for your brain...
When I came up through my CS degree, object-oriented programming was new. Programming was largely a series of sequentially ordered instructions. I haven't programmed in many years now, but if I wanted to write a parallel program I would not have a clue.
But why should I?
What is needed are new, high-level programming languages that figure out how to take a set of instructions and best interface with the available processing hardware on their own. This is where the computer smarts need to be focused today, IMO.
All computer programming languages, and even just plain applications, are abstractions from the computer hardware. What is needed are more robust abstractions to make programming for multiple processors (or cores) easier and more intuitive.
Like I was saying in another post, since everything per game object must be synchronized to the slowest procedure (video rendering of the object), the way to not waste CPU cycles is to spend them on AI.
In essence, the faster your CPU (fixed on consoles), the more time you can devote to making your game objects smarter after you're done with the audio and visuals.
This tells me nothing. Why would you want a game (Common single threaded-programmed application) to compete with your divx compression and ray tracing bryce3d application running in the background? Are they (Intel, AMD, IBM) all saying that we need to hook up 8 or 12 or 24 processor cores at 3ghz each to get an actual speed of 4ghz while each one waits around wasting processing cycles to get something to do? That is the lamest thing I've heard in a long time. I'd much rather have a SINGLE CORE Graphene processor at 12Ghz, than quadcore or oct-core at 4ghz.
"News for Nerds, Stuff that matters".
But not if posted by The Ignorant.
What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.
If a student of mine wrote this, a Fail would be the immediate consequence. How can 400 fps be 'only'? And why is threading bad, if the character movement is ready after 1/400 second? There is not 'a lot of waiting'; instead, there are a lot of cycles left to calculate something else. And 'waiting' is not 'synchronisation'.
[The audio-rate of 7000 fps gave the author away; and I stopped reading. Audio does not come in fps.]
While we all agree on the problem of synchronisation in parallel programming, and maybe especially in the gaming world, we should not allow uninformed blurb on Slashdot.
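To make the distinction concrete, here is a minimal sketch in Python of a fast simulation thread and a slower render thread sharing state. The `World` class, the step counts and the `y == 2 * x` invariant are invented for illustration. Neither thread waits for the other to "catch up"; the lock only guarantees that the renderer never sees a half-updated state.

```python
import threading


class World:
    """Hypothetical shared game state with an invariant: y == 2 * x."""

    def __init__(self):
        self.lock = threading.Lock()
        self.x = 0
        self.y = 0

    def step(self):
        # Update both fields atomically so readers never see x and y out of step.
        with self.lock:
            self.x += 1
            self.y = 2 * self.x

    def snapshot(self):
        # Take a consistent copy of the state for rendering.
        with self.lock:
            return (self.x, self.y)


def run(sim_steps=4000, frames=60):
    world = World()
    snapshots = []

    def simulate():
        for _ in range(sim_steps):
            world.step()

    def render():
        for _ in range(frames):
            snapshots.append(world.snapshot())

    threads = [threading.Thread(target=simulate), threading.Thread(target=render)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return world.snapshot(), snapshots


final, snapshots = run()
print(final, len(snapshots))
```

Which simulation step each frame happens to observe depends on scheduling, but every snapshot satisfies the invariant, and that is what synchronisation buys here.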
CUDA helps by moving more work to the GPU - where the biggest bottleneck is.
:-(
Um, no, that can't be right...
Oh and could you figure out some way to timestamp FPS game captures for the upcoming olympic video games?
Except for being somewhat more cumbersome to program and less parallel than previous hardware, there is nothing really new about the nVidia parallel programming model. And their graphics-oriented approach means that their view of parallelism is somewhat narrow.
Maybe nVidia will popularize parallel programming, maybe not. But I don't see any "shake up" or breakthroughs there.
Why not start again with a massively parallel GPU, skipping all the years of catchup that will be necessary with multi-core cpu's. Make an OS for your chips...
Back in the early 80's I was working in Bristol UK for TDI (who were the UCSD p-system licensees) porting it to various machines... Well, we had one customer who wanted a VAX p-system so we trotted off to INMOS's office and sat around in the computer room. (VAX 11/780 I think). At the time they were running Transputer simulations on the machine so the VAX p-system took er... about 30 *minutes* to start. Just for comparison an Apple ][ running IV.x would take less than a minute. Almost an hour to make a tape. (About 15 users running emulation I think). Fond memories of the transputer. Almost bought a kit to play with it... Andy
CUDA is an interesting way to utilize NVIDIA's graphics hardware for tasks it wasn't really designed for, but it's not a solution to parallel computing in and of itself. (more on that momentarily) A few people have gotten their nice high end Quadros to do some pretty cool stuff, but to date it's been limited primarily to relatively minor academic purposes. I don't see CUDA becoming big in gaming circles anytime soon. Let's face it, most gamers buy *one* reasonably good video card and leave it at that. Your video card has better things to do than handle audio or physics when your multi-core CPU is probably being criminally underutilized. Nvidia, of course, wants people to buy wimpy CPU's and then load up on massive SLI rigs and then do all their multi-purpose computation in CUDA. Not gonna happen.
First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles. Second, nobody wants to tie themselves to CUDA when it's built entirely for proprietary hardware. Third, CUDA still has a *lot* of limitations. It's not as easy to develop a physics engine for a GPU using CUDA as it is for a general purpose CPU.
Now, I haven't used CUDA lately, so I could be way off base here. However, multi-threading isn't the real challenge to efficient use of resources in a parallel computing environment. It's designing your algorithms to be able to run in parallel in the first place. Most multi-threaded software out there still has threads that have to run on a single CPU, and the entire package bottlenecks on the single CPU running that thread even if other threads are free to run on other processors. This sort of bottleneck can only be avoided at the algorithm level. This isn't something CUDA is going to fix.
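The bottleneck described here is usually quantified with Amdahl's law, which the comment does not name but which matches the argument: if a fraction of the work must run serially, that fraction caps the speedup no matter how many cores are added. A small sketch, with arbitrary example fractions:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


# Illustrative numbers only: a program that is 25% serial.
for cores in (2, 4, 8, 1024):
    print(cores, round(amdahl_speedup(0.25, cores), 2))
```

With a 25% serial portion the speedup can never reach 4x, however many cores are available, which is why the fix has to happen at the algorithm level.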
Now, I can certainly see why NVIDIA is playing up CUDA for all they're worth. Video game graphics rendering could be on the cusp of a technological singularity. Namely, ray tracing. Ray tracing is becoming feasible to do in real time. It's a stretch at present, but time will change that. Ray tracing is a significant step forward in terms of visual quality, but it also makes coding a lot of other things relatively easy. Valve's recent "Portal" required some rather convoluted hacks to render the portals with acceptable performance, but in a ray tracing engine those same portals only take a couple lines of code to implement and have no impact on performance. Another advantage of ray tracing is that it's dead simple to parallelize. While current approaches to video game graphics are going to get more and more difficult to work with as parallel processing rises, ray tracing will remain simple.
The real question is whether NVIDIA is poised to do ray-tracing better than Intel in the next few product cycles. Intel is hip to all of the above, and they can smell blood in the water. If they can beef up the floating point performance of their processors then dedicated graphics cards may soon become completely unnecessary. NVIDIA is under the axe and they know it, which might explain all the recent anti-Intel smack-talk. Still, it remains to be seen who can actually walk the walk.
The original Inmos Transputer was designed to solve such problems and relied on fast inter-processor links, and the AMD Hypertransport bus is a modern derivative.
So I disagree with you. The processing hardware is not so much the problem. If GPUs are small, cheap and address lots of memory, so long as they have the necessary instruction sets they will do the job. The issue to focus on is still interprocessor (and hence interprocess) links. This is how hardware affects parallelism.
I have on and off worked with multiprocessor systems since the early 80s, and always it has been fastest and most effective to rely on data channels rather than horrible kludges like shared memory with mutex locks. The code can be made clean and can be tested in a wide range of environments. I am probably too near retirement now to work seriously with Erlang, but it looks like a sound platform.
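As a rough illustration of the channel style (not Erlang or occam, just Python's standard `queue` module standing in for a data channel), here is a sketch in which two workers communicate only by passing messages. The stage names and the `None` end-of-stream sentinel are conventions made up for this example.

```python
import queue
import threading


def producer(out_ch, items):
    # Send every item down the channel, then a sentinel meaning "no more data".
    for item in items:
        out_ch.put(item)
    out_ch.put(None)


def squarer(in_ch, out_ch):
    # Receive, transform, forward; no shared variables are touched.
    while True:
        item = in_ch.get()
        if item is None:
            out_ch.put(None)
            return
        out_ch.put(item * item)


def run_pipeline(items):
    a, b = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=producer, args=(a, items)),
        threading.Thread(target=squarer, args=(a, b)),
    ]
    for w in workers:
        w.start()
    results = []
    while True:
        item = b.get()
        if item is None:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


results = run_pipeline([1, 2, 3, 4])
print(results)
```

The queue does its own locking internally, but the application code contains no mutexes, and each stage can be tested on its own by feeding it a queue.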
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
NVidia is one of the major voices in the Khronos Group, the organization that promised to release the OpenGL 3.0 API over six months ago. The delay is embarrassing, and many are turning to DirectX.
It occurs to me that NVidia may not want OpenGL to succeed. Maybe they're holding up OpenGL development to give CUDA a place in the sun. Does anyone else get the same impression?
From my experience, CUDA was much harder to take advantage of than multi-core programming. CUDA requires you to use a specific model of programming that can make it difficult to take advantage of the full hardware. The restricted caching scheme makes memory management a pain, and the global synchronization mechanism is very crude - there's a barrier after each kernel execution, and that's it. It took me a week to port (and 'parallelize') some simple code I had written to CUDA, whereas it took me an hour or so to add the OpenMP statements to my 'reference' CPU code. Sorry Nvidia - there is no silver bullet. By making some parts of parallel programming easy, you make others hard or impossible.
Why was that headline greenlit? Next we'll be like the NY Times and have to avoid those confusing acronyms and spell out Central Processing Unit and Redundant Array of Inexpensive Disk [drives].
That in a nutshell is why I suggested that investment in Erlang would be a good idea. It's better to start with the right approach and optimise it, than go off into computer science blue sky and try to design a perfect language for paralleling GPUs - which practically nobody will ever really use.
Nvidia unleashes Cuda attack on parallel-compute challenge
Avoid the blog spam. This is the actual article in EE times: Nvidia unleashes Cuda attack on parallel-compute challenge.
Nvidia is showing signs of being poorly managed. CUDA is a registered trademark of another hi-tech company.
The underlying issue is apparently that Nvidia will lose most of its mid-level business when AMD/ATI and Intel/Larrabee begin shipping integrated graphics. Until now, Intel integrated graphics has been so limited as to be useless in many mid-level applications. Nvidia hopes to replace some of that loss with sales to people who want to use their GPUs to do parallel processing.
Multi-threaded programming is a fundamentally hard problem, as is the more general issue of maximally-efficient scheduling of any dynamic resource. No one idea, tool or company is going to "solve" it. What will happen is that lots of individual ideas, approaches, tools and companies will individually address little parts of the problem, making it incrementally easier to produce efficient multi-threaded code. Some of these approaches will work together, others will be in opposition, there will be engineering tradeoffs to be made (particularly between efficiency of execution and ease of development) and the incremental improvements will not so much make it easier to do multi-threaded programming as make it feasible to attack more complex problems.
Pretty much just like the history of every other part of software development.
I live in Minnesota, home of the legendary Cray Research. I've met with several old timers that developed the technologies that made the Cray Supercomputer what it was. Hearing about the problems that multi-core developers are facing today reminds me of the stories I heard about how the engineers would have to build massive cable runs from processor board to processor board to memory board just to synchronize the clocks and operations so that when the memory was ready to read or write data, it could tell the processor board... half a room away.
As I recall:
The processor, as it was sending the data to the bus, would have to tell the memory to get ready to read data through these cables. The "cables hack" was necessary because the cable path was shorter than the data bus path, and the memory would get the signal just a few mS before the data arrived at the bus.
These were fun stories to hear but now seeing what development challenges we face in parallel programming multi-core processors gives me a whole new appreciation for those old timers. These are old problems that have been dealt with before, just not on this scale. I guess it is true what they say, history always repeats itself.
I always wondered why the parallelization problem couldn't be solved using a concept similar to TCP/IP. If you think of CPU instructions like a "packet" and assign them a sequence number, then the CPU can keep track of what order the instruction results should come out. In addition to L1 and L2 cache, there should be a cache to hold the results of CPU instructions until they can be streamed out in the correct order. In essence, it would look like a single-core CPU to the outside world, but using buffers and sequencing tricks, perform in parallel.
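The sequence-number idea can be sketched in a few lines. This is a toy model in Python, with the class name `ReorderBuffer` and the sample data chosen for illustration; it is similar in spirit to the reorder buffers that out-of-order processors use to retire instructions in program order.

```python
class ReorderBuffer:
    """Hold results that finish out of order and release them in sequence order."""

    def __init__(self):
        self.next_seq = 0   # sequence number of the next result allowed out
        self.pending = {}   # finished results still waiting for earlier ones

    def complete(self, seq, result):
        """Record a finished result and return every result that can now be released."""
        self.pending[seq] = result
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


rob = ReorderBuffer()
out = []
# Results arrive in the order 2, 0, 1, 3 but must leave in the order 0, 1, 2, 3.
for seq, result in [(2, "c"), (0, "a"), (1, "b"), (3, "d")]:
    out.extend(rob.complete(seq, result))
print(out)
```

The buffer restores the output order, but it does nothing about instructions that depend on each other's results, and that dependency is the part that limits how much can actually run in parallel.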
The entire reason why CUDA works and is powerful is exactly because it is limited. Nvidia knows that there is no silver bullet. They're not claiming that this is one (David Kirk has said so himself at conferences). CUDA is a fairly elegant way of mapping embarrassingly data parallel programs to a large array of single precision FP units. If your problem fits into the model, the performance you get via CUDA will smoke just about anything else (except maybe an FPGA in some scenarios).
Your notion about particular models making some parts of parallel programming easy while other parts are hard is what people really need to learn to accept about parallel programming. If you're expecting a single model to make everything easy for you, trust me, stop programming right now.
You need to pick the programming model that matches the parallelism in your application; there will never be one solution. When sitting down to write code, you have to ask yourself: what is the right model for this algorithm? Is it:
Data parallel (SIMD, Vector)
Message Passing
Actors
Dataflow
Transactional
Streaming (pipe and filter)
Sparse Graph
Etc...
There are many models out there, and many languages + hardware substrates for these models that will give you orders of magnitude speedup for parallel programs. The key is just to sit down, think about the problem, and pick the right one (or combination).
The real research focus in parallel programming should be to make a taxonomy of models and start coming up with a unified infrastructure to support intelligent selection of models, mixing and matching, and compilation.
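To show how differently two of the listed models shape the same code, here is a small Python sketch of the data-parallel style next to the streaming (pipe and filter) style. The function names are invented for this example, and a thread pool is used only to show the structure; CPython threads will not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    return 2 * x


def data_parallel(values, workers=4):
    # Data parallel: the same operation applied independently to every element.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, values))


def source(values):
    for v in values:
        yield v


def keep_even(stream):
    # Filter stage: passes along only the even values.
    for v in stream:
        if v % 2 == 0:
            yield v


def doubled(stream):
    # Transform stage: consumes one item at a time from the upstream stage.
    for v in stream:
        yield scale(v)


def pipe_and_filter(values):
    # Streaming: stages are chained, and items flow through one by one.
    return list(doubled(keep_even(source(values))))


dp = data_parallel([1, 2, 3, 4])
pf = pipe_and_filter([1, 2, 3, 4])
print(dp, pf)
```

In the first form the parallelism is across elements; in the second it is across stages, each of which could in principle run on its own core.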
BOINC is only a framework for organizing job-level massive parallelism. It's not an abstraction for parallelism at the application level: when you write a BOINC application, you don't get any parallelism for "free". It's still up to you as the application developer to target your app for a specific platform, let alone hardware, because BOINC simply hands off / manages execution of your application. The app developer must write for x86/Win, x86/Linux, x86/Mac, PPC/Mac, etc. Most critically, that means that you have the privilege and responsibility of exploiting the hardware (x86, amd64, PPC, PS3, etc.) yourself, specific to your application's needs, at the application level. BOINC will then handle job management and scheduling between your server and each instance of the client.
So your question is actually a bit ill-formed. Instead of asking "could we run a framework on the PS3?", which would provide no free parallelism, you probably meant to ask "could we run BOINC applications on the PS3?". The problem lies not in porting BOINC to PS3 but in having yet another platform which users (application authors such as SETI@H or Einstein@H) would need to target. Some (most...) of those guys are fairly small operations and stick to x86 hardware and often only Windows at that, at least for a while until they get Mac and Linux clients working alongside.
The Folding@home operation is well-organized and has more resources than most, and they don't run on BOINC. They're the ones who have a PS3 client (which is much tougher to write than an x86 client to exploit the given hardware), and who even support a handful of ATI's recent but disparate GPUs (Windows only I believe). It's not that BOINC on PS3 (or whatever) is impossible; it's that it gains the application developer nothing without a LOT more effort. The question of whether or not it's worth that effort falls to the user and not the authors of BOINC.
...NOT want to succeed. Microsoft does NOT want OpenGL to succeed.
But you knew that already.
There is a language that makes programming in threads much easier. See erlang.org.
... programs are still only as fast as their slowest link.
Use Ada.
No offence, but I'm perplexed as to how this rubbish made it past the firehose.
The real problem is not parallelizing applications. It is staging the data so your CPU isn't spending all of its time waiting for data.
GPUs deal with this by having so many threads that they can afford to swap in other threads when they are waiting for data, but this has a *huge* overhead in terms of thread state storage, and generates "flocking" effects in the caches.
The real reason CUDA isn't the revolution NVIDIA wants you to think it is is simply that the GPUs from NVIDIA only work well if the threads all do the same thing. (At least in batches of 16.) If you run 1024 different threads on a GPU you get about 1/16th the performance, and now you're not looking much better than Intel, and a lot worse than Cell...
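The "1/16th the performance" figure follows from a simple cost model, sketched here in Python. It assumes, as the comment does, that threads run in lockstep batches of 16 and that a batch must execute each distinct code path its threads take one after another; real hardware is more complicated, and the function name and thread counts are illustrative only.

```python
def lockstep_passes(branch_per_thread, batch_size=16):
    """Count execution passes when each batch runs every distinct branch serially."""
    passes = 0
    for start in range(0, len(branch_per_thread), batch_size):
        batch = branch_per_thread[start:start + batch_size]
        passes += len(set(batch))  # one pass per distinct branch in the batch
    return passes


uniform = lockstep_passes([0] * 1024)            # every thread takes the same path
divergent = lockstep_passes(list(range(1024)))   # every thread takes its own path
print(uniform, divergent, divergent // uniform)
```

Under this toy model, 1024 threads doing the same thing need 64 passes, while 1024 threads all doing different things need 1024, a factor of 16.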
I am no Erlang expert, but isn't it supposed to be a language that is inherently parallel, thus allowing programs to "automatically" take advantage of multi-core systems?
When we need different systems to run thousands of cues for a show (lighting, pyrotechnics (with a deadman switch), special effects, audio, video projection and automated staging) we use SMPTE time code. IANACD (I am not a chipset developer), but if you could feed the multiple cores a time code pre-processor, then everything post-processor should sync up on cue in your various outputs.